Unit 1
๐ฏ Unit 1 Overview
Unit 1 covers the basic concepts of data warehousing. It includes data warehouse architecture,
delivery process, data preprocessing, cleaning, integration, transformation, reduction,
data warehouse schema, partitioning, data marts, metadata and multidimensional data model.
Exam Tip: Data warehouse architecture, data preprocessing, data mart, metadata and multidimensional data model are very important for RGPV exams.
๐ข Data Warehousing
A data warehouse is a large centralized repository that stores historical data collected from
different sources. It is mainly used for analysis, reporting and decision making.
Simple Meaning
Data warehouse ek aisa storage system hota hai jisme organization ka old aur current data
analysis ke liye store kiya jata hai.
Characteristics
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
- Used for decision support
๐ Data Warehouse Delivery Process
Data warehouse delivery process describes how data is collected, cleaned, transformed and loaded
into the warehouse.
- Requirement analysis
- Data source identification
- Data extraction
- Data cleaning
- Data transformation
- Data loading
- Reporting and analysis
- Maintenance and updates
๐๏ธ Data Warehouse Architecture
Data warehouse architecture defines the structure of components used to collect, store and analyze data.
Main Components
- Data Sources: Operational databases, files, logs and external sources.
- ETL Tools: Extract, transform and load data.
- Data Warehouse Storage: Central storage for processed data.
- Metadata Repository: Stores information about data.
- OLAP Server: Supports multidimensional analysis.
- Reporting Tools: Used for reports, dashboards and decision making.
Diagram me flow likho: Data Sources โ ETL โ Data Warehouse โ OLAP/Reports โ Users.
๐งน Data Preprocessing
Data preprocessing is the process of preparing raw data for analysis. Raw data may contain errors,
missing values, duplicate values and inconsistent formats.
Major Steps
- Data cleaning
- Data integration
- Data transformation
- Data reduction
โ
Data Cleaning
Data cleaning removes errors and inconsistencies from data.
Tasks
- Handling missing values
- Removing duplicate records
- Correcting inconsistent data
- Removing noisy data
- Fixing incorrect formats
๐ Data Integration and Transformation
Data Integration
Data integration combines data from multiple sources into a single consistent view.
Data Transformation
Data transformation converts data into suitable format for storage and analysis.
| Process |
Purpose |
| Integration |
Combines data from different sources. |
| Transformation |
Converts data into required format. |
| Normalization |
Scales data into standard range. |
| Aggregation |
Summarizes data for analysis. |
๐ Data Reduction
Data reduction reduces the volume of data while maintaining useful information.
Techniques
- Dimensionality reduction
- Numerosity reduction
- Data compression
- Sampling
- Aggregation
๐งฑ Data Warehouse Design
Data warehouse design includes selecting schema, data model, storage structure and analysis requirements.
Important Design Points
- Identify business requirements
- Select subject areas
- Choose fact and dimension tables
- Design schema
- Plan ETL process
- Decide partitioning strategy
- Define security and backup policies
โญ Data Warehouse Schema
| Schema |
Description |
| Star Schema |
One fact table connected with multiple dimension tables. |
| Snowflake Schema |
Dimension tables are further normalized into sub-dimension tables. |
| Fact Constellation Schema |
Multiple fact tables share dimension tables. |
๐ฆ Partitioning Strategy
Partitioning means dividing large data into smaller parts for better performance and management.
Types
- Horizontal partitioning
- Vertical partitioning
- Range partitioning
- Hash partitioning
- List partitioning
Benefits
- Improves query performance
- Easy data management
- Faster backup and recovery
- Better scalability
๐ฌ Data Marts
A data mart is a smaller part of a data warehouse designed for a specific department or business area.
Examples
- Sales data mart
- Finance data mart
- Marketing data mart
- HR data mart
Data warehouse poori organization ke liye hota hai, Data mart kisi specific department ke liye hota hai.
๐งพ Metadata
Metadata means data about data. It describes the structure, source, meaning and usage of data.
Examples
- Table name
- Column name
- Data type
- Source of data
- Data owner
- Last updated date
๐ Multidimensional Data Model
A multidimensional data model represents data in the form of dimensions and measures.
It is used for OLAP and analytical processing.
Important Terms
- Fact: Numeric data to analyze, such as sales amount.
- Dimension: Perspective of analysis, such as time, product and location.
- Measure: Quantitative value used for analysis.
- Data Cube: Multidimensional representation of data.
๐งฉ Introduction to Pattern Warehousing
Pattern warehousing stores discovered patterns from data mining processes. These patterns help in
future analysis, decision making and knowledge discovery.
Uses
- Stores frequent patterns
- Supports knowledge discovery
- Helps in decision support
- Improves reuse of discovered knowledge
โ๏ธ Data Warehouse vs Data Mart
| Data Warehouse |
Data Mart |
| Used for entire organization. |
Used for specific department. |
| Large in size. |
Smaller in size. |
| Contains data from many areas. |
Contains focused data. |
| Complex to design. |
Easier to design. |
โญ Important Questions
- Define data warehouse and explain its characteristics.
- Explain data warehouse architecture with diagram.
- Explain data warehouse delivery process.
- Explain data preprocessing and its steps.
- Explain data cleaning, integration, transformation and reduction.
- Explain star schema, snowflake schema and fact constellation schema.
- What is partitioning strategy? Explain its benefits.
- Define data mart and metadata.
- Explain multidimensional data model.
- Differentiate between data warehouse and data mart.
๐ฅ Last Minute Revision
- Data warehouse stores historical data for analysis.
- ETL = Extract, Transform, Load.
- Data preprocessing prepares raw data for mining.
- Data cleaning removes errors and missing values.
- Star schema has one fact table and many dimension tables.
- Data mart is department-level warehouse.
- Metadata means data about data.
- Multidimensional model uses facts, dimensions and measures.