Unit 2: Data Preprocessing, Data Warehousing, and
OLAP (11 Hours)
Overview
Unit 2 introduces the foundational steps in data mining: data preprocessing, data
warehousing, and OLAP (Online Analytical Processing). These topics are crucial
because raw data is often messy, incomplete, or not structured for analysis, and data min-
ing requires high-quality, well-organized data to produce meaningful insights. Data pre-
processing ensures the data is clean and usable, data warehousing provides a centralized
system to store and manage large volumes of data, and OLAP enables multidimensional
analysis for decision-making. This unit, spanning 11 hours, covers techniques to clean
and transform data, design data warehouses, and perform analytical queries, preparing
students for advanced data mining tasks like those in Unit 4 (Mining Data Streams).
1 Data Preprocessing
1.1 What is Data Preprocessing?
• Definition: Data preprocessing involves cleaning, transforming, and organiz-
ing raw data into a suitable format for mining and analysis.
• Why It's Important: Raw data often contains noise, inconsistencies, missing
values, and irrelevant attributes, which can lead to inaccurate or misleading
results in data mining.
Example: A dataset of online sales might have missing customer ages,
duplicate entries, or inconsistent date formats (e.g., "01/02/2023" vs.
"2023-02-01"). Preprocessing fixes these issues before analysis.
1.2 Challenges Posed by Raw Data
• Heterogeneous Data Sources: Data may come from multiple sources with dif-
ferent formats (e.g., CSV files, databases, APIs).
• Missing Values: Some records may lack values for key attributes (e.g., missing
income data in a customer dataset).
• Noise: Data may contain errors or outliers (e.g., a person's age listed as 200 years).
• High Dimensionality: Datasets with too many attributes can complicate analysis
(e.g., thousands of features in a genomic dataset).
• Inconsistent Data: Variations in data entry (e.g., "USA" vs. "United States" for
the same country).
1.3 Steps in Data Preprocessing
• Data Cleaning:
– What is it?: Fixing or removing incorrect, incomplete, or noisy data.
– Techniques:
∗ Handling Missing Values:
· Ignore the Record: Remove rows with missing values if the dataset is
large.
· Fill with Mean/Median: Replace missing values with the average or
median (e.g., replacing missing ages with the average age of cus-
tomers).
· Predict Missing Values: Use algorithms like k-Nearest Neighbors (k-
NN) to predict missing values based on similar records.
∗ Smoothing Noise: Use techniques like binning, regression, or clustering to
smooth noisy data (e.g., averaging out erratic sales figures).
∗ Removing Duplicates: Identify and delete duplicate records (e.g., remov-
ing repeated customer entries).
∗ Correcting Inconsistencies: Standardize data formats (e.g., converting all
dates to "YYYY-MM-DD").
Example: In a dataset, a customer's age is listed as -5 (an error). Data
cleaning replaces it with the median age of the dataset, say 30 (see the
sketch after this item).
– Pros: Improves data quality for better mining results.
– Cons: May lead to data loss if too many records are removed.
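The cleaning steps above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up table; the column names (customer_id, age, order_date) and the age-validity rule are hypothetical, and a real pipeline might also use k-NN imputation or regression for predictive filling.

```python
import pandas as pd

# Hypothetical raw sales records showing the problems described above
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "age": [25, 25, -5, None],          # -5 is noise, None is a missing value
    "order_date": ["01/02/2023", "01/02/2023", "2023-02-01", "2023-02-03"],
})

# Removing duplicates
df = df.drop_duplicates()

# Handling noise and missing values: treat impossible ages as missing,
# then fill the gaps with the median age
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = None
df["age"] = df["age"].fillna(df["age"].median())

# Correcting inconsistencies: convert every date to "YYYY-MM-DD"
# (dayfirst= may be needed depending on the source's locale)
df["order_date"] = df["order_date"].apply(
    lambda s: pd.to_datetime(s).strftime("%Y-%m-%d"))

print(df)
```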
• Data Integration:
– What is it?: Combining data from multiple sources into a unified dataset.
– Challenges:
∗ Entity Identification: Matching records that refer to the same entity (e.g.,
"John Smith" in one dataset and "J. Smith" in another).
∗ Schema Integration: Aligning different data structures (e.g., one dataset
uses "CustomerID," another uses "ClientID").
∗ Redundancy: Avoiding duplicate attributes (e.g., "Age" and "YearsOld"
might be the same).
Example: Merging sales data from an online store and a physical
store, ensuring "CustomerID" matches across both datasets.
– Pros: Provides a comprehensive view of the data.
– Cons: Can introduce errors if integration is not done carefully.
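A small pandas sketch of this integration step, assuming two hypothetical sources that name the customer key differently ("CustomerID" vs. "ClientID"):

```python
import pandas as pd

online = pd.DataFrame({"CustomerID": [1, 2], "online_spend": [120.0, 80.0]})
store = pd.DataFrame({"ClientID": [1, 3], "store_spend": [60.0, 40.0]})

# Schema integration: align the key names before merging
store = store.rename(columns={"ClientID": "CustomerID"})

# Entity identification is trivial here (a shared numeric key); real data may
# need fuzzy matching on names or addresses instead
combined = online.merge(store, on="CustomerID", how="outer").fillna(0.0)

# Redundancy would be handled afterwards, e.g., dropping one of "Age" and
# "YearsOld" if both survive the merge
print(combined)
```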
• Data Transformation:
– What is it?: Converting data into a format suitable for mining.
2
– Techniques:
∗ Normalization: Scaling numeric data to a specific range, often [0, 1], to
ensure fair comparisons (e.g., scaling income and age to the same range).
∗ Standardization: Transforming data to have a mean of 0 and a standard
deviation of 1 (e.g., standardizing test scores).
∗ Discretization: Converting continuous data into discrete bins (e.g., group-
ing ages into "Young," "Middle-Aged," "Senior").
∗ Encoding: Converting categorical data into numerical form (e.g., mapping
"Male" to 0 and "Female" to 1).
Example: Normalizing a dataset where income ranges from $20,000 to
$100,000 to a [0, 1] scale, so $60,000 becomes 0.5.
– Pros: Makes data compatible with mining algorithms.
– Cons: May lose some information during transformation.
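The transformations above can be applied directly to a toy table, as in the sketch below. The income/age/gender columns are hypothetical; scikit-learn's MinMaxScaler, StandardScaler, and OneHotEncoder provide the same operations for larger pipelines.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [20000, 60000, 100000],
    "age": [22, 35, 60],
    "gender": ["Male", "Female", "Male"],
})

# Normalization (min-max) to [0, 1]: (x - min) / (max - min)
# -> an income of 60,000 maps to 0.5, matching the example above
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())

# Standardization (z-score): (x - mean) / std -> mean 0, std 1
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Encoding categorical data as numbers (one-hot encoding is often safer
# than 0/1 labels for non-ordinal categories)
df["gender_code"] = df["gender"].map({"Male": 0, "Female": 1})

print(df[["income_norm", "age_std", "gender_code"]])
```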
• Data Reduction:
– What is it?: Reducing the size of the dataset while preserving its essential
information.
– Techniques:
∗ Dimensionality Reduction: Removing irrelevant or redundant attributes
using methods like Principal Component Analysis (PCA).
∗ Numerosity Reduction: Replacing data with smaller representations (e.g.,
using histograms instead of raw data).
∗ Data Compression: Compressing data to save space (e.g., storing sales
data as aggregates).
∗ Sampling: Selecting a subset of data (e.g., random sampling to reduce a
million records to 10,000).
Example: Using PCA to reduce a dataset with 100 features to 10 key
features for faster analysis.
– Pros: Speeds up mining and reduces storage needs.
– Cons: May lose some patterns or details.
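A short sketch of two of these reduction techniques, dimensionality reduction with PCA and random sampling, on synthetic data (the 1,000 × 100 size and the 1% sampling rate are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(1000, 100)))   # 1,000 records, 100 features

# Dimensionality reduction: keep the 10 principal components that capture
# the most variance in the original 100 features
X_reduced = PCA(n_components=10).fit_transform(X)
print(X_reduced.shape)        # (1000, 10)

# Numerosity reduction by sampling: keep a random 1% subset of the records
sample = X.sample(frac=0.01, random_state=42)
print(len(sample))            # 10
```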
• Data Discretization:
– What is it?: Converting continuous data into discrete categories.
– Techniques:
∗ Binning: Grouping values into bins (e.g., dividing income into "Low,"
"Medium," "High").
∗ Histogram Analysis: Using histograms to define bins based on data dis-
tribution.
∗ Clustering: Grouping similar values into clusters (e.g., clustering temper-
atures into "Cold," "Warm," "Hot").
Example: Discretizing a temperature dataset into "Cold" (<10°C),
"Warm" (10-25°C), and "Hot" (>25°C), as in the sketch after this item.
– Pros: Simplifies data for certain algorithms like decision trees.
– Cons: May reduce precision of the data.
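The temperature example can be reproduced with pandas.cut, which assigns each value to a user-defined bin; a minimal sketch:

```python
import pandas as pd

temps = pd.Series([-3, 8, 12, 19, 27, 33])

labels = pd.cut(
    temps,
    bins=[float("-inf"), 10, 25, float("inf")],   # bin edges at 10°C and 25°C
    labels=["Cold", "Warm", "Hot"],
)
print(labels.tolist())   # ['Cold', 'Cold', 'Warm', 'Warm', 'Hot', 'Hot']
```

pandas.qcut would instead produce equal-frequency bins, which is closer to histogram-based discretization on skewed data.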
1.4 Applications of Data Preprocessing
• Machine Learning: Preparing data for algorithms like classification or clustering
(e.g., cleaning a dataset for a spam email classifier).
• Business Analytics: Ensuring sales data is accurate for forecasting (e.g., remov-
ing outliers from sales records).
• Healthcare: Cleaning patient data for predictive modeling (e.g., handling missing
blood pressure readings).
• Social Media Analysis: Standardizing user data for sentiment analysis (e.g.,
unifying location formats in tweets).
1.5 Challenges in Data Preprocessing
• Time-Consuming: Preprocessing is often estimated to consume the majority of the
effort in a data mining project, with figures of up to 80% commonly cited.
• Data Loss Risk: Aggressive cleaning or reduction may remove important patterns.
• Complexity: Handling large, heterogeneous datasets requires expertise.
• Bias Introduction: Improper preprocessing can introduce biases (e.g., over-sampling
a minority class).
2 Data Warehousing
2.1 What is a Data Warehouse?
• Definition: A centralized repository that stores large volumes of historical data
from multiple sources, optimized for analysis and reporting.
• Characteristics:
– Subject-Oriented: Focuses on specific subjects (e.g., sales, customers) rather
than operational processes.
– Integrated: Combines data from different sources into a consistent format.
– Non-Volatile: Data is stable and not updated in real time (e.g., historical sales
data isn't changed).
– Time-Variant: Stores historical data for long-term analysis (e.g., sales trends
over years).
Example: A retail company's data warehouse stores sales, inventory, and
customer data from all its stores for trend analysis.
2.2 Why Data Warehousing is Important
• Supports Decision-Making: Provides a unified view of data for strategic deci-
sions (e.g., identifying best-selling products).
• Efficient Querying: Optimized for complex analytical queries, unlike operational
databases.
• Historical Analysis: Enables trend analysis over long periods (e.g., sales patterns
over a decade).
• Data Consolidation: Integrates data from disparate sources (e.g., merging sales
data from online and physical stores).
2.3 Architecture of a Data Warehouse
• Three-Tier Architecture:
– Bottom Tier (Data Sources): Raw data from operational databases, external
sources (e.g., CRM systems, IoT devices).
– Middle Tier (Data Warehouse Server): Stores integrated and cleaned data,
often using a relational database (e.g., Oracle, SQL Server).
– Top Tier (Client Layer): Tools for querying and reporting (e.g., BI tools like
Tableau, Power BI).
• Components:
– ETL Process (Extract, Transform, Load):
∗ Extract: Collect data from various sources.
∗ Transform: Clean and transform data (e.g., standardize formats, remove
duplicates).
∗ Load: Store the transformed data into the warehouse.
– Metadata: Data about the data (e.g., source, format, update time).
– Data Marts: Subsets of the warehouse for specific departments (e.g., a mar-
keting data mart).
Example: A company extracts sales data from its POS system, transforms
it by cleaning duplicates, and loads it into a data warehouse for analysis.
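A minimal ETL sketch in Python; the CSV file name, its columns, and the SQLite database are hypothetical stand-ins for a real POS export and a production warehouse server:

```python
import sqlite3
import pandas as pd

# Extract: read raw sales exported from a (hypothetical) POS system,
# e.g., columns sale_id, date, product, amount
raw = pd.read_csv("pos_sales.csv")

# Transform: remove duplicate transactions and standardize the date format
raw = raw.drop_duplicates(subset="sale_id")
raw["date"] = pd.to_datetime(raw["date"]).dt.strftime("%Y-%m-%d")

# Load: append the cleaned rows into a warehouse fact table
# (SQLite used here only for illustration)
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_sales", conn, if_exists="append", index=False)
```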
2.4 Data Warehouse Schemas
• Star Schema:
– Structure: A central fact table (e.g., sales) connected to multiple dimension
tables (e.g., time, product, customer).
– Pros: Simple and fast for querying.
– Cons: May lead to redundancy in dimension tables.
• Snowflake Schema:
– Structure: Like a star schema, but dimension tables are normalized into sub-
tables (e.g., a "product" table splits into "category" and "subcategory").
– Pros: Reduces redundancy, saves storage.
– Cons: More complex, slower queries due to additional joins.
• Galaxy Schema (Fact Constellation):
– Structure: Multiple fact tables sharing dimension tables (e.g., sales and inven-
tory fact tables sharing a time dimension).
– Pros: Supports complex analysis across multiple subjects.
– Cons: Complex to design and maintain.
Example: A star schema with a sales fact table (containing revenue,
quantity) connected to dimension tables like time (date, month) and product
(name, category).
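The star schema in this example can also be written as SQL, shown here through SQLite purely for illustration (table and column names are hypothetical); the query is a typical fact-to-dimension join that aggregates revenue by year and product category:

```python
import sqlite3

ddl = """
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (time_id INTEGER, product_id INTEGER, quantity INTEGER, revenue REAL);
"""

query = """
SELECT t.year, p.category, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_time t    ON f.time_id = t.time_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY t.year, p.category;
"""

with sqlite3.connect(":memory:") as conn:
    conn.executescript(ddl)           # build the (empty) star schema
    for row in conn.execute(query):   # the tables are empty, so nothing prints;
        print(row)                    # the point is the fact-to-dimension join shape
```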
2.5 Challenges in Data Warehousing
• Data Integration: Combining data from heterogeneous sources can lead to in-
consistencies.
• Scalability: Warehouses must handle growing data volumes (e.g., terabytes of
historical data).
• ETL Complexity: Extracting, transforming, and loading large datasets is resource-
intensive.
• Data Quality: Poor-quality data in the warehouse can lead to unreliable insights.
• Cost: Building and maintaining a data warehouse is expensive (e.g., hardware,
software, personnel).
2.6 Applications of Data Warehousing
• Business Intelligence: Generating reports and dashboards (e.g., sales perfor-
mance reports).
• Trend Analysis: Identifying long-term patterns (e.g., seasonal sales trends).
• Forecasting: Predicting future outcomes (e.g., demand forecasting for inventory).
• Market Research: Analyzing customer behavior across regions and time periods.
3 OLAP (Online Analytical Processing)
3.1 What is OLAP?
• Definition: OLAP is a technology that enables multidimensional analysis of
data in a data warehouse, allowing users to perform complex queries for decision-
making.
• Key Features:
– Multidimensional View: Data is organized into dimensions (e.g., time, prod-
uct) and measures (e.g., sales revenue).
– Fast Querying: Optimized for analytical queries, not transactional updates.
– Interactive Analysis: Users can slice, dice, drill down, or roll up data interac-
tively.
Example: A manager uses OLAP to analyze sales data by region, product,
and month to identify top-performing regions.
3.2 Types of OLAP Systems
• MOLAP (Multidimensional OLAP):
– What is it?: Stores data in multidimensional cubes (e.g., a cube with dimen-
sions time, product, region).
– Pros: Fast query performance due to precomputed aggregates.
– Cons: Limited scalability for very large datasets.
• ROLAP (Relational OLAP):
– What is it?: Uses relational databases to store data, performing multidimen-
sional analysis via SQL queries.
– Pros: Scales well for large datasets.
– Cons: Slower query performance compared to MOLAP.
• HOLAP (Hybrid OLAP):
– What is it?: Combines MOLAP and ROLAP, storing detailed data in a rela-
tional database and aggregates in a cube.
– Pros: Balances speed and scalability.
– Cons: More complex to implement.
Example: A MOLAP system precomputes sales aggregates for quick
retrieval, while a ROLAP system queries raw sales data dynamically.
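A toy contrast between the two storage strategies, sketched with pandas (purely illustrative; real MOLAP and ROLAP engines manage cube storage and SQL generation themselves):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["USA", "USA", "EU", "EU"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 150, 80, 120],
})

# MOLAP-style: precompute the aggregate "cube" once, then answer queries
# by looking up the stored aggregates
cube = sales.groupby(["region", "product"])["revenue"].sum()
print(cube.loc[("USA", "A")])                     # fast lookup: 100

# ROLAP-style: keep only the detailed rows and aggregate on demand at query time
print(sales[(sales["region"] == "USA") & (sales["product"] == "A")]["revenue"].sum())
```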
3.3 OLAP Operations
• Drill-Down: Zooming into more detailed data (e.g., from yearly sales to monthly
sales).
• Roll-Up: Aggregating data to a higher level (e.g., from monthly sales to yearly
sales).
• Slice: Selecting one dimension to focus on (e.g., sales for a specific year).
• Dice: Selecting a subset of dimensions (e.g., sales for specific years and regions).
• Pivot: Rotating the data axes to view it from different perspectives (e.g., switching
rows and columns in a report).
Example: Drilling down from total sales in 2024 to sales by quarter, then
slicing to see Q1 sales in the USA.
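These operations can be mimicked on a small pandas table to make them concrete (the year/quarter/region/revenue columns are hypothetical); an OLAP tool would run the equivalent queries against the warehouse:

```python
import pandas as pd

sales = pd.DataFrame({
    "year": [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region": ["USA", "EU", "USA", "EU"],
    "revenue": [100, 80, 120, 90],
})

# Roll-up: aggregate quarterly figures to yearly totals
print(sales.groupby("year")["revenue"].sum())

# Drill-down: break yearly figures out by quarter
print(sales.groupby(["year", "quarter"])["revenue"].sum())

# Slice: fix one dimension (quarter == "Q1")
print(sales[sales["quarter"] == "Q1"])

# Dice: fix a subset of several dimensions (Q1 sales in the USA)
print(sales[(sales["quarter"] == "Q1") & (sales["region"] == "USA")])

# Pivot: rotate the axes -> quarters as rows, regions as columns
print(sales.pivot_table(index="quarter", columns="region",
                        values="revenue", aggfunc="sum"))
```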
3.4 OLAP vs. OLTP (Online Transaction Processing)
• OLTP:
– Purpose: Handles day-to-day transactions (e.g., updating a customer's order
in a database).
– Characteristics: Real-time updates, small transactions, normalized data.
• OLAP:
– Purpose: Analytical queries for decision-making (e.g., analyzing sales trends).
– Characteristics: Read-heavy, complex queries, denormalized data.
Example: OLTP updates a bank transaction in real time, while OLAP
analyzes transaction trends over a year.
3.5 Challenges in OLAP
• Performance: Complex queries on large datasets can be slow without proper
optimization.
• Data Volume: Handling massive data requires efficient storage and indexing.
• Cube Explosion: Precomputing all possible aggregates in MOLAP can lead to
storage issues.
• Data Freshness: OLAP systems often use historical data, which may not reflect
recent changes.
3.6 Applications of OLAP
• Business Reporting: Generating sales, financial, or inventory reports (e.g., quar-
terly sales analysis).
• Forecasting: Predicting future trends (e.g., predicting next year's sales based on
historical data).
• Market Analysis: Analyzing customer demographics and buying patterns.
• Budgeting: Planning budgets based on historical spending patterns.
4 Importance of Data Preprocessing, Warehousing, and OLAP
• Foundation for Data Mining: Preprocessing ensures high-quality data, ware-
housing provides a structured repository, and OLAP enables analytical queries, all
of which are prerequisites for advanced tasks like mining data streams (Unit 4).
• Improved Decision-Making: Clean data, centralized storage, and multidimen-
sional analysis lead to better business decisions.
• Efficiency: Reduces errors and speeds up the mining process by starting with
well-prepared data.
• Scalability: Warehouses and OLAP systems handle large datasets, making them
suitable for big data applications.
5 Challenges in Data Preprocessing, Warehousing, and OLAP
• Data Quality: Poor-quality data affects all stages, from preprocessing to OLAP
analysis.
• Complexity: Designing and maintaining warehouses and OLAP systems requires
expertise.
• Resource Intensive: Preprocessing, ETL processes, and OLAP querying demand
significant computational resources.
• Evolving Data Needs: Businesses constantly change their analytical needs, re-
quiring flexible systems.
Conclusion
Unit 2 lays the groundwork for data mining by covering data preprocessing, data ware-
housing, and OLAP. These topics ensure that raw data is cleaned, organized, and stored
effectively, enabling multidimensional analysis for decision-making. The 11-hour duration
allows for an in-depth exploration of techniques like data cleaning, ETL processes, and
OLAP operations, preparing students for real-world applications in business intelligence,
trend analysis, and forecasting. By mastering these concepts, students build a strong
foundation for advanced data mining tasks, such as mining data streams in Unit 4, and
can handle the complexities of large-scale data analysis in modern systems.