Unit 2 - Data Preprocessing

This document provides an overview of data preprocessing techniques for data mining. It discusses why data preprocessing is important, common reasons why real-world data is dirty or incomplete, and major tasks in data preprocessing including data cleaning, integration, transformation, reduction, and discretization. Specific techniques are described for handling missing data, noisy data, and data integration. Data transformation techniques like normalization, aggregation, and attribute construction are also covered. The goal of data preprocessing is to prepare raw data for data mining by cleaning noise and inconsistencies, filling in missing values, and reducing data volume so the results of data mining algorithms are more accurate and useful.

Uploaded by

vikasbhowate


St. Vincent Pallotti College of Engineering & Technology

Data Warehousing and Mining

(BEIT701T)
7th Sem B.E. (IT)
Presented By

Samir Siddiqui
CR FINAL YEAR IT
Department of Information Technology

December 22, 2022 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records
Why Is Data Dirty?
* Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
* Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
* Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
* Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
* Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse


Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing


How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
* Fill in the missing value manually: tedious and often infeasible
* Fill it in automatically with
  - a global constant, e.g., “unknown” (which may end up acting as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
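
The class-wise mean strategy listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the attribute names `income` and `risk`, and the helper name `fill_with_class_mean` are hypothetical and do not come from the slides.

```python
from statistics import mean

def fill_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing (None) value of `attr`
    is replaced by the mean of `attr` over rows with the same `label`."""
    # Collect the known values of the attribute per class label.
    by_class = {}
    for row in rows:
        if row[attr] is not None:
            by_class.setdefault(row[label], []).append(row[attr])
    # Fallback: plain attribute mean, used when a class has no known value.
    # (Assumes at least one known value exists overall.)
    overall = mean(v for vals in by_class.values() for v in vals)

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        if new_row[attr] is None:
            vals = by_class.get(new_row[label])
            new_row[attr] = mean(vals) if vals else overall
        filled.append(new_row)
    return filled

# Hypothetical sample data.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
```

Here the missing "low" income is filled with the mean of the two known "low" incomes, and the missing "high" income with the single known "high" income.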

How to Handle Noisy Data?
* Binning
  - first sort data and partition into (equal-frequency) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
* Regression
  - smooth by fitting the data into regression functions
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and check by human (e.g., deal with possible outliers)


Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
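
The worked example above can be reproduced with a short Python sketch. The function names are illustrative, the bin means are rounded to whole dollars as on the slide, and two simplifying assumptions are made: the number of values divides evenly by the number of bins, and a value equidistant from both boundaries goes to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.
    Assumes len(sorted_values) is divisible by n_bins."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max.
    Ties go to the lower boundary (an assumption of this sketch)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75, and 29.25; the slide shows them rounded to 9, 23, and 29.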
Regression
[Figure: regression line y = x + 1 on x-y axes; the value observed at X1 is smoothed to the fitted value Y1’.]
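
As a rough sketch of smoothing by regression, the snippet below fits a straight line by ordinary least squares and replaces each observed y with the value on the fitted line. The sample points are made up for illustration; they were chosen as noisy readings around y = x + 1 so that the fitted line matches the one named on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y by its fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.5, 2.0, 5.0, 4.0, 6.5]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

A single linear fit is the simplest case; multiple regression or a nonlinear model follows the same idea of replacing noisy values with fitted ones.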


Cluster Analysis

Data Integration
* Data integration:
  - Combines data from multiple sources into a coherent store
* Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
* Entity identification problem:
  - Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
* Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones


Data Transformation: Normalization
* Min-max normalization: to [new_min_A, new_max_A]
  \[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
* Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. Let μ_A = 54,000, σ_A = 16,000. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
* Normalization by decimal scaling
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that max(|v'|) < 1
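
The three normalization formulas translate directly into code. The sketch below is a minimal illustration; the function names are made up here, the first two calls reuse the income figures from the slide, and the values passed to `decimal_scaling` are hypothetical.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1.
    Returns the scaled values and j."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73600, 12000, 98000)   # about 0.716
income_z = z_score(73600, 54000, 16000)         # 1.225
scaled, j = decimal_scaling([-986, 917])        # hypothetical values, j == 3
```

Note that min-max normalization needs the attribute's minimum and maximum in advance, so a later value outside that range would fall outside [new_min, new_max].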
Data Reduction Strategies
* Why data reduction?
  - A database/data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction, e.g., remove unimportant attributes
  - Data compression
  - Numerosity reduction, e.g., fit data into models
  - Discretization and concept hierarchy generation


Data Cube Aggregation
* The lowest level of a data cube (base cuboid)
  - The aggregated data for an individual entity of interest
  - E.g., a customer in a phone calling data warehouse
* Multiple levels of aggregation in data cubes
  - Further reduce the size of data to deal with
* Reference appropriate levels
  - Use the smallest representation which is enough to solve the task
* Queries regarding aggregated information should be answered using data cube, when possible
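
To make the idea of moving to a higher aggregation level concrete, here is a small sketch that rolls quarterly call charges up to yearly totals per customer. The record fields (`customer`, `year`, `quarter`, `amount`) and the helper name `roll_up` are assumptions made for this example only.

```python
from collections import defaultdict

def roll_up(records, keep):
    """Sum `amount` over all records that agree on the dimensions in `keep`,
    dropping every other dimension (a roll-up to a coarser level)."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[dim] for dim in keep)
        totals[key] += rec["amount"]
    return dict(totals)

# Hypothetical base-level data: one row per customer, year, and quarter.
calls = [
    {"customer": "C1", "year": 2021, "quarter": "Q1", "amount": 10.0},
    {"customer": "C1", "year": 2021, "quarter": "Q2", "amount": 15.0},
    {"customer": "C1", "year": 2022, "quarter": "Q1", "amount": 20.0},
    {"customer": "C2", "year": 2021, "quarter": "Q1", "amount": 5.0},
    {"customer": "C2", "year": 2021, "quarter": "Q4", "amount": 7.5},
]

by_customer_year = roll_up(calls, ("customer", "year"))
by_year = roll_up(calls, ("year",))
```

If a task only needs yearly figures, the three-entry `by_customer_year` table can replace the five base rows, which is the size reduction the slide refers to.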
Attribute Subset Selection
* Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
* Heuristic methods (due to exponential # of choices):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

                 A4?
               /     \
            A1?       A6?
           /   \     /   \
     Class 1 Class 2 Class 1 Class 2

> Reduced attribute set: {A1, A4, A6}
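
The step-wise forward selection heuristic named on the attribute subset selection slide can be sketched as a greedy loop: start from an empty set and keep adding the attribute that improves a quality score the most, stopping when nothing helps. The scoring function below is a toy stand-in with made-up usefulness weights; in practice it might be something like cross-validated accuracy or information gain, which the slides do not specify.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedy step-wise forward selection.
    `score(subset)` returns a number where higher is better."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Pick the attribute whose addition gives the highest score.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best <= min_gain:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Hypothetical usefulness of each attribute (illustration only).
usefulness = {"A1": 0.3, "A2": 0.0, "A3": 0.0, "A4": 0.5, "A5": 0.0, "A6": 0.2}

def toy_score(subset):
    return sum(usefulness[a] for a in subset)

reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

With these toy weights the loop picks A4, then A1, then A6, and stops, which matches the reduced attribute set in the decision tree example. Backward elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.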


Dimensionality Reduction: Principal Component Analysis (PCA)
* Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
* Steps
  - Normalize input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing “significance” or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
* Works for numeric data only
* Used when the number of dimensions is large
Principal Component Analysis
[Figure: data points plotted with the principal component axes Y1 and Y2.]
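
The PCA steps from the previous slide can be written out as follows. This is the standard covariance-eigenvector formulation, given here as a sketch; the symbols (x_i for the normalized data vectors, Σ for the covariance matrix, u_j and λ_j for its eigenvectors and eigenvalues, U_k for the matrix of the k strongest components) are notation introduced for this note and do not appear in the slides.

```latex
% Mean and covariance of the N normalized data vectors x_i in R^n
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i ,
\qquad
\Sigma = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{\top}

% Principal components: orthonormal eigenvectors, sorted by decreasing variance
\Sigma u_j = \lambda_j u_j ,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0 ,
\qquad
u_i^{\top} u_j = \delta_{ij}

% Reduction to k dimensions and approximate reconstruction
U_k = [\, u_1 \; u_2 \; \cdots \; u_k \,] ,
\qquad
z_i = U_k^{\top} (x_i - \bar{x}) ,
\qquad
\hat{x}_i = \bar{x} + U_k z_i

% Fraction of total variance kept by the k strongest components
\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}
```

Dropping the components with small λ_j is what the slide calls eliminating the weak components; the retained-variance fraction gives one way to choose k.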


Chapter 2: Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary
Summary
* Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
* Descriptive data summarization is needed for quality data preprocessing
* Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
* Many methods have been developed, but data preprocessing is still an active area of research
Question Bank
* Q1. What is the need for data preprocessing? Explain in brief. [6M][S-17], [6M][W-16], [5M][S-19]
* Q2. Summarize the data preprocessing steps in brief. [7M][W-17], [7M][S-18]
* Q3. What is data cleaning? Explain different methods of data cleaning. [7M][W-17], [6M][W-16]
* Q4. What is data transformation? Explain different methods of transformation. [8M][S-17]
* Q5. Write short notes on:
  - a. Missing value b. Noisy data c. Cluster d. Outlier
* Q6. Write a short note on data cleaning. OR How can data cleaning be handled in preprocessing? [6M][S-18], [3M][S-16]
* Q7. What is data reduction? Explain different methods of data reduction. [7M][W-17], [7M][S-18], [4M][S-16], [7M][W-16], [4M][S-19]
* Q8. What is normalization? Explain various types of normalization techniques with examples. [7M][S-18]


Question Bank
* Q9. Explain data discretization and concept hierarchy generation. [6M][S-17], [7M][S-19]
* Q10. What are the measures of data dispersion? [4M][S-19]
* Q11. What is the need for multidimensional analysis? [5M][S-16]
* Q12. Write short notes on:
  - a. Binning b. Regression c. Clustering d. Smoothing
  - e. Generalization f. Aggregation
* Q13. Explain min-max normalization and z-score normalization. [7M][W-17], [4M][S-16], [6M][S-19]
* Q14. Explain the various issues to be considered in data integration. Also give the various forms of preprocessing. [6M][S-16]
* Q15. What are the challenges in data preprocessing?
