
Session 2: Data Preprocessing

ITEC5310- DATA MINING

Some figures are taken from two textbooks:


Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012.
Max Bramer, Principles of Data Mining, Springer, 2007.
Contents:

 Data Mining and Knowledge Discovery Process.


 Techniques/Problems in Data Mining
 Why data preprocessing?
 What does data preprocessing do?
Knowledge Discovery Process
• Knowledge Discovery Process (KDP): from data to knowledge.
• Many models of the KDP exist.

• Data mining: one step/component in the KDP.

• This DM course focuses on the main problems in DM, but the whole KDP must be kept in mind.
KDP Models
• A few KDP (Knowledge Discovery Process) models.
Estimate the time cost of each step in the KDP.
Why Preprocess the Data
• Objective: improve data quality  get better results in DM.
• Raw data:
• Structured, semi-structured, or unstructured.
• From many different sources.
• Requirements (data for DM):
• Accuracy: values are real and exact.
• Currency/Timeliness: not legacy data; useful at the present time.
• Completeness: all values of a feature are collected.
• Consistency: the same meaning is represented by a single value in all cases
(Male/Yes/1/Nam  one unique value).
Major Tasks in Data Preprocessing
• Data Cleaning:
• Missing Values
• Noisy Data…
• Data Integration
• Entity Identification Problem
• Redundancy and Correlation Analysis
• Tuple Duplication
• ..
• Data Reduction
• Dimensionality reduction
• Numerosity reduction
• Data Transformation
• Smoothing
• Attribute construction
• Aggregation, Normalization, Discretization
Data Cleaning
• Data always has errors: noise, invalid values, inconsistency  bad
results in Data Mining.
• Noise: exists in raw data; wrong but valid in form!
• 9.678 but 19.678
• BBT but ABB
• Invalid values: easy to detect!
• 26.7a8
• TpHHCM
• Inconsistency: one meaning but many representations!
• TpHCM, HCMC, SaiGon
• 25/7/2010, 07/25/2010
• Problem: detecting noise, invalid data, and inconsistent data in a
dataset with many rows and many features.
Data Cleaning (cont.)
• How to detect and correct?
• Use tools/applications.
• Write custom programs.
• Prediction and heuristics.
• Example
• An attribute only takes integer values in the range 0..5.
• An attribute only takes nominal values in the set: brown/blue/black/white.
• When errors are detected in raw data  do not erase the wrong data
immediately; try to find the cause of the mistake!
• Example
• An attribute allows a real value >= 200.
• Checking detects suspicious values: 22654.8, 38597 and 44625.7.
• Some values appear with abnormal frequencies.
• Checking detects: 25% of users selected Country = Albany!
• Abnormal objects in the dataset.
• Checking detects many students at the age of 127!
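A minimal sketch of such rule-based checking (the column names, rules, and data below are hypothetical, not from the slides):

```python
import pandas as pd

# Hypothetical table with two constrained attributes.
df = pd.DataFrame({"rating": [3, 7, 2],
                   "eye_color": ["blue", "green", "black"]})

# Rule 1: integer attribute restricted to the range 0..5.
bad_rating = ~df["rating"].between(0, 5)

# Rule 2: nominal attribute restricted to a fixed domain.
bad_color = ~df["eye_color"].isin({"brown", "blue", "black", "white"})

# Flag suspicious rows instead of deleting them,
# so the cause of the mistake can be investigated first.
print(df[bad_rating | bad_color])   # flags rating=7 and eye_color="green"
```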
Data Cleaning (cont.)
• Missing data: how to handle it.
• One or more features have no value in the raw data.
• Causes:
• Objective: the value did not exist at the time of data entry, incidents, etc.
• Subjective: equipment, human, etc.
• Handling:
• Remove tuples with missing data.
• Manually input the missing data.
• Use a global constant to describe it (unknown, −∞, etc.).
• Use attribute statistics: mean, median.
• Use mean/median values from attributes within the same class (see the sketch below).
• Use models to predict the value: regression, Bayes, etc.
• Preventing missing data:
• Good database design is needed.
• Set up data entry procedures (data constraints).
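A minimal pandas sketch of mean imputation, both global and per class (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data with a missing 'income' value; 'label' is the class.
df = pd.DataFrame({"label": ["A", "A", "B", "B"],
                   "income": [30.0, None, 50.0, 54.0]})

# Global mean imputation.
filled_global = df["income"].fillna(df["income"].mean())

# Per-class mean imputation: fill using the mean of the same class only.
filled_by_class = df["income"].fillna(
    df.groupby("label")["income"].transform("mean"))
print(filled_by_class)   # the missing A-row gets 30.0, the mean of class A
```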
Data Cleaning (cont.)
• Noisy Data:
• Values within the valid range that are nevertheless suspected to be errors.
• Causes: similar to the causes of missing data.
• Handling: smooth the data.
• Binning: also used for discretization.
• Used with numeric data.
• Sort the values and “smooth” each “block” (bin) using the values it contains.
• For example (see the sketch below):
• 9 values, sorted and divided into 3 buckets.
• Solution 1: smoothing by bin means.
• Solution 2: smoothing by bin medians.
• Solution 3: smoothing by bin boundaries.
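A numpy sketch of two of these strategies, using the nine-value price example from [3]:

```python
import numpy as np

# Nine sorted values split into 3 equal-frequency bins, as in [3].
values = np.sort([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = values.reshape(3, 3)

# Solution 1: smoothing by bin means.
by_means = np.repeat(bins.mean(axis=1), 3)   # [9 9 9 22 22 22 29 29 29]

# Solution 3: smoothing by bin boundaries - each value is replaced
# by the closer of its bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
print(by_bounds.ravel())   # [ 4  4 15 21 21 24 25 25 34]
```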
Data Cleaning (cont.)
• Noisy Data (cont.):
• Regression:
• Linear regression.
• Multivariate regression.
• Outlier analysis (see the sketch below).
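A hedged sketch combining the two ideas: fit a least-squares line, then flag points with unusually large residuals as candidate outliers (the data and the 2-sigma threshold are assumptions, not from the slides):

```python
import numpy as np

# Hypothetical roughly-linear data with one noisy point (30.0).
x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0, 12.1, 13.8, 16.2])

a, b = np.polyfit(x, y, deg=1)            # linear regression y ≈ a*x + b
residuals = y - (a * x + b)
outliers = np.abs(residuals) > 2 * residuals.std()
print(np.where(outliers)[0])              # [4] - the noisy value 30.0
```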
Data Integration
• Data Integration: the process of combining data from different sources.
• Problems:
Entity identification problem
 Schema integration
 Object matching
Data redundancy
Tuple duplication
Data value conflicts
• Requires: understanding the data structure, heterogeneity, and
semantics of the data.
• Results:
 Reduce/avoid redundancy and inconsistency in the data.
 Improve the speed and accuracy of DM.
Data Integration (cont.)
Entity identification problem
• Cause: data is integrated from multiple sources with diverse data types:
databases, data cubes, “flat files”.
• Schema integration: detect and unify data schemas.
• Object matching: identify duplicate attributes and information.
• Example
• MaNhanVien vs. MANV.
• Ethnic: Kinh/Hoa/… vs. 0/1/…
• Regions, special values (null, 0, …)
• …
• Metadata is needed to understand and correct such mismatches.
Data Integration (cont.)
 Redundancy: values of one feature depend on one or more other features.
 Use correlation analysis.
 Nominal attributes: use the Chi-Square (χ²) test:
χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ, where oᵢⱼ is the observed and eᵢⱼ the expected frequency.
 Numeric attributes: use the correlation coefficient and covariance.

Example (gender vs. preferred reading, from [3]):
Degrees of freedom: (r−1)×(c−1) = (2−1)×(2−1) = 1.
At significance level 0.001, the critical value is 10.828  10.828 < 507.93
Reject the independence hypothesis!
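A sketch reproducing this test with scipy, using the 2×2 contingency table from the corresponding example in [3] (gender as rows, preferred reading as columns):

```python
import numpy as np
from scipy.stats import chi2_contingency

#                 fiction  non-fiction
table = np.array([[250,     50],     # male
                  [200,   1000]])    # female

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof)   # ≈ 507.93, dof = 1
# p is far below 0.001, so the independence hypothesis is rejected:
# gender and preferred reading are correlated in this sample.
```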
Data Integration (cont.)
Redundancy (numeric attributes):

Correlation coefficient (Pearson's):
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n·σA·σB), with −1 ≤ r(A,B) ≤ 1;
the larger |r(A,B)|, the stronger the linear relationship.

Covariance:
Cov(A,B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n,
so r(A,B) = Cov(A,B) / (σA·σB).
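A numpy sketch computing both measures for two hypothetical numeric attributes:

```python
import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])      # hypothetical attribute A
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])   # hypothetical attribute B

r = np.corrcoef(a, b)[0, 1]           # Pearson correlation coefficient
cov = np.cov(a, b, bias=True)[0, 1]   # population covariance (divide by n)
print(round(r, 3), cov)               # r close to +1 suggests redundancy
```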
Data Integration (cont.)
Tuple duplication
• Besides attribute redundancy (one feature related to others), data can
repeat or be redundant at the tuple level.
• Examples (see the sketch below):
Using full name and address instead of a worker ID.
The same customer appearing with different addresses.
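A pandas sketch of detecting such near-duplicate tuples (the table, column names, and normalization rule are hypothetical):

```python
import pandas as pd

# The same customer appears twice with slightly different address strings.
df = pd.DataFrame({
    "fullname": ["An Nguyen", "An Nguyen", "Binh Tran"],
    "address":  ["12 Le Loi", "12 Le Loi St.", "5 Hai Ba Trung"],
})

# Exact duplicates: df.drop_duplicates(). Near-duplicates need
# normalization first, e.g. canonicalizing the address strings.
df["address_norm"] = (df["address"].str.lower()
                      .str.replace(r"\bst\.?$", "", regex=True)
                      .str.strip())
print(df.drop_duplicates(subset=["fullname", "address_norm"]))
```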
 Data value conflicts
• Causes: encoding, scaling, representation.
• Example:
Fahrenheit vs. Celsius.
GPA mark ranges (A..D vs. 1..4).
• Causes from semantics:
• Example:
Sales: the sales of a store (in table A) but of a whole area (in table B).
Data Reduction
• Data Reduction: a reduced representation of the data that preserves the
integrity of the original but is much smaller in volume.
• Techniques:
 Attribute subset selection: construct a subset of attributes (n' < n)
while retaining as much of the dataset's information as possible.

 Dimensionality reduction: e.g., Principal Component Analysis (PCA).

 Numerosity reduction: e.g., histograms, clustering, sampling.

 Data compression: lossless and lossy.
Data Reduction (cont.)

Feature selection: stepwise forward selection, stepwise backward elimination, …

Principal Component Analysis (see the sketch below).
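A hedged PCA sketch with scikit-learn, projecting hypothetical 4-dimensional data onto 2 principal components (all data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-dimensional data; column 2 is nearly a copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance kept by each component
```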


Data Transformation
• Data Transformation: transform/combine data into forms suitable for the
DM process.
• Data transformation techniques:
Data smoothing: using binning, regression, clustering.
Data aggregation: commonly used to create data cubes and to analyze
multi-level summary data (e.g., revenue by year-branch-product group-item).
Attribute/feature construction: to serve effective data mining (e.g., date
of birth → year of birth).
Hierarchical structuring for categorical attributes: Address → hierarchy
by street/district/city/…
Data Transformation (cont.)
• Data transformation techniques (cont.)
• Normalization: transform attribute values into a specified value range.
• Min-max normalization: [minA, maxA]  [newMinA, newMaxA]:
v' = (v − minA) / (maxA − minA) × (newMaxA − newMinA) + newMinA

• Z-score normalization: uses the mean and standard deviation:
v' = (v − Ā) / σA

• Decimal scaling: v' = v / 10^j,
where j is the smallest integer such that max(|v'ᵢ|) < 1.

Example: [−986, 917], max = 986, select j = 3 (because 10³ = 1000 > 986)  [−0.986, 0.917]
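A numpy sketch of the three normalizations (the sample vector is hypothetical; min-max targets [0, 1] here):

```python
import numpy as np

v = np.array([-986.0, -200.0, 300.0, 917.0])  # hypothetical attribute values

# Min-max normalization to [newMin, newMax] = [0, 1].
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: j is the smallest integer with max(|v|) / 10**j < 1
# (an exact power of ten would need j + 1).
j = int(np.ceil(np.log10(np.abs(v).max())))
print(j, v / 10**j)   # j = 3 -> [-0.986 -0.2  0.3  0.917]
```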
Data Transformation (cont.)
• Data transformation techniques (cont.)
• Discretization (of features)
Many methods/algorithms:

Discretization by binning.
Using histograms: equal-width, equal-frequency.
Using clustering, decision trees, correlation analysis.

Some algorithms use class information  better results, but only usable
with labeled data (classification problems).
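A pandas sketch of the two histogram-based flavors, equal-width and equal-frequency (the prices are the same nine-value example used above):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width discretization: 3 intervals of equal length.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency discretization: 3 intervals with ~equal counts.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```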
Preprocessing data with WEKA
Reading Chapter
1. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine
Learning Tools and Techniques, Third Edition, Morgan Kaufmann, 2011.
2. David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining, MIT
Press, 2001
3. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Elsevier, 2012.
4. Max Bramer, Principles of Data Mining, Springer, 2007.

[1]: Chapter 2.
[3]: Chapter 3.
[4]: Chapter 2.
