
Session 2: Data Preprocessing

ITEC5310- DATA MINING

Some figures are taken from two textbooks:


Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012.
Max Bramer, Principles of Data Mining, Springer, 2007.
Contents:

 Data Mining and Knowledge Discovery Process.


 Techniques/Problems in Data Mining
 Why data preprocessing?
 What does data preprocessing do?
Knowledge Discovery Process
• Knowledge Discovery Process (KDP): from data to knowledge.
• Many models of the KDP exist.

• Data mining: one step/component in the KDP.

• This DM course focuses on the main problems in DM, but the whole KDP must be kept in mind.
KDP Models
• A few KDP (Knowledge Discovery Process) models.
Estimate the time cost of each step in the KDP.
Why Preprocess the Data
• Objective: improve data quality  get better results in DM.
• Raw data:
• Structured, semi-structured, or unstructured.
• From many different sources.
• Requirements (data for DM):
• Accuracy: values are real and exact.
• Currency/Timeliness: not legacy data; useful at the present time.
• Completeness: all values of a feature are collected.
• Consistency: the same meaning is represented by a single value in all cases
(Male/Yes/1/Nam  one unique value).
Major Tasks in Data Preprocessing
• Data Cleaning:
• Missing Values
• Noisy Data…
• Data Integration
• Entity Identification Problem
• Redundancy and Correlation Analysis
• Tuple Duplication
• ..
• Data Reduction
• Dimensionality reduction
• Numerosity reduction
• Data Transformation
• Smoothing
• Attribute construction
• Aggregation, Normalization, Discretization
Data Cleaning
• Data always has errors: noise, invalid values, inconsistency  bad
results in Data Mining.
• Noise: exists in raw data; wrong but valid in form!
• 9.678 but 19.678
• BBT but ABB
• Invalid values: easy to detect!
• 26.7a8
• TpHHCM
• Inconsistency: one meaning but many representations!
• TpHCM, HCMC, SaiGon
• 25/7/2010, 07/25/2010
• Problem: detecting noise, invalid data, and inconsistent data in a
dataset with many rows and many features.
Data Cleaning (cont.)
• How to detect and correct?
• Use tools/applications.
• Write custom programs.
• Prediction and heuristics.
• Example
• An attribute only takes integer values in the range 0..5.
• An attribute only takes nominal values in the set: brown/blue/black/white.
• When errors are detected in raw data  do not erase the wrong data
immediately; try to find the cause of the mistake!
• Example
• An attribute allows a real value >= 200.
• Checking detects suspicious values: 22654.8, 38597 and 44625.7.
• Some values appear with abnormal frequencies.
• Checking detects: 25% of users selected Country = Albany!
• Abnormal objects in the dataset.
• Checking detects many students at the age of 127!
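A minimal sketch of such rule-based checking (the column names, rules, and data below are hypothetical, not from the slides):

```python
import pandas as pd

# Hypothetical table with two constrained attributes.
df = pd.DataFrame({"rating": [3, 7, 2],
                   "eye_color": ["blue", "green", "black"]})

# Rule 1: integer attribute restricted to the range 0..5.
bad_rating = ~df["rating"].between(0, 5)

# Rule 2: nominal attribute restricted to a fixed domain.
bad_color = ~df["eye_color"].isin({"brown", "blue", "black", "white"})

# Flag suspicious rows instead of deleting them,
# so the cause of the mistake can be investigated first.
print(df[bad_rating | bad_color])   # flags rating=7 and eye_color="green"
```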
Data Cleaning (cont.)
• Missing data: how to handle it.
• One or more features have no value in the raw data.
• Causes:
• Objective: the value did not exist at the time of data entry, incidents, etc.
• Subjective: equipment, human, etc.
• Handling:
• Remove tuples with missing data.
• Manually input the missing data.
• Use a global constant to describe it (unknown, −∞, etc.).
• Use attribute statistics: mean, median.
• Use mean/median values from attributes within the same class (see the sketch below).
• Use models to predict the value: regression, Bayes, etc.
• Preventing missing data:
• Good database design is needed.
• Set up data entry procedures (data constraints).
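A minimal pandas sketch of mean imputation, both global and per class (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data with a missing 'income' value; 'label' is the class.
df = pd.DataFrame({"label": ["A", "A", "B", "B"],
                   "income": [30.0, None, 50.0, 54.0]})

# Global mean imputation.
filled_global = df["income"].fillna(df["income"].mean())

# Per-class mean imputation: fill using the mean of the same class only.
filled_by_class = df["income"].fillna(
    df.groupby("label")["income"].transform("mean"))
print(filled_by_class)   # the missing A-row gets 30.0, the mean of class A
```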
Data Cleaning (cont.)
• Noisy Data:
• Values within the valid range that are nevertheless suspected to be errors.
• Causes: similar to the causes of missing data.
• Handling: smooth the data.
• Binning: also used for discretization.
• Used with numeric data.
• Sort the values and “smooth” each “block” (bin) using the values it contains.
• For example (see the sketch below):
• 9 values, sorted and divided into 3 buckets.
• Solution 1: smoothing by bin means.
• Solution 2: smoothing by bin medians.
• Solution 3: smoothing by bin boundaries.
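A numpy sketch of two of these strategies, using the nine-value price example from [3]:

```python
import numpy as np

# Nine sorted values split into 3 equal-frequency bins, as in [3].
values = np.sort([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = values.reshape(3, 3)

# Solution 1: smoothing by bin means.
by_means = np.repeat(bins.mean(axis=1), 3)   # [9 9 9 22 22 22 29 29 29]

# Solution 3: smoothing by bin boundaries - each value is replaced
# by the closer of its bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
print(by_bounds.ravel())   # [ 4  4 15 21 21 24 25 25 34]
```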
Data Cleaning (cont.)
• Noisy Data (cont.):
• Regression:
• Linear regression.
• Multivariate regression.
• Outlier analysis (see the sketch below).
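A hedged sketch combining the two ideas: fit a least-squares line, then flag points with unusually large residuals as candidate outliers (the data and the 2-sigma threshold are assumptions, not from the slides):

```python
import numpy as np

# Hypothetical roughly-linear data with one noisy point (30.0).
x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0, 12.1, 13.8, 16.2])

a, b = np.polyfit(x, y, deg=1)            # linear regression y ≈ a*x + b
residuals = y - (a * x + b)
outliers = np.abs(residuals) > 2 * residuals.std()
print(np.where(outliers)[0])              # [4] - the noisy value 30.0
```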
Data Integration
• Data Integration: the process of combining data from different sources.
• Problems:
Entity identification problem
 Schema integration
 Object matching
Data redundancy
Tuple duplication
Data value conflicts
• Requires: understanding the data structure, heterogeneity, and
semantics of the data.
• Results:
 Reduce/avoid redundancy and inconsistency in the data.
 Improve the speed and accuracy of DM.
Data Integration (cont.)
Entity identification problem
• Cause: data is integrated from multiple sources with diverse data types:
databases, data cubes, “flat files”.
• Schema integration: detect and unify data schemas.
• Object matching: identify duplicate attributes and information.
• Example
• MaNhanVien vs. MANV.
• Ethnic: Kinh/Hoa/… vs. 0/1/…
• Regions, special values (null, 0, …)
• …
• Metadata is needed to understand and correct such mismatches.
Data Integration (cont.)
 Redundancy: values of one feature depend on one or more other features.
 Use correlation analysis.
 Nominal attributes: use the Chi-Square (χ²) test:
χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ, where oᵢⱼ is the observed and eᵢⱼ the expected frequency.
 Numeric attributes: use the correlation coefficient and covariance.

Example (gender vs. preferred reading, from [3]):
Degrees of freedom: (r−1)×(c−1) = (2−1)×(2−1) = 1.
At significance level 0.001, the critical value is 10.828  10.828 < 507.93
Reject the independence hypothesis!
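A sketch reproducing this test with scipy, using the 2×2 contingency table from the corresponding example in [3] (gender as rows, preferred reading as columns):

```python
import numpy as np
from scipy.stats import chi2_contingency

#                 fiction  non-fiction
table = np.array([[250,     50],     # male
                  [200,   1000]])    # female

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof)   # ≈ 507.93, dof = 1
# p is far below 0.001, so the independence hypothesis is rejected:
# gender and preferred reading are correlated in this sample.
```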
Data Integration (cont.)
Redundancy (numeric attributes):

Correlation coefficient (Pearson's):
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n·σA·σB), with −1 ≤ r(A,B) ≤ 1;
the larger |r(A,B)|, the stronger the linear relationship.

Covariance:
Cov(A,B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n,
so r(A,B) = Cov(A,B) / (σA·σB).
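A numpy sketch computing both measures for two hypothetical numeric attributes:

```python
import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])      # hypothetical attribute A
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])   # hypothetical attribute B

r = np.corrcoef(a, b)[0, 1]           # Pearson correlation coefficient
cov = np.cov(a, b, bias=True)[0, 1]   # population covariance (divide by n)
print(round(r, 3), cov)               # r close to +1 suggests redundancy
```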
Data Integration (cont.)
Tuple duplication
• Besides attribute redundancy (one feature related to others), data can
repeat or be redundant at the tuple level.
• Examples (see the sketch below):
Using full name and address instead of a worker ID.
The same customer appearing with different addresses.
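A pandas sketch of detecting such near-duplicate tuples (the table, column names, and normalization rule are hypothetical):

```python
import pandas as pd

# The same customer appears twice with slightly different address strings.
df = pd.DataFrame({
    "fullname": ["An Nguyen", "An Nguyen", "Binh Tran"],
    "address":  ["12 Le Loi", "12 Le Loi St.", "5 Hai Ba Trung"],
})

# Exact duplicates: df.drop_duplicates(). Near-duplicates need
# normalization first, e.g. canonicalizing the address strings.
df["address_norm"] = (df["address"].str.lower()
                      .str.replace(r"\bst\.?$", "", regex=True)
                      .str.strip())
print(df.drop_duplicates(subset=["fullname", "address_norm"]))
```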
 Data value conflicts
• Causes: encoding, scaling, representation.
• Example:
Fahrenheit vs. Celsius.
GPA mark ranges (A..D vs. 1..4).
• Causes from semantics:
• Example:
Sales: the sales of a store (in table A) but of a whole area (in table B).
Data Reduction
• Data Reduction: a reduced representation of the data that preserves the
integrity of the original but is much smaller in volume.
• Techniques:
 Attribute subset selection: construct a subset of attributes (n' < n)
while retaining as much of the dataset's information as possible.

 Dimensionality reduction: e.g., Principal Component Analysis (PCA).

 Numerosity reduction: e.g., histograms, clustering, sampling.

 Data compression: lossless and lossy.
Data Reduction (cont.)

Feature selection: stepwise forward selection, stepwise backward elimination, …

Principal Component Analysis (see the sketch below).
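A hedged PCA sketch with scikit-learn, projecting hypothetical 4-dimensional data onto 2 principal components (all data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-dimensional data; column 2 is nearly a copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance kept by each component
```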


Data Transformation
• Data Transformation: transform/combine data into forms suitable for the
DM process.
• Data transformation techniques:
Data smoothing: using binning, regression, clustering.
Data aggregation: commonly used to create data cubes and to analyze
multi-level summary data (e.g., revenue by year-branch-product group-item).
Attribute/feature construction: to serve effective data mining (e.g., date
of birth → year of birth).
Hierarchical structuring for categorical attributes: Address → hierarchy
by street/district/city/…
Data Transformation (cont.)
• Data transformation techniques (cont.)
• Normalization: transform attribute values into a specified value range.
• Min-max normalization: [minA, maxA]  [newMinA, newMaxA]:
v' = (v − minA) / (maxA − minA) × (newMaxA − newMinA) + newMinA

• Z-score normalization: uses the mean and standard deviation:
v' = (v − Ā) / σA

• Decimal scaling: v' = v / 10^j,
where j is the smallest integer such that max(|v'ᵢ|) < 1.

Example: [−986, 917], max = 986, select j = 3 (because 10³ = 1000 > 986)  [−0.986, 0.917]
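A numpy sketch of the three normalizations (the sample vector is hypothetical; min-max targets [0, 1] here):

```python
import numpy as np

v = np.array([-986.0, -200.0, 300.0, 917.0])  # hypothetical attribute values

# Min-max normalization to [newMin, newMax] = [0, 1].
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: j is the smallest integer with max(|v|) / 10**j < 1
# (an exact power of ten would need j + 1).
j = int(np.ceil(np.log10(np.abs(v).max())))
print(j, v / 10**j)   # j = 3 -> [-0.986 -0.2  0.3  0.917]
```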
Data Transformation (cont.)
• Data transformation techniques (cont.)
• Discretization (of features)
Many methods/algorithms:

Discretization by binning.
Using histograms: equal-width, equal-frequency.
Using clustering, decision trees, correlation analysis.

Some algorithms use class information  better results, but only usable
with labeled data (classification problems).
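A pandas sketch of the two histogram-based flavors, equal-width and equal-frequency (the prices are the same nine-value example used above):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width discretization: 3 intervals of equal length.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency discretization: 3 intervals with ~equal counts.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```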
Preprocessing data with WEKA
Reading Chapter
1. Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine
Learning Tools and Techniques, Third Edition, Morgan Kaufmann, 2011.
2. David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining, MIT
Press, 2001
3. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Elsevier, 2012.
4. Max Bramer, Principles of Data Mining, Springer, 2007.

[1]: Chapter 2.
[3]: Chapter 3.
[4]: Chapter 2.
