
Data Preparation

1
Understanding Data

What are the types of the attributes or fields?

What kinds of values does each attribute have?

Which attributes are discrete, and which are continuous-valued?

What do the data look like?

How are the values distributed?

Are there ways we can visualize the data?

Can we spot any outliers?

Can we measure the similarity of some data objects with respect to
others?

Knowing basic statistics about the data helps answer these questions (see the
sketch below).
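
A minimal sketch in pandas, using a hypothetical toy DataFrame (the column
names and values below are illustrative, not from these slides):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, None, 41, 29],                  # numeric, one missing value
    "income": [30_000, 52_000, 47_000, None, 39_000],  # numeric, one missing value
    "city":   ["Addis Ababa", "Bahir Dar", "Addis Ababa", "Gondar", None],
})

print(df.dtypes)            # data type of each attribute
print(df.describe())        # min, max, mean, quartiles of numeric attributes
print(df.isna().sum())      # number of missing values per attribute
print(df["city"].unique())  # distinct values of a discrete attribute
```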

2
Understanding Data

3
Understand the Current Schema/Data
 To understand one attribute:
 min, max, avg, histogram, number of missing values, value range
 data type, length of values, etc.
 synonyms, formats
 To understand the relationship between two attributes:
 various plots

4
Why

Knowing such basic statistics regarding each attribute makes it easier to fill
missing values, smooth noisy values, and spot outliers during data
preprocessing.

Knowledge of the attributes and attribute values can also help in fixing
inconsistencies incurred during data integration.

Plotting the measures of central tendency shows us if the data are
symmetric or skewed.

Quantile plots, histograms, and scatter plots are other graphic displays of
basic statistical descriptions.

These can all be useful during data preprocessing and can provide insight
into areas for mining.

The field of data visualization provides many additional techniques for
viewing data through graphical means.
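
A hedged sketch of these displays using matplotlib on the toy df from the
earlier example; the ordered-values plot only approximates a true quantile
plot:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(df["income"].dropna(), bins=5)   # histogram
axes[0].set_title("Histogram of income")

income = df["income"].dropna().sort_values()  # ordered values stand in
axes[1].plot(income.values, marker="o")       # for a quantile plot
axes[1].set_title("Ordered income values")

axes[2].scatter(df["age"], df["income"])      # scatter plot
axes[2].set_title("age vs. income")

plt.tight_layout()
plt.show()
```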
5
Types of Data


Datasets are made up of data objects and attributes.

A data object represents an entity:

In a sales database, the objects may be customers, store
items, and sales;

In a medical database, the objects may be patients;

In a university database, the objects may be students,
professors, and courses.

6
Attribute

An attribute is a data field, representing a characteristic or feature of a data
object.

Attributes describing a customer object include, for example, customer ID, name,
and address.

Observed values for a given attribute are known as observations.

An attribute vector (or feature vector) is the set of attributes describing an object.

Types of attributes:

nominal, binary (symmetric and asymmetric), ordinal, numeric

7
Identify Data Problems
 Data Quality Problems
 missing values
 incorrect values, illegal values, outliers
 synonyms
 misspellings
 conflicting data (e.g., age and birth year)
 wrong value formats
 variations of values
 duplicate tuples
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
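
Several of these problems can be screened for mechanically; a minimal sketch
on the toy df from earlier (the specific checks are illustrative):

```python
print(df.isna().sum())            # missing values per attribute
print(df[df["age"] < 0])          # illegal values, e.g., negative ages
print(df[df.duplicated()])        # duplicate tuples
print(df["city"].value_counts())  # spot synonyms and misspellings by eye
```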

8
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

9
1. Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., from
faulty instruments, human or computer error, or transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation = “” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age = “42”, Birthday = “03/07/2010”
was rating “1, 2, 3”, now rating “A, B, C”
discrepancies between duplicate records
intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?

10
A. Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of
entry
history or changes of the data not registered
Missing data may need to be inferred

11
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually: tedious, and often infeasible
Fill it in automatically with
a global constant: e.g., “unknown” (which may form a new class!)
the attribute mean
the attribute mean for all samples belonging to the same
class: smarter
the most probable value: inference-based, e.g., a Bayesian
formula or a decision tree
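
Hedged sketches of the automatic strategies on the toy df from earlier; the
'class' column in the last variant is hypothetical:

```python
df_const = df.fillna({"city": "unknown"})  # global constant for a nominal attribute

df_mean = df.copy()                        # attribute mean for a numeric attribute
df_mean["income"] = df_mean["income"].fillna(df_mean["income"].mean())

# Attribute mean per class (assumes a hypothetical 'class' column exists):
# df["income"] = (df.groupby("class")["income"]
#                   .transform(lambda s: s.fillna(s.mean())))
```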

12
B. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistent naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data to regression functions (see the sketch after
this list)
Clustering
detect and remove outliers
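
As a minimal sketch of the regression option, fit a least-squares line to
hypothetical toy data and replace each value with its fitted value:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0, 12.1])  # 30.0 looks like noise

slope, intercept = np.polyfit(x, y, deg=1)      # fit a least-squares line
y_smooth = slope * x + intercept                # smoothed (fitted) values
print(np.round(y_smooth, 1))
```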

14
Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
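
A minimal sketch reproducing this worked example in plain Python (the
rounding and tie-breaking choices are assumptions):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins        # equal-frequency: 4 values per bin
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```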

15
2. Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Tuple duplication
• There are two or more identical tuples for a given unique data entry
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
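
A hedged sketch of these issues in pandas; the sources, column names, and
conversion factor are hypothetical:

```python
import pandas as pd

a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})
b = pd.DataFrame({"cust_no": [2, 3], "weight_lb": [181.9, 150.0]})

# Schema integration: map B.cust_no onto A.cust_id before merging.
b = b.rename(columns={"cust_no": "cust_id"})
# Resolve the value conflict: convert British units (pounds) to metric (kg).
b["weight_kg"] = b.pop("weight_lb") * 0.4536

merged = pd.merge(a, b, on="cust_id", how="outer", suffixes=("_a", "_b"))
print(merged)
```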
16
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple
databases
• Object identification: the same attribute or object may
have different names in different databases
• Derivable data: one attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
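
A minimal sketch of correlation-based redundancy detection on hypothetical
toy data; an attribute pair with |r| near 1 is a removal candidate:

```python
import pandas as pd

sales = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue":  [120, 240, 360, 480],  # derivable: 12 * monthly
    "num_employees":   [5, 3, 9, 4],
})
print(sales.corr())  # monthly vs. annual revenue correlate perfectly (r = 1.0)
```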
17
3. Data Reduction
Data reduction: obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
Why data reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data
set.
When dimensionality increases, data become increasingly sparse.
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful.
The possible combinations of subspaces grow exponentially.
Reduce time and space required in data mining
Allow easier visualization

18
Data reduction strategies

• Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
• Compression: dimensionality and numerosity reduction are considered
forms of data compression.
• Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels.
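
Hedged sketches of two of these strategies, assuming scikit-learn is
available; the data and parameter choices are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # toy data: 1000 rows, 10 attributes

# Dimensionality reduction: encode the 10 attributes into 3 components.
X_reduced = PCA(n_components=3).fit_transform(X)

# Numerosity reduction: keep a 10% simple random sample of the rows.
sample = X[rng.choice(len(X), size=100, replace=False)]
print(X_reduced.shape, sample.shape)  # (1000, 3) (100, 10)
```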

19
Attribute Subset Selection
Attribute subset selection reduces the dataset size by removing
irrelevant or redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using
all attributes.
Mining on a reduced set of attributes has an additional benefit of
reducing the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
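
One possible sketch using scikit-learn's SelectKBest for univariate attribute
scoring (greedy stepwise wrapper methods are a common alternative; k=2 is an
illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.get_support())      # boolean mask over the original attributes
X_reduced = selector.transform(X)  # dataset restricted to the best 2 attributes
```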

20
Lab Exercise

• Practice the different data preprocessing tasks in Python

21
