Data Preparation
Understanding Data
• Types of attributes or fields
• Kind of values each attribute has
• Which attributes are discrete, and which are continuous-valued?
• What do the data look like?
• How are the values distributed?
• Are there ways we can visualize the data?
• Can we spot any outliers?
• Can we measure the similarity of some data objects with respect to others?
• Knowing basic statistics about the data
Understand the Current Schema/Data
To understand one attribute:
• min, max, avg, histogram, amount of missing values
• value range
• data type, length of values, etc.
• synonyms, formats
To understand the relationship between two attributes:
• various plots
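To make these checks concrete, here is a minimal pandas sketch, assuming a hypothetical price attribute (the values are borrowed from the binning example later in this section); the DataFrame and column name are illustrative, not part of the original slides.

```python
import pandas as pd

# Hypothetical attribute; replace with your own data.
df = pd.DataFrame({"price": [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, None]})

# min, max, avg, quartiles for one attribute
print(df["price"].describe())

# amount of missing values and the attribute's data type
print("missing:", df["price"].isna().sum())
print("dtype:", df["price"].dtype)

# distinct values and their frequencies (helps spot synonyms and format variations)
print(df["price"].value_counts().head())
```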
Why
• Knowing such basic statistics about each attribute makes it easier to fill in missing values, smooth noisy values, and spot outliers during data preprocessing.
• Knowledge of the attributes and attribute values can also help in fixing inconsistencies incurred during data integration.
• Plotting the measures of central tendency shows us whether the data are symmetric or skewed.
• Quantile plots, histograms, and scatter plots are other graphic displays of basic statistical descriptions.
• These can all be useful during data preprocessing and can provide insight into areas for mining.
• The field of data visualization provides many additional techniques for viewing data through graphical means.
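As a hedged illustration of these graphic displays, the sketch below draws a histogram, a box plot, and a simple quantile plot for a hypothetical numeric attribute; the data and column name are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric attribute.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], name="price")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: shows how the values are distributed (symmetric vs. skewed).
axes[0].hist(prices, bins=5)
axes[0].set_title("Histogram")

# Box plot: shows median, quartiles, and potential outliers.
axes[1].boxplot(prices)
axes[1].set_title("Box plot")

# Quantile plot: sorted values against their f-values (i - 0.5) / n.
f = (np.arange(1, len(prices) + 1) - 0.5) / len(prices)
axes[2].plot(f, np.sort(prices.values), marker="o")
axes[2].set_title("Quantile plot")

plt.tight_layout()
plt.show()
```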
Types of Data
• Datasets are made up of data objects and attributes.
• A data object represents an entity:
  • In a sales database, the objects may be customers, store items, and sales;
  • In a medical database, the objects may be patients;
  • In a university database, the objects may be students, professors, and courses.
Attribute
• An attribute is a data field, representing a characteristic or feature of a data object.
• Attributes describing a customer object include, for example, customer ID, name, and address.
• Observed values for a given attribute are known as observations.
• An attribute vector (or feature vector) is the set of attributes describing an object.
• Types of attributes: nominal, binary (symmetric and asymmetric), ordinal, numeric.
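A minimal sketch, assuming a small hypothetical customer table, of how these attribute types can be represented in pandas: nominal as an unordered category, binary as 0/1, ordinal as an ordered category, and numeric as int/float.

```python
import pandas as pd

# Hypothetical customer data illustrating the four attribute types.
df = pd.DataFrame({
    "hair_color": ["black", "brown", "blond"],   # nominal
    "is_smoker": [0, 1, 0],                      # binary (asymmetric)
    "size": ["small", "large", "medium"],        # ordinal
    "income": [42000.0, 55000.0, 61000.0],       # numeric
})

# Declare the nominal and ordinal attributes explicitly.
df["hair_color"] = df["hair_color"].astype("category")
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df.dtypes)         # inspect the declared/inferred attribute types
print(df["size"].min())  # ordering is meaningful only for ordinal/numeric attributes
```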
Identify Data Problems
Data Quality Problems:
• missing values
• incorrect values, illegal values, outliers
• synonyms
• misspellings
• conflicting data (e.g., age and birth year)
• wrong value formats
• variations of values
• duplicate tuples
Data Quality Dimensions:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: updated in a timely manner
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
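A hedged sketch of how a few of these problems could be surfaced with pandas; the column names, the age/birth-year consistency rule, and the reference year are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records containing several of the problems listed above.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Carol"],
    "age": [42, 42, -10, 35],                 # -10 is an illegal value
    "birth_year": [2010, 2010, 1990, 1988],   # 2010 conflicts with age = 42
    "salary": [50000, 50000, None, 61000],    # missing value
})

print(df.isna().sum())                                      # missing values per attribute
print(df[(df["age"] < 0) | (df["age"] > 120)])              # illegal values (simple range rule)
print(df[(2024 - df["birth_year"] - df["age"]).abs() > 1])  # conflicting age vs. birth year
print(df[df.duplicated()])                                  # duplicate tuples
```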
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
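Normalization appears under data transformation but is not expanded later in this section, so here is a minimal sketch of two common forms, min-max and z-score normalization; the column name and target range are assumptions.

```python
import pandas as pd

# Hypothetical numeric attribute to normalize.
df = pd.DataFrame({"income": [12000.0, 35000.0, 47000.0, 98000.0]})

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
lo, hi = df["income"].min(), df["income"].max()
df["income_minmax"] = (df["income"] - lo) / (hi - lo)

# Z-score normalization: v' = (v - mean) / std
df["income_zscore"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```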
1. Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., Occupation = “ ” (missing data)
• noisy: containing noise, errors, or outliers
  e.g., Salary = “−10” (an error)
• inconsistent: containing discrepancies in codes or names
  e.g., Age = “42”, Birthday = “03/07/2010”
  was rating “1, 2, 3”, now rating “A, B, C”
  discrepancy between duplicate records
• intentional (e.g., disguised missing data)
  Jan. 1 as everyone’s birthday?
A. Incomplete (Missing) Data
Data is not always available
• e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
• equipment malfunction
• inconsistency with other recorded data, leading to deletion
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with:
  • a global constant, e.g., “unknown” (a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class (smarter)
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree
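A minimal pandas sketch of the automatic filling strategies above; the column and class names are hypothetical.

```python
import pandas as pd

# Hypothetical data with a missing income value.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [30000.0, None, 52000.0, 48000.0],
})

# 1. Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 2. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Fill with the attribute mean of samples in the same class (smarter)
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```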
B. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Other data problems which require data cleaning:
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
• Regression: smooth by fitting the data to regression functions
• Clustering: detect and remove outliers
Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
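A short sketch reproducing the equal-frequency partition, smoothing by bin means, and smoothing by bin boundaries from the example above, using plain NumPy.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.array_split(prices, 3)  # equal-frequency (equi-depth) partition

# Smoothing by bin means: replace every value by its (rounded) bin mean.
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]
print([b.tolist() for b in by_means])   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value by the closer of min/max.
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
print([b.tolist() for b in by_bounds])  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```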
2. Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Tuple duplication
• There are two or more identical tuples for a given unique data entry case
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
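A hedged sketch of combining two hypothetical sources whose customer keys are named differently (the A.cust-id vs. B.cust-# situation above); all table and column names are assumptions.

```python
import pandas as pd

# Two hypothetical sources with differently named keys for the same entity.
sales_a = pd.DataFrame({"cust_id": [1, 2, 3], "total": [100, 250, 80]})
sales_b = pd.DataFrame({"cust_num": [2, 3, 4], "region": ["EU", "US", "US"]})

# Schema integration: align the key attributes, then merge into one coherent store.
merged = sales_a.merge(sales_b, left_on="cust_id", right_on="cust_num", how="outer")

# Tuple duplication: drop exact duplicate rows after integration.
merged = merged.drop_duplicates()

print(merged)
```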
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
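A minimal sketch of correlation and covariance analysis for spotting redundant numeric attributes; the attribute names and values are hypothetical.

```python
import pandas as pd

# Hypothetical table where annual_revenue is (almost) 12 * monthly_revenue.
df = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue": [120, 240, 365, 480],
    "employees": [3, 9, 5, 7],
})

print(df.corr())  # Pearson correlation: values near +/-1 suggest redundant attributes
print(df.cov())   # covariance: the scale-dependent counterpart of correlation
```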
3. Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis may take a very long time to run on the complete data set
• When dimensionality increases, data become increasingly sparse
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  • The number of possible combinations of subspaces grows exponentially
• Reduce the time and space required for data mining
• Allow easier visualization
Data reduction strategies
• Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
• Compression, where dimensionality reduction and numerosity reduction are considered forms of data compression.
• Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels.
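A hedged sketch of two of these strategies, numerosity reduction by random sampling and discretization of a numeric attribute into ranges; the sampling fraction, bin edges, and labels are assumptions.

```python
import pandas as pd

# Hypothetical numeric attribute.
df = pd.DataFrame({"price": [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]})

# Numerosity reduction: keep a 50% random sample instead of all tuples.
sample = df.sample(frac=0.5, random_state=0)

# Discretization: replace raw prices with ranges (one level of a concept hierarchy).
df["price_range"] = pd.cut(df["price"], bins=[0, 10, 20, 30, 40],
                           labels=["very low", "low", "medium", "high"])

print(sample)
print(df)
```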
Attribute Subset Selection
Attribute subset selection reduces the dataset size by removing
irrelevant or redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using
all attributes.
Mining on a reduced set of attributes has an additional benefit of
reducing the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
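A minimal sketch of attribute subset selection with scikit-learn's univariate SelectKBest; the synthetic data and the choice of k are assumptions, and other search strategies (e.g., stepwise forward selection or backward elimination) are equally valid.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 attributes, only a few of which carry class information.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

# Keep the 3 attributes that score highest against the class label.
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print("kept attribute indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # (200, 3)
```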
Lab Exercise
• Practice the different data preprocessing tasks in Python