Content
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Why preprocess the data?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected
and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality
data
Data extraction, cleaning, and transformation comprise
the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Broad categories:
Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based such as Bayesian formula
or decision tree
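A minimal sketch of the automatic fill-in strategies listed above, assuming pandas as the tool; the toy table and its column names (income, class) are illustrative, not from the slides:

```python
# A minimal sketch of the automatic fill-in strategies above, using pandas
# (library choice and column names are illustrative).
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 47000, 61000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# 1) Global constant
filled_const = df["income"].fillna(-1)        # or "unknown" for nominal data

# 2) Attribute mean over all tuples
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Attribute mean per class (smarter: uses the class label)
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean.tolist())
```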
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal
with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately same
number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
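The example above can be reproduced with a short sketch in plain Python; the bin count and the rounding of bin means are chosen to match the slide:

```python
# Sketch of equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means (rounded, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```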
Regression
Data can be smoothed by fitting it to a function, for example with linear regression.
[Figure: scattered data points smoothed onto the fitted line y = x + 1]
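A minimal sketch of regression-based smoothing, assuming numpy and illustrative data points; only the idea of replacing noisy values by their fitted values is taken from the slide:

```python
# Minimal sketch: smooth noisy y-values by projecting them onto a fitted line
# (numpy is an assumed choice; the slide only shows the idea of y = x + 1).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.8, 3.3, 3.9, 5.1])      # noisy observations

w, b = np.polyfit(x, y, deg=1)               # fit y ~ w*x + b
y_smooth = w * x + b                         # replace each y by its fitted value
print(w, b)
```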
Cluster Analysis
detect and remove outliers
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
Object identification: The same attribute or object may have
different names in different databases
Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes may be detected by correlation
analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson’s product moment
coefficient)
$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$
where n is the number of tuples, \bar{A} and \bar{B} are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ a_i b_i is the sum of the AB cross-products.
If r_{A,B} > 0, A and B are positively correlated (A's values increase as
B's do); the higher the value, the stronger the correlation
r_{A,B} = 0: no linear correlation between A and B
r_{A,B} < 0: A and B are negatively correlated
A high correlation value indicates that A (or B) may be removed as a redundancy
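A small sketch of the correlation coefficient as defined above, assuming numpy and illustrative attribute values:

```python
# Sketch of the correlation coefficient r_{A,B} defined above (numpy assumed).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via numpy's built-in
```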
Positively and Negatively Correlated Data
Not Correlated Data
Correlation Analysis (Categorical Data)
Χ2 (chi-square) test
$$ \chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}} $$
where the expected frequency is Expected = count(A = a_i) × count(B = b_j) / n
Chi-Square Calculation: An Example
E11=count(male)*count(fiction)/n=300*450/1500=90
$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$
Χ2 (chi-square) calculation (the numbers in parentheses in the contingency
table are the expected counts, calculated from the data distribution in the two categories)
The result shows that gender and preferred reading are strongly correlated in the given group
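The χ2 value above can be reproduced with a short sketch; the 2×2 contingency table (gender vs. preferred reading) is reconstructed from the observed counts shown on the slide:

```python
# Sketch reproducing the chi-square value above; the 2x2 contingency table
# (gender vs. preferred reading) is taken from the observed counts on the slide.
observed = {("male", "fiction"): 250, ("male", "non_fiction"): 50,
            ("female", "fiction"): 200, ("female", "non_fiction"): 1000}

n = sum(observed.values())                                   # 1500
row_tot = {g: sum(v for (gg, _), v in observed.items() if gg == g) for g in ("male", "female")}
col_tot = {r: sum(v for (_, rr), v in observed.items() if rr == r) for r in ("fiction", "non_fiction")}

chi2 = sum((obs - row_tot[g] * col_tot[r] / n) ** 2 / (row_tot[g] * col_tot[r] / n)
           for (g, r), obs in observed.items())
print(chi2)                                                  # about 507.93, as on the slide
```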
Covariance of Numeric Data
Correlation and covariance are two similar measures for assessing how much
two attributes change together. The mean values of A and B, respectively, are
also known as the expected values of A and B, that is, E(A) = \bar{A} and E(B) = \bar{B}, and
$$ \mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}) $$
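A small sketch of the covariance definition above, assuming numpy and illustrative values:

```python
# Sketch of covariance as defined above (numpy assumed; data is illustrative).
import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov = ((A - A.mean()) * (B - B.mean())).mean()   # E[(A - E[A])(B - E[B])]
print(cov)
print(np.cov(A, B, bias=True)[0, 1])             # same value (population covariance)
```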
Data Transformation
Smoothing: remove noise from the data; techniques include binning, regression,
and clustering
Aggregation: summarization, e.g., annual sales amounts can be aggregated
from monthly sales; data cube construction supports analysis at multiple
abstraction levels
Concept hierarchy generation for nominal data: where attributes such
as street can be generalized to higher-level concepts, like city or country
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
$$ v' = \frac{v - \bar{A}}{\sigma_A} $$
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
(73,600 − 54,000)/16,000 = 1.225
Normalization (contd.)
Normalization by decimal scaling
$$ v' = \frac{v}{10^{\,j}} $$
where j is the smallest integer such that max(|v'|) < 1
Example: recorded values range from −722 to 821, so j = 3 and each value is divided by 10^3 = 1,000
−28 normalizes to −0.028
444 normalizes to 0.444
Practice example: normalize 200, 300, 400, 600, 1000 to the range [min = 0, max = 1] (see the sketch below)
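A combined sketch of the three normalization methods in plain Python, reusing the income example above as a check:

```python
# Sketch of the three normalization methods above; the income example from the
# min-max and z-score slides is reused as a check.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1                                   # smallest j with max(|v'|) < 1
    return [v / (10 ** j) for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-722, 821, -28, 444]))   # [-0.722, 0.821, -0.028, 0.444]
```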
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining on huge amounts of data may take a very
long time to run on the complete data set, making such analysis
impractical or infeasible
Data reduction
Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results
Data reduction strategies
Dimensionality reduction — e.g., remove unimportant attributes
Data Compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation
Data cube aggregation
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the
task
Queries regarding aggregated information should be answered
using data cube, when possible
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random
variables or attributes under consideration. Dimensionality reduction methods
include:
Attribute subset selection, in which irrelevant, weakly relevant, or
redundant attributes or dimensions are detected and removed
Wavelet transforms and principal components analysis, which
transform or project the original data onto a smaller space.
Attribute subset selection
Reduces the data set size by removing irrelevant or redundant attributes (or
dimensions).
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of
different classes given the values for those features is as close as possible to
the original distribution given the values of all features
reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods (the best and worst attributes are determined using various
measures, e.g., greedily):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
Attribute subset selection
These methods are typically greedy in that, while searching through attribute
space, they always make what looks to be the best choice at the time. Their
strategy is to make a locally optimal choice in the hope that this will lead to a
globally optimal solution
Attribute subset selection
The stopping criteria for the methods may vary. The procedure
may employ a threshold on the measure used to determine when
to stop the attribute selection process.
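A minimal sketch of stepwise forward selection with a threshold-based stopping criterion; the scoring function is a caller-supplied placeholder (e.g., a validation accuracy or an information-gain measure), and the toy usage is purely illustrative:

```python
# Minimal sketch of stepwise forward selection with a threshold stopping
# criterion (the scoring function `score` is a caller-supplied placeholder;
# nothing here is a fixed API).
def forward_selection(attributes, score, min_improvement=1e-3):
    selected, best = [], score([])
    remaining = list(attributes)
    while remaining:
        # greedily pick the attribute that improves the score the most
        cand, cand_score = max(((a, score(selected + [a])) for a in remaining),
                               key=lambda t: t[1])
        if cand_score - best < min_improvement:   # stopping criterion: no real gain
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy usage: the "score" is simply how many useful attributes are included
useful = {"age", "income"}
print(forward_selection(["age", "income", "shoe_size"],
                        lambda attrs: len(useful & set(attrs))))
# -> ['age', 'income']
```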
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequences are not like audio
Typically short and varying slowly with time
Data Compression
[Figure: lossless compression reconstructs the original data exactly; lossy compression reconstructs only an approximation of the original data]
Dimensionality Reduction: Wavelet Transformation
The discrete wavelet transform (DWT) is a linear signal processing technique
that, when applied to a data vector X, transforms it to a numerically different
vector, X’, of wavelet coefficients
we consider each tuple as an n-dimensional data vector, that is, X =(x1,x2,…,xn),
depicting n measurements made on the tuple from n database attributes.
A compressed approximation of the data can be retained by storing only a
small fraction of the strongest of the wavelet coefficients. For example, all
wavelet coefficients larger than some user-specified threshold can be retained.
Similar to discrete Fourier transform (DFT), but better lossy compression,
localized in space
Wavelet Transformation (contd.)
Method:
Length, L, of the input data vector must be an integer power of 2. (padding
with 0’s, when necessary)
Each transform applies two functions: a smoothing function and a difference
function, which acts to bring out the detailed features of the data.
The functions are applied to pairs of data points in X, resulting in two sets of
data of length L/2. In general, these represent a smoothed (low-frequency) version of
the input data and its high-frequency content, respectively.
The two functions are applied recursively until the desired length is reached
Selected values from the data sets obtained in the previous iterations are
designated the wavelet coefficients of the transformed data.
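A sketch of the pairwise smoothing/difference procedure described above, written as an unnormalized Haar-style transform; real DWT implementations differ in how the coefficients are scaled:

```python
# Sketch of the pairwise smoothing/difference procedure described above,
# i.e. an (unnormalized) Haar-style DWT; library DWTs differ in scaling.
def haar_dwt(x):
    assert (len(x) & (len(x) - 1)) == 0, "length must be a power of 2 (pad with 0s)"
    coeffs = []
    while len(x) > 1:
        smooth = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]   # low frequency
        detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]   # high frequency
        coeffs = detail + coeffs       # keep the detail coefficients of this level
        x = smooth                     # recurse on the smoothed half-length signal
    return x + coeffs                  # overall average followed by all details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```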
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
Steps
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (using the strongest principal components, it is possible
to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
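A sketch of the PCA steps above using numpy's SVD; the data, the choice of k, and the use of SVD (rather than an eigendecomposition of the covariance matrix) are illustrative assumptions:

```python
# Sketch of PCA via numpy's SVD, following the steps above (k and the data are
# illustrative; real pipelines typically use a library such as scikit-learn).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                           # normalize: center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False) # rows of Vt = principal components
k = 1                                             # keep only the strongest component
X_reduced = Xc @ Vt[:k].T                         # project onto the top-k components
X_approx = X_reduced @ Vt[:k] + X.mean(axis=0)    # reconstruct an approximation
print(X_reduced.shape, X_approx.shape)            # (10, 1) (10, 2)
```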
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
These techniques may be:
Parametric -- a model is used to estimate the data, so that typically
only the model parameters need to be stored, instead of the actual
data
Regression and log-linear models
Nonparametric
Histograms
Clustering
Sampling and
Data cube aggregation
Regression and log-linear models
Regression and log-linear models can be used to approximate the given
data.
In linear regression, the data are modeled to fit a straight line. For
example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
y = w x + b
In the context of data mining, x and y are numeric database attributes.
The coefficients w and b are called the regression coefficients.
Multiple linear regression is an extension of (simple) linear
regression, which allows a response variable, y, to be modeled as a
linear function of two or more predictor variables.
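A small sketch of multiple linear regression as a parametric reduction: the raw (x, y) tuples can be replaced by the fitted coefficients (numpy and the toy data are assumptions):

```python
# Sketch of multiple linear regression via least squares: y is modeled from two
# predictors, and only the three coefficients need to be stored (numpy assumed).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 15.0])

A = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [w1, w2, b]
print(coef)
y_hat = A @ coef                               # data approximated from the model alone
```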
Regression and log-linear models
Log-linear models approximate discrete multidimensional probability
distributions
Given a set of tuples in n dimensions (e.g., described by n attributes),
we can consider each tuple as a point in an n-dimensional space.
Log-linear models can be used to estimate the probability of each
point in a multidimensional space for a set of discretized attributes,
based on a smaller subset of dimensional combinations.
Log-linear models are therefore also useful for dimensionality
reduction (since the lower-dimensional points together typically
occupy less space than the original data points)
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: width of each bucket range is uniform
Equal-frequency (or equal-depth) :each bucket contains roughly the
same number of contiguous samples
Histograms are highly effective at approximating both sparse and dense
data, as well as highly skewed and uniform data. The histograms for
single attributes can be extended for multiple attributes.
Multidimensional histograms can capture dependencies between
attributes.
The following data are a list of AllElectronics prices for
commonly sold items (rounded to the nearest dollar).
The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
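A sketch of an equal-width histogram over the price list above; the choice of three buckets is illustrative:

```python
# Sketch of an equal-width histogram over the price list above: store only the
# bucket boundaries and counts (and optionally sums) instead of the raw values.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
          20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

n_buckets, lo, hi = 3, min(prices), max(prices)
width = (hi - lo) / n_buckets
buckets = [[] for _ in range(n_buckets)]
for p in prices:
    i = min(int((p - lo) / width), n_buckets - 1)   # last bucket is right-closed
    buckets[i].append(p)

for i, b in enumerate(buckets):
    print(f"[{lo + i * width:.2f}, {lo + (i + 1) * width:.2f}]: count={len(b)}, sum={sum(b)}")
```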
Clustering
Clustering techniques consider data tuples as objects.
They partition the objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to objects
in other clusters.
Similarity is commonly defined in terms of how “close” the objects are
in space, based on a distance function.
The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is
defined as the average distance of each cluster object from the cluster
centroid
Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random data sample
Suppose that a large data set, D, contains N tuples. Let’s look at the
most common ways that we could sample D for data reduction:
Simple random sample without replacement
Simple random sample with replacement
Cluster sample
Stratified sampling:
Approximate the percentage of each class (or subpopulation of
interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
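A sketch of the three sampling schemes above using Python's random module; the data set and strata are illustrative:

```python
# Sketch of the sampling schemes above using Python's random module
# (data and strata are illustrative).
import random
from collections import defaultdict

D = list(range(100))                          # N = 100 "tuples"
s = 10

srswor = random.sample(D, s)                  # simple random sample without replacement
srswr = [random.choice(D) for _ in range(s)]  # simple random sample with replacement

# Stratified sampling: sample each class in proportion to its size
labels = ["A"] * 90 + ["B"] * 10              # skewed class distribution
strata = defaultdict(list)
for t, c in zip(D, labels):
    strata[c].append(t)
stratified = []
for c, tuples in strata.items():
    k = max(1, round(s * len(tuples) / len(D)))
    stratified.extend(random.sample(tuples, k))
print(len(srswor), len(srswr), len(stratified))
```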
Sampling (continued)
[Figure: simple random sampling from the raw data, without replacement (once selected, a tuple cannot be selected again) and with replacement (a selected tuple can be selected again)]
Sampling (continued)
If the data are clustered or stratified, perform a simple random sample
(with or without replacement) within each cluster or stratum
[Figure: raw data vs. cluster/stratified sample]
Discretization
Three types of attributes:
Nominal — values from an unordered set, e.g., color, profession
Ordinal — values from an ordered set, e.g., military or academic
rank
Continuous — numeric values, e.g., integers or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level
concepts (such as numeric values for age) by higher level concepts
(such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods: All the methods can be applied recursively
Binning (covered above)
Top-down split, unsupervised,
Histogram analysis (covered above)
Top-down split, unsupervised
Clustering analysis (covered above)
Either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ2 analysis: unsupervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2
using boundary T, the information gain after partitioning is
$$ I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) $$
Entropy is calculated based on class distribution of the samples in the set.
Given m classes, the entropy of S1 is
$$ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) $$
where pi is the probability of class i in S1
The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization
The process is recursively applied to partitions obtained until some
stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
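A minimal sketch of one entropy-based split: each candidate boundary T is evaluated with I(S, T) as defined above and the minimizing boundary is returned (values and class labels are illustrative; the recursion and stopping criterion are omitted):

```python
# Sketch of a single entropy-based split: try each candidate boundary T and
# keep the one minimizing I(S, T) as defined above.
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate boundary
        s1 = [c for v, c in pairs if v <= t]
        s2 = [c for v, c in pairs if v > t]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

print(best_split([1, 2, 3, 10, 11, 12], ["no", "no", "no", "yes", "yes", "yes"]))
# -> boundary 6.5 with I(S, T) = 0.0
```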
Interval Merge by χ2 Analysis
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form larger
intervals recursively
ChiMerge
Initially, each distinct value of a numerical attr. A is considered to be one
interval
χ2 tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ2 values are merged together, since low
χ2 values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max
inconsistency, etc.)
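A compact ChiMerge sketch following the description above: each interval keeps per-class counts, and the adjacent pair with the lowest χ2 value is merged until a maximum number of intervals remains (the stopping criterion and the data are illustrative):

```python
# Compact ChiMerge sketch: each interval keeps per-class counts; repeatedly
# merge the adjacent pair with the lowest chi-square value.
from collections import Counter

def chi2_pair(c1, c2):
    classes = set(c1) | set(c2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    total = n1 + n2
    chi2 = 0.0
    for c in classes:
        col = c1.get(c, 0) + c2.get(c, 0)
        for obs, n in ((c1.get(c, 0), n1), (c2.get(c, 0), n2)):
            exp = n * col / total
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2

def chimerge(values, labels, max_intervals=3):
    data = list(zip(values, labels))
    # start with one interval per distinct value
    intervals = [[v, v, Counter(c for vv, c in data if vv == v)]
                 for v in sorted(set(values))]
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][2], intervals[i + 1][2])
                for i in range(len(intervals) - 1)]
        i = chis.index(min(chis))                   # most similar neighbours
        lo, _, c1 = intervals[i]
        _, hi, c2 = intervals[i + 1]
        intervals[i:i + 2] = [[lo, hi, c1 + c2]]    # merge them
    return [(lo, hi) for lo, hi, _ in intervals]

print(chimerge([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59],
               ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C"]))
# -> [(1, 9), (11, 39), (45, 59)]
```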
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data
into relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals
Example of 3-4-5 Rule
Step 1: from the sorted profit data, Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: the most significant digit is msd = 1,000, so Low is rounded down to -$1,000 and High is rounded up to $2,000, giving the range (-$1,000, $2,000)
Step 3: the range covers 3 distinct values at the msd, so it is partitioned into 3 equi-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
Step 4: adjusting for Min and Max, the first interval shrinks to (-$400, 0] and a new interval ($2,000, $5,000] is added, giving the top level (-$400, 0], (0, $1,000], ($1,000, $2,000], ($2,000, $5,000]
Finally, each top-level interval is partitioned recursively: (-$400, 0] into four $100 intervals, (0, $1,000] into five $200 intervals, ($1,000, $2,000] into five $200 intervals, and ($2,000, $5,000] into three $1,000 intervals
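A minimal sketch of a single 3-4-5 partitioning step, assuming low and high have already been rounded at the most significant digit as in Steps 2 and 3 above; the recursive subdivision is omitted:

```python
# Minimal sketch of one step of the 3-4-5 rule: given low and high already
# rounded at the most significant digit (msd), choose 3, 4 or 5 equi-width
# intervals according to the rule above.
def three_four_five(low, high, msd):
    distinct = round((high - low) / msd)           # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        raise ValueError("rule does not apply directly; adjust the msd")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(three_four_five(-1000, 2000, 1000))
# -> [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)], as in Step 3 above
```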
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based
on the analysis of the number of distinct values per
attribute in the data set
The attribute with the most distinct values is placed at
the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country 15 distinct values
Province or state 365 distinct values
city 3567 distinct values
street 674,339 distinct values
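A small sketch of this heuristic: attributes are ordered by their number of distinct values, using the counts from the slide:

```python
# Sketch of automatic hierarchy generation: order attributes by their number of
# distinct values, most distinct at the bottom (counts taken from the slide).
distinct_counts = {"country": 15, "province_or_state": 365,
                   "city": 3567, "street": 674339}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # top level first
print(" < ".join(reversed(hierarchy)))
# -> street < city < province_or_state < country
```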
Why Data Mining Primitives and Languages?
Finding all the patterns autonomously in a database? —
unrealistic because the patterns could be too many but
uninteresting
Data mining should be an interactive process
The user directs what is to be mined
Users must be provided with a set of primitives to communicate with the data mining system
Incorporating these primitives in a data mining query language
More flexible user interaction
Foundation for design of graphical user interface
Standardization of data mining industry and practice
What Defines a Data Mining Task?
Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization of discovered patterns
Data Mining Primitives
Task-relevant data: This is the database portion to be
investigated. For example, suppose that you are a manager of All
Electronics in charge of sales in the United States and Canada. In
particular, you would like to study the buying trends of
customers in Canada. Rather than mining on the entire
database, you specify only the task-relevant portion of the data; the attributes involved are referred to as relevant attributes
The kinds of knowledge to be mined: This specifies the data
mining functions to be performed, such as characterization,
discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of
customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers
like to buy
Data Mining Primitives
Background knowledge: Users can specify background
knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge
discovery process, and for evaluating the patterns found.
Interestingness measures: These functions are used to
separate uninteresting patterns from knowledge.
Simplicity: e.g., (association) rule length
Certainty: e.g., confidence, P(A|B)
Utility: potential usefulness, e.g., support (association)
Novelty: not previously known, surprising
Data Mining Primitives
Presentation and visualization of discovered
patterns:
This refers to the form in which discovered patterns are
to be displayed. Users can choose from different forms
for knowledge presentation, such as rules, tables, charts,
graphs, decision trees, and cubes.
A Data Mining Query Language (DMQL)
Motivation
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language like SQL
Design
DMQL is designed with the primitives described earlier
The DMQL can work with databases and data
warehouses as well. DMQL can be used to define data
mining tasks.
A Data Mining Query Language (DMQL)
Syntax for Task-Relevant Data Specification
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Syntax for Specifying the Kind of Knowledge
Characterization
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum,
or count%.
A Data Mining Query Language (DMQL)
Association
mine associations [ as {pattern_name} ]
{matching {metapattern} }
Eg. mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is key of customer relation; P and Q are predicate variables;
and W, Y, and Z are object variables.
Classification
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
Eg. analyze credit_rating
Syntax for Concept Hierarchy Specification
use hierarchy <hierarchy> for <attribute_or_dimension>
Eg. define hierarchy time_hierarchy on date as [date, month,
quarter, year]
A Data Mining Query Language (DMQL)
Syntax for Interestingness Measures Specification
with support threshold = 0.05
Syntax for Pattern Presentation and Visualization Specification
display as table
Full Specification of DMQL
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID
and P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table