🔹 Data Preprocessing: An Overview
Real-world data are often dirty, incomplete, noisy, and inconsistent.
If we directly apply mining algorithms, we get poor or misleading results.
👉 Hence, we must preprocess the data.
1. What Defines Data Quality?
Data have quality if they satisfy the requirements of the intended use.
Key Factors of Data Quality:
1. Accuracy
o Correctness of values.
o Example: Age recorded as 250 → inaccurate.
o Cause: faulty sensors, entry errors, wrong units.
2. Completeness
o Missing attribute values or tuples.
o Example: Customer record missing phone number.
o Cause: not recorded, ignored, deleted, or lost.
3. Consistency
o Different representations of the same thing.
o Example: “Dept01” vs “D01” for the same department.
4. Timeliness
o Data should be up to date.
o Example: Sales reports submitted late → incomplete at month-end.
5. Believability
o Do users trust the data?
o Example: A database once had many errors → even after fixing, users may still
distrust it.
6. Interpretability
o How easily can users understand the data?
o Example: Database uses obscure accounting codes → hard for sales team to
interpret.
👉 Different users may perceive quality differently.
A marketing analyst may accept 80% accurate customer addresses for campaigns.
A sales manager may find the same data unreliable.
2. Major Tasks in Data Preprocessing
The four major preprocessing tasks are:
2.1 Data Cleaning
Fixes dirty (incomplete, noisy, inconsistent) data.
Methods:
Fill in missing values
o Example: If “Age” is missing:
Replace with mean/median/mode.
Use most probable value (via regression, classification).
Smooth noisy data
o Example: Sensor reading 50, 52, 500, 51 → smooth using binning or moving
average.
Identify/remove outliers
o Example: Salary of 999999 among typical 30k–80k.
Resolve inconsistencies
o Example: Date formats (12/01/24 vs 01-12-2024).
Remove duplicates
o Example: Two identical customer records.
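👉 A minimal pandas sketch of the last two fixes above (resolving inconsistent date formats, then removing duplicates). The table and column names (cust_id, order_date, amount) are made up for illustration:
```python
import pandas as pd

# Hypothetical raw data: mixed date formats plus one duplicate row.
df = pd.DataFrame({
    "cust_id": [101, 102, 102, 103],
    "order_date": ["2024-01-12", "12 Jan 2024", "12 Jan 2024", "2024/03/05"],
    "amount": [250.0, 99.5, 99.5, 40.0],
})

# Resolve inconsistent date representations into one canonical type.
# format="mixed" needs pandas >= 2.0.
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# Remove exact duplicate records.
df = df.drop_duplicates()
print(df)
```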
2.2 Data Integration
Combines data from multiple sources (databases, files, data cubes).
Problems:
Schema mismatch → customer_id vs cust_id.
Value mismatch → “Bill” vs “William.”
Redundancy → same customer appearing in two datasets.
Example:
Sales database has cust_id, purchase_amount.
Customer database has customer_id, income.
👉 Integration merges them for richer analysis.
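👉 A sketch of that merge in pandas, assuming the two hypothetical tables above; the entity-identification step is simply telling the join that cust_id and customer_id name the same thing:
```python
import pandas as pd

# Hypothetical sales and customer tables from two sources.
sales = pd.DataFrame({"cust_id": [1, 2, 3], "purchase_amount": [120.0, 85.5, 300.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "income": [45000, 72000, 58000]})

# Entity identification: cust_id and customer_id refer to the same entity.
merged = sales.merge(customers, left_on="cust_id", right_on="customer_id", how="left")
merged = merged.drop(columns="customer_id")   # drop the now-redundant key
print(merged)
```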
2.3 Data Reduction
Reduces the dataset size while preserving knowledge.
This makes mining faster.
Types:
1. Dimensionality Reduction (reduce attributes)
o Remove irrelevant or redundant features.
o Example: Drop “middle name” from customer dataset if not useful.
o Techniques:
Attribute subset selection
Attribute construction (derive new features, e.g., BMI from
weight/height)
PCA (Principal Component Analysis)
2. Numerosity Reduction (reduce data volume)
o Replace data with smaller representations.
o Techniques:
Parametric models: Regression, log-linear models.
Example: Fit a regression line instead of storing all points.
Non-parametric models: Histograms, clustering, sampling.
Example: Store age distribution as bins (20–29: 2000 people, 30–39:
1500 people) instead of raw ages.
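👉 A sketch of parametric numerosity reduction: instead of storing every (experience, salary) point, keep only the two coefficients of a fitted regression line. The synthetic data and numbers are illustrative:
```python
import numpy as np

rng = np.random.default_rng(0)
experience = rng.uniform(0, 20, size=10_000)                       # 10,000 raw points
salary = 3000 + 200 * experience + rng.normal(0, 150, size=10_000)

# Fit a straight line: two numbers now stand in for 10,000 observations.
slope, intercept = np.polyfit(experience, salary, deg=1)
print(f"salary ≈ {intercept:.0f} + {slope:.0f} * experience")

# Approximate values can be reconstructed on demand instead of stored.
estimated = intercept + slope * 5.0   # estimated salary at 5 years of experience
```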
2.4 Data Transformation
Convert data into formats suitable for mining.
Methods:
1. Normalization (Scaling)
o Adjust values to a smaller range (e.g., [0,1] or [-1,1]).
o Example:
Age = 25, Salary = 75,000.
Without normalization, salary dominates distance-based algorithms
(kNN, clustering).
After normalization: Age = 0.25, Salary = 0.75.
2. Discretization
o Convert continuous attributes into categories.
o Example:
Age = 25 → “Youth”
Income = 65,000 → “Medium Income.”
o Methods:
Static (predefined ranges).
Dynamic (data-driven, e.g., clustering).
3. Concept Hierarchy Generation
o Replace raw values with higher-level concepts.
o Example:
City → State → Country
Age (23, 27, 31) → “Young Adult”
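👉 A compact sketch of the three transformation methods above (min-max scaling, age discretization, and a concept-hierarchy lookup). The cut points and the city-to-state mapping are illustrative assumptions:
```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 67], "salary": [75_000, 52_000, 91_000],
                   "city": ["Hyderabad", "Mumbai", "Delhi"]})

# 1. Min-max normalization into [0, 1] so no attribute dominates distance measures.
for col in ["age", "salary"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# 2. Discretization: continuous age -> category labels (cut points are assumed).
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["Youth", "Adult", "Senior"])

# 3. Concept hierarchy: replace city with a higher-level concept.
city_to_state = {"Hyderabad": "Telangana", "Mumbai": "Maharashtra", "Delhi": "Delhi"}
df["state"] = df["city"].map(city_to_state)
print(df)
```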
3. Putting It All Together (Example Workflow)
Imagine you’re analyzing AllElectronics sales data.
1. Data Cleaning
o Fill missing “on_sale” attribute (if missing, mark as “unknown”).
o Fix date formats (2025-09-10 vs 10/09/2025).
o Remove duplicate transactions.
2. Data Integration
o Merge sales DB with customer demographics DB.
o Resolve mismatch: cust_id vs customer_id.
3. Data Reduction
o Drop irrelevant attributes like “middle name.”
o Use PCA to reduce 50 demographic features to 10 principal components.
o Sample 10% of transactions for testing.
4. Data Transformation
o Normalize attributes: age → [0,1], income → [0,1].
o Discretize income: “Low,” “Medium,” “High.”
o Generalize location: “Hyderabad” → “Telangana” → “India.”
Now, the data are clean, consistent, compact, and ready for mining.
🔹 Data Cleaning
Real-world data is often incomplete, noisy, or inconsistent.
Data cleaning (a.k.a. data cleansing) fixes these problems so mining results are reliable.
It mainly deals with:
1. Missing values
2. Noisy data
3. Inconsistent/discrepant data
1. Handling Missing Values
When some attributes have no values recorded (e.g., customer income).
Methods:
1. Ignore the tuple
o Drop the record if class label is missing.
o Bad if too many missing values → data loss.
2. Fill manually
o Possible for small datasets.
o Not scalable for big data.
3. Global constant
o Replace missing with “Unknown” or NULL.
o Risk: program may treat "Unknown" as a real category.
4. Central tendency (mean/median)
o If distribution is normal → use mean.
o If skewed → use median.
o Example: replace missing income with mean $56,000.
5. Class-wise mean/median
o Replace with average of same class.
o Example: For "High-risk" customers → use mean income of that group.
6. Most probable value (prediction models)
o Estimate using regression, Bayesian inference, or decision trees.
o Uses most info → preserves relationships best.
o Example: predict income using other customer attributes.
👉 Note: A missing value is not always an error. Sometimes it means “Not Applicable” (e.g., no driver’s license number). Metadata (the rules about how nulls should be recorded) should guide any replacement.
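👉 A minimal pandas sketch of methods 4–6 above (overall mean, class-wise mean, and a model-based “most probable value”). Column names and the choice of linear regression as the prediction model are assumptions:
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "risk_class": ["high", "high", "low", "low", "high"],
    "age": [45, 52, 23, 31, 40],
    "income": [58_000, None, 34_000, 39_000, None],
})

# Method 4: fill with the overall mean (use median instead if the data are skewed).
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of the same class (class-wise imputation).
df["income_class"] = df["income"].fillna(
    df.groupby("risk_class")["income"].transform("mean"))

# Method 6: predict the most probable value from other attributes (here: age).
known = df[df["income"].notna()]
model = LinearRegression().fit(known[["age"]], known["income"])
df["income_model"] = df["income"]
missing = df["income"].isna()
df.loc[missing, "income_model"] = model.predict(df.loc[missing, ["age"]])
print(df)
```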
2. Handling Noisy Data
Noise = random error or variance in a measured variable.
Techniques:
1. Binning (local smoothing)
o Sort data, put into bins (equal width or equal frequency).
o Replace values in bin with mean, median, or boundaries.
o Example:
Original prices = [4, 8, 15] → bin mean = 9 → replaced as [9, 9, 9].
2. Regression (global smoothing)
o Fit a function (linear/multiple regression).
o Example: Predict price using other attributes.
3. Outlier analysis
o Use clustering or statistical tests.
o Values outside clusters = possible outliers.
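👉 A small sketch of smoothing by bin means, extending the [4, 8, 15] example above to nine values split into three equal-frequency bins (the extra values and bin count are assumptions):
```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning: 3 bins of 3 sorted values each.
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: every value is replaced by its bin's mean.
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))
```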
3. Data Cleaning as a Process
It’s not just one step, but an iterative process:
Step 1: Discrepancy Detection
Causes:
o Poorly designed forms (optional fields)
o Human error
o Data decay (outdated addresses)
o Integration issues (different formats/names)
Use metadata:
o Attribute domain, range, type.
o Check mean, median, mode, skewness, std dev.
o Identify outliers & anomalies.
Step 2: Data Transformation
Fix issues via:
o Data scrubbing tools (domain knowledge, fuzzy matching, spell-checking).
o Data auditing tools (find rules/relationships → flag violations).
o Data migration & ETL tools (transform formats).
o Custom scripts (SQL, Python, etc.).
Challenges:
Iterative, error-prone.
Some fixes create new errors.
Often requires multiple iterations.
New Approaches:
Interactive cleaning tools (e.g., Potter’s Wheel):
o Spreadsheet-like, immediate feedback, undo transformations.
o Discrepancy detection in background.
Declarative languages:
o SQL extensions for specifying transformations.
Metadata updates:
o Always update metadata after cleaning → future cleaning is easier.
Data Integration (Simplified Explanation)
When we mine data, often the data comes from different sources (databases, files, data
warehouses).
Before using it, we must integrate (combine) this data carefully to avoid errors, duplication,
and inconsistencies.
Challenges in Data Integration
1. Entity Identification Problem
o Same real-world thing may have different names in different databases.
o Example:
One database: customer_id
Another database: cust_number
Both actually mean the same thing.
o Metadata (info about data: name, type, range, rules) helps in matching.
o Need to also check constraints:
Example: Discount applied to an order vs. applied to each item →
must align.
2. Redundancy & Correlation Analysis
o Sometimes, two attributes contain the same information.
o Example: annual_revenue can be derived from monthly_revenue × 12.
o To detect redundancy, use correlation tests (a code sketch follows this list):
For Nominal Data (categories like Male/Female, Fiction/Nonfiction):
Use Chi-square (χ²) test.
It checks if two attributes are independent or correlated.
Example: Gender vs. Preferred Reading → strong correlation
found.
For Numeric Data (numbers like income, marks):
Use Correlation Coefficient (r):
r > 0 → Positive correlation (A ↑, B ↑).
r < 0 → Negative correlation (A ↑, B ↓).
r = 0 → No correlation.
Use Covariance (measures joint variation).
Positive covariance → values rise together.
Negative covariance → one rises, other falls.
Example: Stock prices of two companies moving
together → positive covariance.
3. Tuple Duplication
o Same record may appear multiple times.
o Example: Two identical purchase orders in different systems.
o Can also occur when using denormalized tables (storing same info multiple
times).
o May lead to inconsistencies (e.g., same customer with different addresses).
4. Data Value Conflicts
o Even if the same entity is identified, values may conflict:
Representation difference → Dates (25/12/2010 vs. 2010/12/25).
Unit difference → Weight in kg vs. lbs.
Currency/scale → Room price in ₹ vs. $.
Abstraction level → Sales for a branch vs. sales for a region.
Encoding differences → Pay type stored as H/S in one DB and 1/2 in
another.
o Must apply transformation rules to make them consistent.
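👉 A sketch of the correlation checks from point 2 above: a chi-square test of independence for two nominal attributes, and Pearson correlation plus covariance for two numeric attributes. The contingency counts and the stock series are made up for illustration:
```python
import numpy as np
from scipy.stats import chi2_contingency

# Nominal-nominal: observed counts of gender vs. preferred reading (illustrative).
observed = np.array([[250, 200],     # male:   fiction, non-fiction
                     [50, 1000]])    # female: fiction, non-fiction
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")   # tiny p -> attributes are correlated

# Numeric-numeric: two stock price series that tend to move together.
stock_a = np.array([2, 3, 5, 4, 6])
stock_b = np.array([5, 8, 10, 11, 14])
r = np.corrcoef(stock_a, stock_b)[0, 1]          # Pearson correlation coefficient
cov = np.cov(stock_a, stock_b)[0, 1]             # sample covariance
print(f"r = {r:.2f}, cov = {cov:.2f}")           # both positive -> rise and fall together
```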
🌐 What is Data Reduction?
When we mine very large datasets, analysis becomes slow and expensive.
So, we reduce the data size (rows/columns/values) without losing important patterns.
👉 Goal: Make mining faster while keeping results almost the same as with the original data.
Three Main Data Reduction Strategies
1. Dimensionality Reduction
👉 Reduce the number of attributes (columns/features).
Why? Many features are irrelevant, redundant, or weakly correlated.
Techniques:
o Wavelet Transform (DWT): Compresses data by keeping only strong
coefficients.
🔹 Example: An image with 1024 pixels → DWT keeps only the top 100
coefficients → you can still reconstruct the image roughly.
o Principal Component Analysis (PCA): Creates new features (principal
components) that capture most of the data’s variance.
🔹 Example: If a dataset has Height (cm) and Weight (kg), PCA might create a
single new variable “Body size” that summarizes both.
o Attribute Subset Selection: Just drop unnecessary features.
🔹 Example: A dataset with {Name, Age, Age in Months, Height}. “Age in
Months” is redundant → remove it.
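👉 A sketch of PCA compressing two correlated features (height, weight) into a single “body size” component, as in the example above; the synthetic data and scikit-learn usage are illustrative:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=500)                   # cm
weight = 0.9 * (height - 100) + rng.normal(0, 5, 500)    # kg, correlated with height
X = np.column_stack([height, weight])

# Standardize first so both features contribute on the same scale.
X_std = StandardScaler().fit_transform(X)

# Keep one principal component: a single "body size" score per person.
pca = PCA(n_components=1)
body_size = pca.fit_transform(X_std)
print("variance explained:", pca.explained_variance_ratio_[0])   # typically ~0.9+
```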
2. Numerosity Reduction
👉 Replace the dataset with a smaller, approximate form.
Parametric (model-based): Use a formula/model instead of raw data.
o Regression: Instead of storing 1M salary records, store a regression line
(Salary = 3000 + 200 × Experience).
o Log-linear Models: Approximate multi-dimensional distributions.
Non-Parametric: Use data structures instead of equations.
o Histograms: Store only frequency counts for ranges.
🔹 Example: Instead of storing 10,000 student scores, store: {0–10: 50, 10–20:
120, 20–30: 300, …}.
o Clustering: Group similar data and store only cluster centers.
🔹 Example: In a dataset of 1M customers, cluster into 50 groups → store only
averages.
o Sampling: Take a small random sample that represents the whole.
🔹 Example: Take 5,000 rows instead of 5M rows.
o Data Cube Aggregation: Pre-compute and store summary values (e.g., totals,
averages).
🔹 Example: Instead of storing every daily sales transaction, keep monthly
sales totals.
3. Data Compression
👉 Transform the data into a smaller form.
Lossless compression: Can fully reconstruct original.
🔹 Example: ZIP compression.
Lossy compression: Can only reconstruct an approximation, but good enough.
🔹 Example: JPEG image compression (not every pixel preserved, but visually same).
Note: Both dimensionality reduction and numerosity reduction are also kinds of
compression.
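👉 A sketch of two non-parametric numerosity-reduction ideas from above: storing frequency counts per range instead of raw scores, and keeping only a small random sample of rows. The bin edges and sampling fraction are assumptions:
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
scores = rng.integers(0, 100, size=10_000)     # pretend these are 10,000 raw scores

# Histogram: keep only counts per range (a handful of numbers instead of 10,000).
counts, edges = np.histogram(scores, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(dict(zip([f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])], counts)))

# Sampling: keep a 0.1% random sample that stands in for the full table.
df = pd.DataFrame({"score": scores})
sample = df.sample(frac=0.001, random_state=7)
print(len(sample), "rows kept out of", len(df))
```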
📌 Data Transformation & Data Discretization
🔹 What is Data Transformation?
It is the process of converting data into a more suitable format for mining.
This makes the mining process more efficient and the patterns easier to understand.
🔹 Main Data Transformation Strategies
1. Smoothing
Removes noise (random errors/outliers) from data.
Techniques:
o Binning → Group values into bins and replace with bin mean/median.
o Regression → Fit a line/curve and smooth data.
o Clustering → Group similar values and smooth within groups.
👉 Example: Exam scores = {40, 42, 38, 100, 41}.
The outlier (100) can be smoothed by binning → bin = {38–42}, mean ≈ 40.
2. Attribute Construction (Feature Construction)
Create new attributes from existing ones to help mining.
👉 Example: From (height, weight), construct “BMI = weight / height²”.
3. Aggregation
Combine data to a higher level.
Often used in data cubes for OLAP.
👉 Example:
Daily sales → Monthly sales → Yearly sales.
4. Normalization
Scale attributes into a smaller range (e.g., [0,1] or [-1,1]).
Useful for distance-based methods (clustering, kNN, neural nets).
Techniques:
Min-Max Normalization
Formula:
$v' = \frac{v - \min(A)}{\max(A) - \min(A)} \times (\text{new\_max} - \text{new\_min}) + \text{new\_min}$
👉 Example: Income $73,600 with min=12,000, max=98,000 → normalized to 0.716 in [0,1].
Z-score Normalization (Standardization)
Formula:
$v' = \frac{v - \mu}{\sigma}$
👉 Example: Income $73,600 with mean=54,000, SD=16,000 → z = 1.225.
Decimal Scaling
Move decimal point by factor of 10^j.
👉 Example: Value 986 → 0.986 (divided by 1000).
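👉 A sketch that reproduces the three worked numbers above (0.716, 1.225, and 0.986); the helper functions are just the formulas written out:
```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization (standardization)."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Divide by 10**j, where j is the smallest integer making |v'| < 1."""
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(round(decimal_scaling(986, 3), 3))           # 0.986
```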
5. Discretization
Convert continuous values into interval labels or concept labels.
Helps in simplification & concept hierarchy generation.
👉 Example:
Age (numeric) → {0–10, 11–20, …}
Or Age → {Youth, Adult, Senior}.
6. Concept Hierarchy Generation for Nominal Data
Generalize categorical attributes into higher levels.
👉 Example:
Street → City → Country.
Product ID → Category → Department.
🔹 Data Discretization Techniques
Discretization = reducing continuous attributes into a few intervals.
It can be:
Supervised → Uses class info (e.g., Decision Tree, ChiMerge).
Unsupervised → No class info (e.g., Binning, Histogram).
Top-down (Splitting) → Start broad, split further.
Bottom-up (Merging) → Start detailed, merge intervals.
1. Binning (unsupervised, top-down)
Equal-width bins (e.g., 0–10, 10–20, 20–30).
Equal-frequency bins (e.g., 10 students per bin).
👉 Example: Income values into 3 bins of equal width.
2. Histogram Analysis (unsupervised)
Partition values into disjoint ranges (buckets).
Equal-width histogram = same size bins.
Equal-frequency histogram = same number of tuples per bin.
👉 Example: Prices bucketed into {0–100, 100–200, …}.
3. Clustering (unsupervised, data-driven)
Group attribute values into clusters → each cluster = interval.
Closer data points go into the same cluster.
👉 Example: Age values grouped into natural clusters: {0–15}, {16–35}, {36–60}, {60+}.
4. Decision Tree Analysis (supervised, top-down)
Use class labels & entropy to choose best split points.
Produces intervals that improve classification accuracy.
👉 Example: Symptom “Temperature” discretized into {<37.5 = Normal, ≥37.5 = Fever} based
on diagnosis labels.
5. Correlation Analysis (ChiMerge, supervised, bottom-up)
Merge intervals with similar class distributions.
Uses chi-square test to decide merging.
👉 Example: Age values {21–25} and {26–30} both mostly map to “Student” → merge them.
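👉 A sketch contrasting equal-width and equal-frequency binning with pandas; the income values (in $1000s) and the choice of three bins are illustrative:
```python
import pandas as pd

income = pd.Series([12, 15, 18, 22, 25, 31, 40, 44, 60, 95])   # in $1000s

# Equal-width: 3 bins of the same width over [min, max].
equal_width = pd.cut(income, bins=3, labels=["Low", "Medium", "High"])

# Equal-frequency: 3 bins with (roughly) the same number of values in each.
equal_freq = pd.qcut(income, q=3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"income": income,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```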
📌 Basic Statistical Descriptions of Data
👉 Before preprocessing, we need an overall picture of the data.
Statistical descriptions help us:
Understand center of the data.
Understand spread (dispersion) of the data.
Identify outliers & noise.
Visualize data distribution.
🔹 1. Measures of Central Tendency
These describe the center of the data distribution.
(a) Mean (Arithmetic Average)
$\text{Mean} = \frac{x_1 + x_2 + \dots + x_N}{N}$
👉 Example: Salaries = {30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110}
$\text{Mean} = \frac{696}{12} = 58$
So, average salary = $58,000.
⚡ Problem: Sensitive to outliers (e.g., 110K pushes the mean up).
✔️Fix: Use trimmed mean (remove extreme top/bottom % before averaging).
(b) Median (Middle Value)
Middle value when data is sorted.
If odd N → median = exact middle.
If even N → average of two middle values.
👉 Example: Salaries above → N=12 (even).
Middle values = 52, 56 → median = (52+56)/2 = 54 (=$54,000).
⚡ Advantage: Less affected by outliers/skewness.
(c) Mode (Most Frequent Value)
Value that occurs most often.
Data can be:
o Unimodal → 1 mode.
o Bimodal → 2 modes.
o Multimodal → >2 modes.
👉 Example: Salaries → Modes = 52, 70 → Bimodal.
💡 Relation (for moderately skewed data):
$\text{Mode} \approx 3 \times \text{Median} - 2 \times \text{Mean}$
(d) Midrange
Average of min and max values.
$\text{Midrange} = \frac{\min + \max}{2}$
👉 Example: Salaries → min=30, max=110
$\text{Midrange} = \frac{30 + 110}{2} = 70$
So, midrange = $70,000.
✅ Summary of Central Measures
Measure   | Strength                         | Weakness
Mean      | Uses all data                    | Affected by outliers
Median    | Resistant to outliers            | Ignores distribution shape
Mode      | Works for categorical & numeric  | May not be unique
Midrange  | Easy to compute                  | Highly sensitive to outliers
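👉 The measures above can be checked on the salary data with a short script; statistics.multimode is used so the bimodal case (52 and 70) shows up:
```python
from statistics import mean, median, multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # in $1000s

print("mean     =", mean(salaries))                        # 58
print("median   =", median(salaries))                      # 54.0
print("modes    =", multimode(salaries))                   # [52, 70] -> bimodal
print("midrange =", (min(salaries) + max(salaries)) / 2)   # 70.0
```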
🔹 2. Measures of Dispersion (Spread)
(You’ll see this in the next part of the book, but summarizing for context):
Range = max – min.
Quartiles & IQR = Q3 – Q1.
Variance & Standard Deviation = average squared deviation from mean.
Boxplots help visualize spread & outliers.
🔹 3. Graphical Displays
Bar charts, pie charts, line graphs → simple summaries.
Histograms → distribution of numeric data.
Scatter plots → correlation between 2 attributes.
Quantile plots, Q-Q plots → compare distributions.
🔹 Skewness of Data
Symmetric → mean = median = mode.
Positively Skewed (right tail) → mean > median > mode.
Negatively Skewed (left tail) → mean < median < mode.
👉 Example:
Salaries (with 110 as outlier) → positively skewed (few very high salaries).
🔹 1. Range
Definition: Difference between the maximum and minimum values.
Formula:
$\text{Range} = \max(X) - \min(X)$
Example: Salaries = {30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110}
$\text{Range} = 110 - 30 = 80$
🔹 2. Quartiles & Interquartile Range (IQR)
Quartiles divide ordered data into 4 equal parts.
o Q1 = 25th percentile (cuts lowest 25%)
o Q2 = Median = 50th percentile
o Q3 = 75th percentile (cuts lowest 75%)
Interquartile Range (IQR): Range of the middle 50% of data.
$IQR = Q3 - Q1$
Example (same salaries, N=12):
o Q1 = 3rd value = 47
o Q2 = (6th + 7th)/2 = (52+56)/2 = 54
o Q3 = 9th value = 63
o $IQR = 63 - 47 = 16$
🔹 3. Five-Number Summary
Provides a quick snapshot of distribution:
$\{\text{Minimum}, Q1, \text{Median}, Q3, \text{Maximum}\}$
Example (salaries):
$\{30, 47, 54, 63, 110\}$
🔹 4. Boxplot (Graphical Display)
Visual representation of the five-number summary.
Components:
o Box → from Q1 to Q3 (IQR)
o Line inside box → Median
o Whiskers → extend to Min & Max (within 1.5 × IQR)
o Outliers → points beyond whiskers
👉 Helps to spot skewness and outliers.
🔹 5. Variance (σ²)
Definition: Average of squared deviations from the mean.
Formula:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
Example (salaries, mean = 58):
$\sigma^2 = 379.17$
🔹 6. Standard Deviation (σ)
Definition: Square root of variance → expresses spread in original units.
Formula:
$\sigma = \sqrt{\sigma^2}$
Example (above data):
$\sigma = \sqrt{379.17} \approx 19.47$
🔹 7. Outlier Detection
Rule of Thumb:
o Outlier if
$x < Q1 - 1.5 \times IQR \quad \text{or} \quad x > Q3 + 1.5 \times IQR$
Example (salaries):
o Q1 = 47, Q3 = 63, IQR = 16
o Lower fence = 47 - 1.5(16) = 23
o Upper fence = 63 + 1.5(16) = 87
o Any value > 87 (or < 23) is an outlier → 110 is an outlier
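👉 A sketch that verifies the dispersion numbers above. Note that quartile conventions vary; the "lower" method (numpy ≥ 1.22) is chosen here so that Q1 = 47 and Q3 = 63 match the worked example:
```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1 = np.percentile(salaries, 25, method="lower")   # 47 (convention-dependent)
q3 = np.percentile(salaries, 75, method="lower")   # 63
iqr = q3 - q1                                      # 16

print("five-number summary:",
      salaries.min(), q1, np.median(salaries), q3, salaries.max())   # 30 47 54.0 63 110

# Population variance and standard deviation (ddof=0), as in the formula above.
print("variance =", round(salaries.var(), 2))      # 379.17
print("std dev  =", round(salaries.std(), 2))      # 19.47

# 1.5 * IQR rule for outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # 23 and 87
print("outliers:", salaries[(salaries < lower) | (salaries > upper)])   # [110]
```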