Data Mining and Knowledge
Discovery (SDSC 8009)
Week 02 PPT
Prof. SAKAL Collin
Semester A 2025 – 2026
City University of Hong Kong (Dongguan)
Week 02 Lecture Outline
1. Exploratory Data Analysis (EDA)
2. Data pre-processing & feature engineering
3. Clustering
4. Coding examples in Python (Jupyter Notebook)
5. Introduction to Lets-Plot (if we have time)
Exploratory Data Analysis (EDA)
Exploratory data analysis is the process of
understanding the characteristics of data
through visual and statistical summaries,
identifying issues that need to be addressed
during data pre-processing, and generating
hypotheses about relationships within the
data.
Common Questions to Answer in EDA
1. How many rows and columns are in my data?
2. What does each column represent?
3. How many missing values are in each column?
4. Are all numeric variables in the range I would expect?
5. What are the distributions of my numeric variables?
6. Do all categorical variables contain the categories I’d expect?
7. What are the relationships between the features in my data?
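A minimal pandas sketch for answering these questions, assuming the data have already been loaded into a DataFrame (the file name here is hypothetical):

```python
import pandas as pd

# Hypothetical dataset; replace with your own file
df = pd.read_csv("health_data.csv")

# 1. Rows and columns
print(df.shape)

# 2-3. Column names, types, and missing values per column
df.info()
print(df.isna().sum())

# 4-5. Ranges and distributions of numeric variables
print(df.describe())

# 6. Categories present in each categorical (string) column
for col in df.select_dtypes(include="object"):
    print(col, df[col].unique())

# 7. Pairwise correlations between numeric features
print(df.corr(numeric_only=True))
```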
EDA is often divided into univariate EDA (single variable) and multivariate
EDA (multiple-variable), both of which are done using statistical summaries
and visualizations.
Univariate Statistical Summaries
Univariate Visual Summaries
Histogram
Boxplot
Violin plot
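A quick matplotlib sketch of these three plots, assuming a DataFrame df with a numeric column age (both names are hypothetical):

```python
import matplotlib.pyplot as plt

values = df["age"].dropna()  # hypothetical numeric column

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: shape of the distribution
axes[0].hist(values, bins=20)
axes[0].set_title("Histogram")

# Boxplot: median, quartiles, and outliers
axes[1].boxplot(values)
axes[1].set_title("Boxplot")

# Violin plot: boxplot-style summary plus a density estimate
axes[2].violinplot(values)
axes[2].set_title("Violin plot")

plt.tight_layout()
plt.show()
```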
Multivariate Statistical Summaries
• Contingency table example: smoking cigarettes versus lung cancer
                 Developed cancer    Did not develop cancer
Smokes                  45                     55
Does not smoke           5                     95
**this data is not real
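A contingency table like this can be built with pandas crosstab; the tiny dataset below is made up purely for illustration:

```python
import pandas as pd

# Made-up example data (not real)
df = pd.DataFrame({
    "smokes": ["yes", "yes", "no", "no", "yes", "no"],
    "developed_cancer": ["yes", "no", "no", "no", "yes", "no"],
})

# Rows = smoking status, columns = cancer status, cells = counts
print(pd.crosstab(df["smokes"], df["developed_cancer"]))
```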
Multivariate Visual Summaries
**this data is not real
Multivariate Statistical Summaries
Stratified Boxplot
**this data is not real
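A stratified boxplot can be drawn directly from pandas with the by argument; the columns below are made up for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up example data (not real)
df = pd.DataFrame({
    "smokes": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "blood_pressure": [135, 120, 142, 118, 150, 125, 138, 122],
})

# One boxplot of blood_pressure per smoking group
df.boxplot(column="blood_pressure", by="smokes")
plt.show()
```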
Data Pre-Processing
Data pre-processing involves taking raw data and preparing it for input into prediction and statistical models by handling missing and abnormal values and conducting feature engineering.
Types of Missingness
Missing completely at random (MCAR): Missing values are entirely random and not related to any of the data.
Missing at random (MAR): Missingness depends on other observed variables in the data but not on the missing values themselves.
Missing not at random (MNAR): The missing values are directly related to the unobserved value itself.
Note: we generally cannot prove whether data are missing completely at random or missing
at random. How we handle missing values will be domain and task dependent.
Ex: Missing at Random (MAR)
• Consider a study where participants need to take
their skin temperature every night
• Older people may be more likely to forget to take measurements, which results in more missing values
• Thus, the missingness of the skin temperature measurements depends on age, but not on skin temperature itself
• For any two people of the same age, the probability
of having a missing skin temperature measurement
would be equal
Ex: Missing not at Random (MNAR)
• Consider a study where we measure depression using a
mental health survey
• People with extreme depression may be less likely to fill
out the survey, resulting in missingness
• In this case, the missing value is directly dependent on
the feature depression itself
Missingness, what to do?
Do nothing: Often not possible; many algorithms cannot handle missing values.
Delete rows with missing values: Could result in a large amount of data loss if there are many missing values.
Impute missing values (replace them with another number): Will be discussed in detail later.
Note: however we deal with missing values, we want to try to ensure we preserve the underlying relationships between the features in our data.
Imputing Missing Values
Univariate: Impute missing values for a feature using only data from that feature.
Multivariate: Impute missing values for a feature using other features in the data set.
Imputation Methods
Mean/median imputation (univariate): Impute missing values with the mean or median for numeric features.
Mode imputation (univariate): Impute missing values with the mode for categorical features.
Algorithmic imputation (usually multivariate): Example: use a regression model trained on other features in the data to predict what the missing value should be.
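A hedged scikit-learn sketch of univariate (median) and multivariate (algorithmic) imputation; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric data with missing values
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 58],
    "income": [30000, np.nan, 52000, 61000, np.nan],
})

# Univariate: fill each column with its own median
# (strategy="most_frequent" would give mode imputation for categorical features)
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# Multivariate (algorithmic): model each feature from the other features
iterative_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns
)

print(median_imputed)
print(iterative_imputed)
```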
Handling Abnormal Values
• Examples
• Numerical outliers
• Unexpected categories
• Impossible values
• Missing values can be considered “abnormal”
• Sometimes abnormal values are informative (they help us understand
how to predict or model relationships in data)
• Other times abnormal values may bias our analyses, especially if they
are impossible or improbable
Abnormal values, what to do?
Feature Engineering
Feature engineering is the process of taking our data and encoding each feature in a way that maximizes the predictive performance of our machine learning models.
Encoding Categorical Features
• Many algorithms will not accept categorical (string) features
• We need to meaningfully represent categories using numbers
• Common strategies
• Ordinal encoding
• Dummy encoding
• Advanced strategies
• Target encoding
• Other supervised/unsupervised algorithmic methods
Ordinal Encoding
• Replace values of categorical features with ordinal numbers
Education    Education (encoded)
BS           1
MSc          2
MSc          2
PhD          3
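One way to apply ordinal encoding in pandas, with an explicit mapping so the numbers reflect the intended order (BS < MSc < PhD):

```python
import pandas as pd

df = pd.DataFrame({"Education": ["BS", "MSc", "MSc", "PhD"]})

# Explicit mapping preserves the intended ordering of the categories
education_order = {"BS": 1, "MSc": 2, "PhD": 3}
df["Education_encoded"] = df["Education"].map(education_order)
print(df)
```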
One-Hot Encoding
Take a categorical feature with C categories and convert it into C new
binary features indicating whether an observation was in that category.
Education Education_BS Education_MSc Education_PhD
BS 1 0 0
MSc 0 1 0
MSc 0 1 0
PhD 0 0 1
Dummy Encoding
• Converting a feature with C categories into C new features (One-Hot Encoding) produces binary columns that always sum to one, so together with an intercept (constant) column the feature matrix is not full rank. Some algorithms will return errors in this scenario.
• Solution: make C-1 new features instead
Education Education_BS Education_MSc
BS 1 0
MSc 0 1
MSc 0 1
PhD 0 0
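Both encodings can be produced with pandas get_dummies; setting drop_first=True gives the C-1 dummy-encoded version (note it drops the first category alphabetically, which may differ from the table above):

```python
import pandas as pd

df = pd.DataFrame({"Education": ["BS", "MSc", "MSc", "PhD"]})

# One-hot encoding: C binary columns
one_hot = pd.get_dummies(df["Education"], prefix="Education")

# Dummy encoding: C-1 binary columns (one reference category dropped)
dummy = pd.get_dummies(df["Education"], prefix="Education", drop_first=True)

print(one_hot)
print(dummy)
```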
Encoding Numeric Features
• Most algorithms will accept numeric features in their original form,
but sometimes modifying the feature can improve predictive
performance
• Common strategies
• Scaling and/or centering
• Distribution-based transformations
• Interactions
• Non-linear representations
• For some algorithms, features need to be on the same scale (e.g., between 0 and 1) before we can make predictions.
Scaling and Centering
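A minimal sketch of the two most common options, min-max scaling (to the range [0, 1]) and standardization (centering to mean 0 and scaling to standard deviation 1), using scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [6.0], [10.0]])  # one made-up numeric feature

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1]
x_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, result has mean 0 and std 1
x_standard = StandardScaler().fit_transform(X)

print(x_minmax.ravel())
print(x_standard.ravel())
```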
Transformations
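Presumably this refers to the distribution-based transformations listed earlier; a common example is a log transform for right-skewed features. A minimal sketch on made-up values:

```python
import numpy as np

# Made-up right-skewed feature (e.g., income-like values)
x = np.array([1_000, 2_000, 3_500, 8_000, 120_000], dtype=float)

# log1p = log(1 + x): compresses large values and handles zeros safely
x_log = np.log1p(x)
print(x_log)
```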
Interactions
• Interactions occur when the relationship between one feature
with a target variable depends on the value of another feature.
[Figure: scatter plot of a numeric feature (x-axis) against the target (y-axis), colored by a categorical feature. The relationship between the numeric feature and the target depends on the categorical feature.]
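One simple way to encode an interaction explicitly is to multiply the two features; the columns below are made up for illustration:

```python
import pandas as pd

# Made-up data: a numeric feature and a 0/1-encoded categorical feature
df = pd.DataFrame({
    "dose": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "is_treated": [0, 0, 0, 1, 1, 1],
})

# Interaction term: lets a linear model fit a different slope per group
df["dose_x_treated"] = df["dose"] * df["is_treated"]
print(df)
```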
Non-Linear Representations
• Some relationships between features and target variables are non-linear. If our prediction algorithm cannot automatically account for non-linearity, we must do so ourselves.
Example: the relationship
between step counts per day (x-
axis) and the risk of dying (y-axis)
is non-linear
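One common way to hand-craft a non-linear representation for a linear model is to add polynomial terms; a scikit-learn sketch on a hypothetical step-count feature:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature: daily step counts
steps = np.array([[2000], [5000], [8000], [12000]])

# Adds steps^2 alongside steps so a linear model can fit a curved relationship
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(steps))
```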
Clustering
• Clustering takes data points and assigns
them into groups (“clusters”).
• Clustering for understanding: identify
clusters that are meaningful and help
understand the data more clearly.
• Clustering for utility: identify clusters that
simplify or make other analyses easier.
Clustering Example (Understanding)
• Identifying groups of people with similar sleep patterns and then using
statistical models to examine the relationships of the groups with cancer
[Diagram: smartwatch sleep data → clusters (Group 1: early sleepers; Group 2: late sleepers; Group 3: short sleepers & nappers) → examine relationships with cancer using statistical models]
Challenges of Clustering
• The definition of a “cluster” is
often not clear
• If we want to cluster for
understanding, we need to make
sure the clusters meaningfully
represent the data
• Different clustering algorithms
can provide very different results
Useful Clustering Definitions
• Partitional versus hierarchical
• Partitional – groups are distinct and non-overlapping
• Hierarchical – groups are nested (tree-like structure)
• Exclusive, overlapping, and fuzzy clustering
• Exclusive – each object is assigned to one cluster
• Overlapping – objects can be assigned to more than one cluster
• Fuzzy – group membership is defined by a probability
• Complete versus partial
• Complete – every object is assigned to a cluster
• Partial – not every object is assigned to a cluster
Common Clustering Algorithms
• K-Means
• Agglomerative Hierarchical Clustering
• DBSCAN
K-Means Clustering
• Center-based clusters – we assign points to clusters based on their
similarity to a central point (centroid) within each cluster
• Centroid – the center of a cluster, defined by the mean (usually)
Algorithm
1. Select K points as initial centroids (K = number of clusters you want)
2. Repeat
3. Calculate distance between every point and every centroid
4. Assign every point to a cluster based on the closest centroid
5. Recompute each centroid as the mean of the points assigned to it
6. Stop when: centroids do not change (or when they change by less than X%)
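A scikit-learn sketch of the algorithm above on randomly generated data; by default sklearn's KMeans uses the k-means++ initialization mentioned later:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two made-up blobs of 2D points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# K = 2 clusters; n_init controls how many random restarts are tried
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids
print(labels[:10])              # cluster assignments of the first 10 points
```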
K-Means Illustration
[Figure: cluster assignments and centroid positions over iterations 1 through 6, with the start (iteration 1) and end (iteration 6) compared side by side; the marked points are the centroids.]
K-Means Drawbacks
• The number of clusters (K) you tell K-Means to find
may not meaningfully represent “real” clusters in
the data
• The final clusters will change based on the initially
specified centroids
• The k-means++ algorithm provides a more robust way
of selecting initial centroids.
• Sensitive to outliers
K-Means Mathematical Details
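The equations from this slide are not reproduced here, but the quantity K-Means is usually described as minimizing is the within-cluster sum of squared errors (SSE); a standard way to write it is:

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

where C_i is the set of points in cluster i and mu_i is its centroid; each iteration reassigns points to the nearest centroid and then recomputes each centroid as this mean.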
Hierarchical Clustering
• Nested clusters that can be visualized using a tree structure (dendrogram).
• Strategies:
• Agglomerative: start with all points as clusters and merge until only one cluster is left (bottom of tree -> top)
• Divisive: start with one all-inclusive cluster and split until each cluster contains one point (top of tree -> bottom)
[Figure: dendrogram for points A, B, C: the full data {A, B, C} splits into {A} and {B, C}, which in turn splits into {B} and {C}.]
Hierarchical Clustering
• Does not depend on a pre-specified number of clusters like K-Means
• We can decide where to “cut” the tree to define the groups
• However, it does not minimize a global objective function, making the “goodness” of the clusters difficult to determine.
[Figure: the same dendrogram for points A, B, C.]
Agglomerative Hierarchical Clustering
Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the 2 closest clusters
5. Update the proximity matrix
6. Stop when: only a single cluster is left
Note: the proximity matrix quantifies the distance between clusters in the data.
Agglomerative Hierarchical Clustering
• One challenge is defining the “distance” between clusters
• MIN: distance between closest points in different clusters
• MAX: distance between furthest points in different clusters
• Group average: average pairwise proximity
• Centroid-based approaches: (not shown)
[Figure: MIN, MAX, and group average linkage illustrated]
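A short scipy sketch of agglomerative clustering on made-up data; method="single", "complete", and "average" correspond to the MIN, MAX, and group average rules above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # made-up 2D points

# Build the merge tree; try "single" (MIN), "complete" (MAX), or "average"
Z = linkage(X, method="average")

# "Cut" the tree into 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib
```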
DBSCAN
• DBSCAN intuition: we can define clusters based on regions of high
density that are separated by regions of lower density
What regions do you think should belong to different clusters?
DBSCAN Definitions
Density: The number of points within a given radius (Eps).
Core point: A point that has at least a pre-specified number of points (MinPts) within a radius (Eps).
Border point: A point that is not a core point, but is within the neighborhood of a core point.
Noise point: Any point that is neither a core point nor a border point.
Visual Intuition
Eps = 10, MinPts = 4
DBSCAN
Algorithm
1. Label all points as core, border, or noise
2. Eliminate noise points
3. Put an edge between all core points within a distance Eps of each other
4. Make each group of connected core points a separate cluster
5. Assign each border point to one of the clusters of its associated core points
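A scikit-learn sketch of DBSCAN on made-up data; eps and min_samples play the roles of Eps and MinPts above, and the values here are arbitrary:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense made-up blobs plus a few scattered points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=6, size=(10, 2)),
])

db = DBSCAN(eps=0.5, min_samples=4)
labels = db.fit_predict(X)

# Points labeled -1 are noise points (potential outliers)
print(np.unique(labels, return_counts=True))
```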
DBSCAN Results
[Figure: original data (left) and DBSCAN cluster assignments (right)]
** Dark blue = noise points. We can also use DBSCAN to identify outliers!
DBSCAN Difficulties
1. Determining the radius (Eps) and
number of points (MinPts) used to
define core, border, and noise
points.
2. Struggles when regions have varying
densities
3. Struggles with high dimensional
data (“density” is trickier in higher
dimensions)
I Have Clusters, Now What?
• Clustering algorithms will return clusters for random data
• There is a risk the clusters you derive are meaningless
• Strategies for determining cluster validity:
• Cluster cohesion: are points within a cluster close together?
• Cluster separation: are points in different clusters far apart?
[Figure: cohesion (points within a cluster close together) and separation (points in different clusters far apart) illustrated]
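One widely used number that combines cohesion and separation is the silhouette score; a sketch on the same kind of made-up blobs used in the K-Means example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 to 1; higher means tighter, better-separated clusters
print(silhouette_score(X, labels))
```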
Other Cluster Validity Measures
• If clustering is a part of your feature engineering
task for a machine learning model, you can examine
how the clusters relate to your target variable on
the training data.
• Can also visually examine the correlation between the proximity matrix and an “ideal” similarity matrix (1 if a pair of points belongs to the same cluster and 0 if it does not).
Now onto coding!