Data Mining and Knowledge
Discovery (SDSC 8009)
Week 02 PPT
Prof. SAKAL Collin
Semester A 2025 – 2026
City University of Hong Kong (Dongguan)
Week 02 Lecture Outline
1. Exploratory Data Analysis (EDA)
2. Data pre-processing & feature engineering
3. Clustering
4. Coding examples in Python (Jupyter Notebook)
5. Introduction to Lets-Plot (if we have time)
Exploratory Data Analysis (EDA)
Exploratory data analysis is the process of
understanding the characteristics of data
through visual and statistical summaries,
identifying issues that need to be addressed
during data pre-processing, and generating
hypotheses about relationships within the
data.
Common Questions to Answer in EDA
1. How many rows and columns are in my data?
2. What does each column represent?
3. How many missing values are in each column?
4. Are all numeric variables in the range I would expect?
5. What are the distributions of my numeric variables?
6. Do all categorical variables contain the categories I’d expect?
7. What are the relationships between the features in my data?
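A minimal pandas sketch for answering these questions, assuming the data have already been loaded into a DataFrame (the file name here is hypothetical):

```python
import pandas as pd

# Hypothetical dataset; replace with your own file
df = pd.read_csv("health_data.csv")

# 1. Rows and columns
print(df.shape)

# 2-3. Column names, types, and missing values per column
df.info()
print(df.isna().sum())

# 4-5. Ranges and distributions of numeric variables
print(df.describe())

# 6. Categories present in each categorical (string) column
for col in df.select_dtypes(include="object"):
    print(col, df[col].unique())

# 7. Pairwise correlations between numeric features
print(df.corr(numeric_only=True))
```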
EDA is often divided into univariate EDA (single variable) and multivariate
EDA (multiple-variable), both of which are done using statistical summaries
and visualizations.
Univariate Statistical Summaries
Univariate Visual Summaries
Histogram
Boxplot
Violin plot
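A quick matplotlib sketch of these three plots, assuming a DataFrame df with a numeric column age (both names are hypothetical):

```python
import matplotlib.pyplot as plt

values = df["age"].dropna()  # hypothetical numeric column

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: shape of the distribution
axes[0].hist(values, bins=20)
axes[0].set_title("Histogram")

# Boxplot: median, quartiles, and outliers
axes[1].boxplot(values)
axes[1].set_title("Boxplot")

# Violin plot: boxplot-style summary plus a density estimate
axes[2].violinplot(values)
axes[2].set_title("Violin plot")

plt.tight_layout()
plt.show()
```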
Multivariate Statistical Summaries
• Contingency table example: smoking cigarettes versus lung cancer
                 Developed cancer    Did not develop cancer
Smokes                  45                     55
Does not smoke           5                     95
**this data is not real
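A contingency table like this can be built with pandas crosstab; the tiny dataset below is made up purely for illustration:

```python
import pandas as pd

# Made-up example data (not real)
df = pd.DataFrame({
    "smokes": ["yes", "yes", "no", "no", "yes", "no"],
    "developed_cancer": ["yes", "no", "no", "no", "yes", "no"],
})

# Rows = smoking status, columns = cancer status, cells = counts
print(pd.crosstab(df["smokes"], df["developed_cancer"]))
```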
Multivariate Visual Summaries
**this data is not real
Multivariate Statistical Summaries
Stratified Boxplot
**this data is not real
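A stratified boxplot can be drawn directly from pandas with the by argument; the columns below are made up for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up example data (not real)
df = pd.DataFrame({
    "smokes": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "blood_pressure": [135, 120, 142, 118, 150, 125, 138, 122],
})

# One boxplot of blood_pressure per smoking group
df.boxplot(column="blood_pressure", by="smokes")
plt.show()
```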
Data Pre-Processing
Data pre-processing involves taking raw data and preparing it for input into prediction and statistical models by handling missing and abnormal values and conducting feature engineering.
Types of Missingness
Missing completely at random (MCAR): Missing values are entirely random and not related to any of the data.
Missing at random (MAR): Missingness depends on other observed variables in the data but not on the missing values themselves.
Missing not at random (MNAR): The missing values are directly related to the unobserved value itself.
Note: we generally cannot prove whether data are missing completely at random or missing
at random. How we handle missing values will be domain and task dependent.
Ex: Missing at Random (MAR)
• Consider a study where participants need to take
their skin temperature every night
• Older people may be more likely to forget to take measurements, which results in more missing values
• Thus, the missingness of the skin temperature measurements depends on age, but not on skin temperature itself
• For any two people of the same age, the probability
of having a missing skin temperature measurement
would be equal
Ex: Missing not at Random (MNAR)
• Consider a study where we measure depression using a
mental health survey
• People with extreme depression may be less likely to fill
out the survey, resulting in missingness
• In this case, the missing value is directly dependent on
the feature depression itself
Missingness, what to do?
Do nothing: Often not possible; many algorithms cannot handle missing values.
Delete rows with missing values: Could result in a large amount of data loss if there are many missing values.
Impute missing values (replace them with another number): Will be discussed in detail later.
Note: however we deal with missing values, we want to try to ensure we preserve the underlying relationships between the features in our data.
Imputing Missing Values
Univariate: Impute missing values for a feature using only data from that feature.
Multivariate: Impute missing values for a feature using other features in the data set.
Imputation Methods
Mean/median imputation (univariate): Impute missing values with the mean or median for numeric features.
Mode imputation (univariate): Impute missing values with the mode for categorical features.
Algorithmic imputation (usually multivariate): Example: use a regression model trained on other features in the data to predict what the missing value should be.
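A hedged scikit-learn sketch of univariate (median) and multivariate (algorithmic) imputation; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric data with missing values
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 58],
    "income": [30000, np.nan, 52000, 61000, np.nan],
})

# Univariate: fill each column with its own median
# (strategy="most_frequent" would give mode imputation for categorical features)
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# Multivariate (algorithmic): model each feature from the other features
iterative_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns
)

print(median_imputed)
print(iterative_imputed)
```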
Handling Abnormal Values
• Examples
• Numerical outliers
• Unexpected categories
• Impossible values
• Missing values can be considered “abnormal”
• Sometimes abnormal values are informative (they help us understand
how to predict or model relationships in data)
• Other times abnormal values may bias our analyses, especially if they
are impossible or improbable
Abnormal values, what to do?
Feature Engineering
Feature engineering is the process of taking our data and encoding each feature in a way that maximizes the predictive performance of our machine learning models.
Encoding Categorical Features
• Many algorithms will not accept categorical (string) features
• We need to meaningfully represent categories using numbers
• Common strategies
• Ordinal encoding
• Dummy encoding
• Advanced strategies
• Target encoding
• Other supervised/unsupervised algorithmic methods
Ordinal Encoding
• Replace values of categorical features with ordinal numbers
Education    Education (encoded)
BS           1
MSc          2
MSc          2
PhD          3
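One way to apply ordinal encoding in pandas, with an explicit mapping so the numbers reflect the intended order (BS < MSc < PhD):

```python
import pandas as pd

df = pd.DataFrame({"Education": ["BS", "MSc", "MSc", "PhD"]})

# Explicit mapping preserves the intended ordering of the categories
education_order = {"BS": 1, "MSc": 2, "PhD": 3}
df["Education_encoded"] = df["Education"].map(education_order)
print(df)
```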
One-Hot Encoding
Take a categorical feature with C categories and convert it into C new
binary features indicating whether an observation was in that category.
Education Education_BS Education_MSc Education_PhD
BS 1 0 0
MSc 0 1 0
MSc 0 1 0
PhD 0 0 1
Dummy Encoding
• Converting a feature with C categories into C new features (One-Hot Encoding) produces binary columns that always sum to one, so together with an intercept (constant) column the feature matrix is not full rank. Some algorithms will return errors in this scenario.
• Solution: make C-1 new features instead
Education Education_BS Education_MSc
BS 1 0
MSc 0 1
MSc 0 1
PhD 0 0
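Both encodings can be produced with pandas get_dummies; setting drop_first=True gives the C-1 dummy-encoded version (note it drops the first category alphabetically, which may differ from the table above):

```python
import pandas as pd

df = pd.DataFrame({"Education": ["BS", "MSc", "MSc", "PhD"]})

# One-hot encoding: C binary columns
one_hot = pd.get_dummies(df["Education"], prefix="Education")

# Dummy encoding: C-1 binary columns (one reference category dropped)
dummy = pd.get_dummies(df["Education"], prefix="Education", drop_first=True)

print(one_hot)
print(dummy)
```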
Encoding Numeric Features
• Most algorithms will accept numeric features in their original form,
but sometimes modifying the feature can improve predictive
performance
• Common strategies
• Scaling and/or centering
• Distribution-based transformations
• Interactions
• Non-linear representations
• For some algorithms, features need to be on the same scale (e.g., between 0 and 1) before we can make predictions.
Scaling and Centering
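A minimal sketch of the two most common options, min-max scaling (to the range [0, 1]) and standardization (centering to mean 0 and scaling to standard deviation 1), using scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [6.0], [10.0]])  # one made-up numeric feature

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1]
x_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, result has mean 0 and std 1
x_standard = StandardScaler().fit_transform(X)

print(x_minmax.ravel())
print(x_standard.ravel())
```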
Transformations
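Presumably this refers to the distribution-based transformations listed earlier; a common example is a log transform for right-skewed features. A minimal sketch on made-up values:

```python
import numpy as np

# Made-up right-skewed feature (e.g., income-like values)
x = np.array([1_000, 2_000, 3_500, 8_000, 120_000], dtype=float)

# log1p = log(1 + x): compresses large values and handles zeros safely
x_log = np.log1p(x)
print(x_log)
```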
Interactions
• Interactions occur when the relationship between one feature
with a target variable depends on the value of another feature.
[Figure: scatter plot of a numeric feature (x-axis) against the target (y-axis), colored by a categorical feature. The relationship between the numeric feature and the target depends on the categorical feature.]
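One simple way to encode an interaction explicitly is to multiply the two features; the columns below are made up for illustration:

```python
import pandas as pd

# Made-up data: a numeric feature and a 0/1-encoded categorical feature
df = pd.DataFrame({
    "dose": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "is_treated": [0, 0, 0, 1, 1, 1],
})

# Interaction term: lets a linear model fit a different slope per group
df["dose_x_treated"] = df["dose"] * df["is_treated"]
print(df)
```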
Non-Linear Representations
• Some relationships between features and target variables are non-linear. If our prediction algorithm cannot automatically account for non-linearity, we must do so ourselves.
Example: the relationship
between step counts per day (x-
axis) and the risk of dying (y-axis)
is non-linear
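One common way to hand-craft a non-linear representation for a linear model is to add polynomial terms; a scikit-learn sketch on a hypothetical step-count feature:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature: daily step counts
steps = np.array([[2000], [5000], [8000], [12000]])

# Adds steps^2 alongside steps so a linear model can fit a curved relationship
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(steps))
```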
Clustering
• Clustering takes data points and assigns
them into groups (“clusters”).
• Clustering for understanding: identify
clusters that are meaningful and help
understand the data more clearly.
• Clustering for utility: identify clusters that
simplify or make other analyses easier.
Clustering Example (Understanding)
• Identifying groups of people with similar sleep patterns and then using
statistical models to examine the relationships of the groups with cancer
[Diagram: smartwatch sleep data → clusters (Group 1: early sleepers; Group 2: late sleepers; Group 3: short sleepers & nappers) → examine relationships with cancer using statistical models]
Challenges of Clustering
• The definition of a “cluster” is
often not clear
• If we want to cluster for
understanding, we need to make
sure the clusters meaningfully
represent the data
• Different clustering algorithms
can provide very different results
Useful Clustering Definitions
• Partitional versus hierarchical
• Partitional – groups are distinct and non-overlapping
• Hierarchical – groups are nested (tree-like structure)
• Exclusive, overlapping, and fuzzy clustering
• Exclusive – each object is assigned to one cluster
• Overlapping – objects can be assigned to more than one cluster
• Fuzzy – group membership is defined by a probability
• Complete versus partial
• Complete – every object is assigned to a cluster
• Partial – not every object is assigned to a cluster
Common Clustering Algorithms
• K-Means
• Agglomerative Hierarchical Clustering
• DBSCAN
K-Means Clustering
• Center-based clusters – we assign points to clusters based on their
similarity to a central point (centroid) within each cluster
• Centroid – the center of a cluster, defined by the mean (usually)
Algorithm
1. Select K points as initial centroids (K = number of clusters you want)
2. Repeat
3. Calculate distance between every point and every centroid
4. Assign every point to a cluster based on the closest centroid
5. Recompute each centroid as the mean of the points assigned to it
6. Stop when: centroids do not change (or when they change by less than X%)
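A scikit-learn sketch of the algorithm above on randomly generated data; by default sklearn's KMeans uses the k-means++ initialization mentioned later:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two made-up blobs of 2D points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# K = 2 clusters; n_init controls how many random restarts are tried
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids
print(labels[:10])              # cluster assignments of the first 10 points
```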
K-Means Illustration
[Figure: cluster assignments and centroid positions over iterations 1 through 6, with the start (iteration 1) and end (iteration 6) compared side by side; the marked points are the centroids.]
K-Means Drawbacks
• The number of clusters (K) you tell K-Means to find
may not meaningfully represent “real” clusters in
the data
• The final clusters will change based on the initially
specified centroids
• The k-means++ algorithm provides a more robust way
of selecting initial centroids.
• Sensitive to outliers
K-Means Mathematical Details
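The equations from this slide are not reproduced here, but the quantity K-Means is usually described as minimizing is the within-cluster sum of squared errors (SSE); a standard way to write it is:

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

where C_i is the set of points in cluster i and mu_i is its centroid; each iteration reassigns points to the nearest centroid and then recomputes each centroid as this mean.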
Hierarchical Clustering
• Nested clusters that can be visualized using a tree structure (dendrogram).
• Strategies:
• Agglomerative: start with all points as clusters and merge until only one cluster is left (bottom of tree -> top)
• Divisive: start with one all-inclusive cluster and split until each cluster contains one point (top of tree -> bottom)
[Figure: dendrogram for points A, B, C: the full data {A, B, C} splits into {A} and {B, C}, which in turn splits into {B} and {C}.]
Hierarchical Clustering
• Does not depend on a pre-specified number of clusters like K-Means
• We can decide where to “cut” the tree to define the groups
• However, it does not minimize a global objective function, making the “goodness” of the clusters difficult to determine.
[Figure: the same dendrogram for points A, B, C.]
Agglomerative Hierarchical Clustering
Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the 2 closest clusters
5. Update the proximity matrix
6. Stop when: only a single cluster is left
Note: the proximity matrix quantifies the distance between clusters in the data.
Agglomerative Hierarchical Clustering
• One challenge is defining the “distance” between clusters
• MIN: distance between closest points in different clusters
• MAX: distance between furthest points in different clusters
• Group average: average pairwise proximity
• Centroid-based approaches: (not shown)
[Figure: MIN, MAX, and group average linkage illustrated]
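A short scipy sketch of agglomerative clustering on made-up data; method="single", "complete", and "average" correspond to the MIN, MAX, and group average rules above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # made-up 2D points

# Build the merge tree; try "single" (MIN), "complete" (MAX), or "average"
Z = linkage(X, method="average")

# "Cut" the tree into 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib
```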
DBSCAN
• DBSCAN intuition: we can define clusters based on regions of high
density that are separated by regions of lower density
What regions do you think should belong to different clusters?
DBSCAN Definitions
Density: The number of points within a given radius (Eps).
Core point: A point that has at least a pre-specified number of points (MinPts) within a radius (Eps).
Border point: A point that is not a core point, but is within the neighborhood of a core point.
Noise point: Any point that is neither a core point nor a border point.
Visual Intuition
Eps = 10, MinPts = 4
DBSCAN
Algorithm
1. Label all points as core, border, or noise
2. Eliminate noise points
3. Put an edge between all core points within a distance Eps of each other
4. Make each group of connected core points a separate cluster
5. Assign each border point to one of the clusters of its associated core points
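A scikit-learn sketch of DBSCAN on made-up data; eps and min_samples play the roles of Eps and MinPts above, and the values here are arbitrary:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense made-up blobs plus a few scattered points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=6, size=(10, 2)),
])

db = DBSCAN(eps=0.5, min_samples=4)
labels = db.fit_predict(X)

# Points labeled -1 are noise points (potential outliers)
print(np.unique(labels, return_counts=True))
```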
DBSCAN Results
[Figure: original data (left) and DBSCAN cluster assignments (right)]
** Dark blue = noise points. We can also use DBSCAN to identify outliers!
DBSCAN Difficulties
1. Determining the radius (Eps) and
number of points (MinPts) used to
define core, border, and noise
points.
2. Struggles when regions have varying
densities
3. Struggles with high dimensional
data (“density” is trickier in higher
dimensions)
I Have Clusters, Now What?
• Clustering algorithms will return clusters for random data
• There is a risk the clusters you derive are meaningless
• Strategies for determining cluster validity:
• Cluster cohesion: are points within a cluster close together?
• Cluster separation: are points in different clusters far apart?
[Figure: cohesion (points within a cluster close together) and separation (points in different clusters far apart) illustrated]
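One widely used number that combines cohesion and separation is the silhouette score; a sketch on the same kind of made-up blobs used in the K-Means example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 to 1; higher means tighter, better-separated clusters
print(silhouette_score(X, labels))
```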
Other Cluster Validity Measures
• If clustering is a part of your feature engineering
task for a machine learning model, you can examine
how the clusters relate to your target variable on
the training data.
• Can also visually examine the correlation between the proximity matrix and an “ideal” similarity matrix (1 if a pair of points belongs to the same cluster and 0 if it does not).
Now onto coding!