Introduction to
Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)
School of Information and Communication Technology
Hanoi University of Science and Technology
2025
Content
Introduction to Machine Learning & Data Mining
Data crawling and pre-processing
Supervised learning
Unsupervised learning
Practical advice
Time resources
How much time is spent on data analysis?
• Data collection: 19%
• Data organization and cleaning: 60%
• Training dataset creation: 3%
• Mining: 9%
• Algorithm improvement: 4%
• Other: 5%
Why?
What is preprocessing for?
• Convenient for storage and querying.
• Machine learning models often work with structured data: matrices, vectors, strings, etc.
• Machine learning often works effectively if there is a suitable data representation.
Input: original problem → Output: numerical format (vector, matrix, …)
$\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, where each sample is a numerical vector, e.g. $x^{(n)} = [-0.0920,\ 3.4931,\ -1.8493,\ \ldots,\ -0.2010,\ -1.3079]^{\top}$.
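As a small illustration (NumPy is an assumption; the slides do not name a library), such a dataset of n samples with d numerical features is simply an n × d matrix whose rows are the vectors $x^{(i)}$; all values below are made up:

```python
import numpy as np

# Illustrative dataset D: n = 3 samples, d = 5 numerical features;
# each row plays the role of one feature vector x^(i).
D = np.array([
    [-0.0920,  3.4931, -1.8493, -0.2010, -1.3079],
    [ 0.5310, -0.7742,  2.1105,  0.0034,  0.9821],
    [ 1.2043,  0.1187, -0.6652,  1.7408, -0.4466],
])

print(D.shape)   # (3, 5): n samples x d features
print(D[0])      # the first sample x^(1)
```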
How?
Data collection
Sampling
Techniques: crawling, logging, scraping
Data processing
Noise filtering, cleaning, digitizing,…
Data collection
Input: original problem → Output: data samples
Fundamentals :: Sampling
“One or more small spoons can be enough to assess whether the soup is good or not.”
WHAT – take a small, representative sample set to stand for the area to be studied.
WHY – we cannot study the whole thing; we are limited by time and computing power.
HOW – collect samples from reality, or from sources containing data: the web, databases,…
https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
“One or more small spoons can be enough to assess whether the soup is good or not. Remember to stir to avoid tasting biases.”
Variety – the sample set obtained is diverse enough to cover all contexts of the domain.
Bias – the data needs to be general, not biased towards a small part of the domain.
https://www.coursera.org/learn/inferential-statistics-intro
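One common tactic for keeping a sample both varied and unbiased is stratified sampling: draw from each sub-group of the domain in proportion to its size. The sketch below is only an illustration of that idea and is not taken from the slides; it assumes scikit-learn's `train_test_split` with the `stratify` argument, and the document pool and labels are invented.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical pool of labelled documents: 90% "sports", 10% "politics".
docs = [f"doc_{i}" for i in range(1000)]
labels = ["sports"] * 900 + ["politics"] * 100

# Stratified sampling keeps the class proportions in the selected subset,
# so the rare "politics" group is not accidentally under-represented.
sample, _, sample_labels, _ = train_test_split(
    docs, labels, train_size=100, stratify=labels, random_state=42
)

print(Counter(sample_labels))  # roughly {'sports': 90, 'politics': 10}
```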
Fundamentals :: Sampling :: How
Variety – are the samples diverse enough to reflect objectivity?
(Figure: FiveThirtyEight's 2016 US election forecast vs. the actual results.)
https://projects.fivethirtyeight.com/2016-election-forecast/
http://edition.cnn.com/election/results/president
https://www.coursera.org/learn/inferential-statistics-intro
Image credit: Wikipedia, FiveThirtyEight
Techniques
• Crowd-sourcing: conduct surveys.
• Logging: save user interaction history, product access,…
• Scraping: search for data sources on websites; download, extract, filter,…
Techniques :: Scraping :: DEMO
Objective: data for a text classification problem in the newspaper domain.
DEMO: newspaper data crawling system
DEMO
Input: problem – classify documents → Output: data samples – press articles with their corresponding labels
DEMO :: Steps
RSS feed → Item → Content
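A minimal sketch of this RSS → item → content flow (the actual demo system is not shown in the slides; the feed URL and the libraries `feedparser`, `requests`, and `BeautifulSoup` are assumptions):

```python
import feedparser                  # parse the RSS feed
import requests
from bs4 import BeautifulSoup      # extract text from the article HTML

FEED_URL = "https://example.com/news/rss"         # hypothetical feed URL

feed = feedparser.parse(FEED_URL)
samples = []
for item in feed.entries[:5]:                     # RSS -> items
    html = requests.get(item.link, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")     # item -> content
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    samples.append({"title": item.title, "label": "news", "text": text})

print(len(samples), samples[0]["title"] if samples else "no items")
```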
DEMO :: Sample
Data preprocessing
Input: raw data → Output: numerical format
$[\mathcal{D}]$ collects the samples, each in numerical form, e.g. $x^{(n)} = [-0.0920,\ 3.4931,\ -1.8493,\ \ldots,\ -0.2010,\ -1.3079]^{\top}$.
Fundamentals :: Data “rawness”
Completeness – each collected sample should have complete information for the necessary attribute fields.
Integrity – the source of collection is authentic, ensuring that the collected sample contains the correct value in reality. (Jan. 1 as everyone's birthday? – intentional, systematic noise.)
Homogeneity – uniform representation; inconsistencies such as Rating “1, 2, 3” & “A, B, C”, or Age = “42” & Birthday = “03/07/2010”.
Structures – heterogeneous data sources / schemas.
Techniques
Cleaning
Integrating
Transforming
Techniques :: Cleaning
Completeness + Integrity
• Data samples should be collected from reliable sources and reflect the problem to be solved.
• Eliminate noise (outliers): remove data samples that are significantly different from the other samples.
• A data sample may be empty (missing, incomplete) and needs an appropriate strategy:
  • Ignore it, do not include it in the analysis?
  • Fill in the missing fields of the sample?
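A minimal sketch of outlier elimination using a z-score rule; the 3-standard-deviation threshold and the synthetic data are assumptions, not something the slides prescribe:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 3))     # 100 "clean" samples
X = np.vstack([X, [[12.0, -9.0, 15.0]]])    # one obvious outlier

def remove_outliers(X, threshold=3.0):
    """Drop rows whose z-score exceeds `threshold` in any feature column."""
    z = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12))
    return X[(z < threshold).all(axis=1)]

print(X.shape, "->", remove_outliers(X).shape)   # (101, 3) -> roughly (100, 3)
```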
Techniques :: Cleaning
Fill missing values (a sketch of these strategies follows the example table below):
• Fill in the value manually.
• Assign a special or out-of-range label/value.
• Assign the mean value.
• Assign the mean value of other samples in the same class.
• Find the value with the highest probability to fill in the missing space (regression, Bayesian inference,…).
A1 A2 A3 A4 A5 A6 A7 A8 y
? 3.683 ? -0.634 1 0.409 7 30 5
? ? 60 1.573 0 0.639 7 30 5
? 3.096 67 0.249 0 0.089 ? 80 3
2.887 3.870 68 -1.347 ? 1.276 ? 60 5
2.731 3.945 79 1.967 1 2.487 ? 100 4
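A minimal pandas sketch of the mean and per-class-mean strategies on a table like the one above; the column names and values below are illustrative, not the slide's exact data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A1": [np.nan, np.nan, np.nan, 2.887, 2.731],
    "A2": [3.683, np.nan, 3.096, 3.870, 3.945],
    "A3": [np.nan, 60, 67, 68, 79],
    "y":  [5, 5, 3, 5, 4],                       # class label
})

# Strategy: fill each attribute with its column mean.
df_mean = df.fillna(df.drop(columns="y").mean())

# Strategy: fill with the mean of samples in the same class y
# (a class with no observed value for an attribute keeps NaN).
df_class = df.copy()
for col in ["A1", "A2", "A3"]:
    df_class[col] = df_class.groupby("y")[col].transform(lambda s: s.fillna(s.mean()))

print(df_mean)
print(df_class)
```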
Techniques :: Cleaning (cont.)
Homogeneity
• Data samples need to be uniform in representation and notation.
• Non-uniform examples: Rating “1, 2, 3” & “A, B, C”; Age = 42 & Birthday = 03/08/2020.
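A minimal sketch of making notation uniform: letter ratings are mapped onto the numeric scale, and age is recomputed from the birthday so the two fields cannot disagree. The column names, the mapping, and the reference date are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "rating": ["1", "A", "3", "B"],
    "birthday": ["03/08/2020", "15/02/1983", "01/01/1990", "20/11/2001"],
})

# Unify the rating notation: letters -> the same numeric scale (assumed mapping).
df["rating"] = df["rating"].replace({"A": "1", "B": "2", "C": "3"}).astype(int)

# Derive age from birthday instead of storing both, avoiding Age/Birthday conflicts.
birth = pd.to_datetime(df["birthday"], format="%d/%m/%Y")
df["age"] = (pd.Timestamp("2025-01-01") - birth).dt.days // 365   # assumed reference date

print(df)
```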
Techniques :: Integrating w/ some Transforming
Un-structured data: texts in websites, emails, articles, tweets; 2D/3D images, videos + metadata; spectrograms, DNAs, …
Image credits: Wikipedia, Shutterstock, CNN
Techniques :: Transforming
Semantics?
Extract semantic features, normalize
Semantics example: visual data
(Figure: low-level semantics (raw pixels) vs. mid-/high-level semantics (human-interpretable features), e.g. class scores such as “cat 0.28, human 0.17, car 0.08, ground 0.25, building 0.22” and relations such as “cat not on car”, “people behind building”, “car is red”.)
Minimum semantic levels needed for understanding:
• Text classification
• Sentiment analysis
• AI chatbot (multiple semantic levels)
Image credits: CS231n, Stanford University; Lee et al., 2009; Socher et al., 2011
Techniques :: Transforming (cont.)
Goal: extract semantic features…
• Each specific field and each type of data uses different semantic feature extraction techniques (text data, images, …).
… and standardize:
• Feature discretization: some attributes are more effective when their values are grouped (e.g. one-hot encoding).
• Feature normalization: standardizes attribute values to the same value domain, making them easier to compute with, e.g. the z-score $\frac{x - \bar{x}}{s}$.
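A minimal sketch of both transformations, assuming scikit-learn (the feature names and values are illustrative): one-hot encoding for a categorical attribute and z-score normalization $\frac{x - \bar{x}}{s}$ for a numeric one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical attribute -> one-hot (0/1 indicator) vectors.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()
print(onehot)          # each row becomes an indicator vector

# Numeric attribute -> z-score normalization (x - mean) / std.
ages = np.array([[18.0], [42.0], [30.0], [65.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled)          # zero mean, unit variance
```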
Techniques :: Transforming (cont.)
Dimensionality reduction:
• Helps reduce the size of the data while retaining its core semantics.
• Helps speed up the learning or knowledge-discovery process.
Some strategies:
• Feature selection: irrelevant or redundant attributes can be deleted or eliminated.
• Dimension reduction: use algorithms (e.g. PCA, ICA, LDA, …) to transform the original data into a space with fewer dimensions.
• Abstraction: raw data values are replaced with abstract concepts.
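A minimal PCA sketch with scikit-learn; the 10-dimensional dataset below is synthetic and only illustrates projecting onto a lower-dimensional space while keeping most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                      # 2 underlying factors
X = latent @ rng.normal(size=(2, 10)) \
    + 0.05 * rng.normal(size=(200, 10))                 # 200 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                        # project onto 2 main directions

print(X_reduced.shape)                                  # (200, 2)
print(pca.explained_variance_ratio_)                    # fraction of variance kept per component
```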
Techniques :: Transforming
example & demo
Transforming text data
DEMO
Input: raw JSON text → Output: numerical representation → ML/AI model(s)
DEMO :: Steps
Data input → Tokenize → Dictionary → TF-IDF vector
DEMO :: Exercise
Problem: compute vector representations of text with a small data set.
Data: 2 articles from the Dan Tri news site.
Requirements:
• Use the word segmentation module.
• Build a dictionary from the 2 documents.
• Use a stopword list to filter out stopwords.
• Convert the 2 documents into 2 TF-IDF vectors.
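One possible way to satisfy these requirements, as a hedged sketch only: the library choices are assumptions (underthesea for Vietnamese word segmentation, scikit-learn's TfidfVectorizer for the dictionary and vectors), the file paths are placeholders, and the stopword list is illustrative.

```python
from underthesea import word_tokenize                    # Vietnamese word segmentation
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [open("article1.txt", encoding="utf-8").read(),   # placeholder paths for the
        open("article2.txt", encoding="utf-8").read()]   # 2 Dan Tri articles

stopwords = ["và", "là", "của", "những"]                 # illustrative stopword list

def segment(text):
    # join multi-word tokens with "_" so they stay single dictionary entries
    return word_tokenize(text, format="text")

vectorizer = TfidfVectorizer(preprocessor=segment, stop_words=stopwords)
tfidf = vectorizer.fit_transform(docs)                   # 2 x |dictionary| matrix

print(len(vectorizer.vocabulary_))                       # dictionary size
print(tfidf.toarray())                                   # the 2 TF-IDF vectors
```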
Summary
(Take-home messages)
Before entering the machine learning system, data from a domain must be collected and represented in a structured form with certain characteristics: complete, low in noise, consistent, and with a well-defined structure.
The data collected for the learning process is a small set, but it needs to reflect all aspects of the problem to be solved.
After collection and pre-processing, the data must still retain its semantic features in full: the features that affect the ability to solve the problem.