L2 Data Crawling Preprocessing

The document provides an introduction to machine learning and data mining, covering topics such as data crawling, pre-processing, supervised and unsupervised learning. It emphasizes the importance of structured data for effective machine learning and outlines techniques for data collection, cleaning, and transformation. Key takeaways include the necessity for data to be complete, consistent, and reflective of the problem domain before being used in machine learning models.


Introduction to Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)

School of Information and Communication Technology
Hanoi University of Science and Technology
2025
Content
 Introduction to Machine Learning & Data Mining
 Data crawling and pre-processing
 Supervised learning
 Unsupervised learning
 Practical advice
Time resources
 How much time is spent on data analysis?
  Data collection: 19%
  Data organization and cleaning: 60%
  Training dataset creation: 3%
  Mining: 9%
  Algorithm improvement: 4%
  Other: 5%
Why?
What is preprocessing for?
 Convenient for storage and querying
 Machine learning models often work with structured data: matrices, vectors, strings, etc.
 Machine learning often works effectively if there is a suitable data representation

Input: original problem  →  Output: numerical format (vector, matrix, …)

The dataset is a collection of feature vectors, 𝒟 = { x^(1), x^(2), …, x^(n) }, where each sample is a numeric vector, e.g. x^(n) = [−0.0920, 3.4931, −1.8493, −0.2010, −1.3079]^T.
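This numerical representation can be made concrete with a tiny sketch: each sample becomes a fixed-length vector, and the vectors are stacked row-wise into a matrix. The first row uses the values from the slide; the second row is made up for illustration.

```python
import numpy as np

# A dataset D of n samples, each a 5-dimensional feature vector,
# stacked row-wise into an (n x 5) matrix.
x1 = np.array([-0.0920, 3.4931, -1.8493, -0.2010, -1.3079])
x2 = np.array([0.5000, -1.2000, 0.0000, 2.1000, -0.3000])  # made-up sample
D = np.vstack([x1, x2])

print(D.shape)  # (2, 5)
```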
How?
 Data collection
  Sampling
  Techniques: crawling, logging, scraping
 Data processing
  Noise filtering, cleaning, digitizing, …
Data collection

Input: original problem  →  Output: data samples
Fundamentals :: Sampling
WHAT – take a small, representative sample set to stand for the area to be studied.
WHY – we cannot study the whole thing; we are limited by time and computing power.
HOW – collect samples from reality, or from sources containing data: the web, databases, …

“One or more small spoon(s) can be enough to assess whether the soup is good or not.”

https://www.coursera.org/learn/inferential-statistics-intro
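The simplest sampling scheme, a simple random sample, can be sketched with Python's standard library (the population, sample size, and seed below are illustrative):

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k items uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(population, k)

# "Taste a few spoonfuls" of a population of 1000 items.
population = list(range(1000))
sample = simple_random_sample(population, 10)
```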
Fundamentals :: Sampling :: How
Variety – the sample set obtained should be diverse enough to cover all contexts of the domain.
Bias – the data needs to be general, not biased towards a small part of the domain.

“One or more small spoon(s) can be enough to assess whether the soup is good or not. Remember to stir to avoid tasting biases.”

https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
Variety – are the samples diverse enough to reflect objectivity?

(Figure: 2016 US election forecasts vs. actual results)
https://projects.fivethirtyeight.com/2016-election-forecast/
http://edition.cnn.com/election/results/president
https://www.coursera.org/learn/inferential-statistics-intro
Image credit: Wikipedia, FiveThirtyEight
Techniques
 Crowd-sourcing: conduct surveys
 Logging: save user interaction history, product access, …
 Scraping: search for data sources on websites; download, extract, filter, …

Techniques :: Scraping :: DEMO
 Objective: data for a text classification problem in the newspaper domain.
 DEMO: newspaper data crawling system
DEMO

Input: problem – classify documents  →  Output: data samples – press articles and their corresponding labels
DEMO :: Steps

RSS → Item → Content
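The RSS → Item → Content pipeline can be sketched as follows. A real crawler would first download the feed (e.g. with urllib.request) and then fetch each item's link; here the feed XML is hard-coded so the example stays self-contained, following the standard RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Hard-coded RSS 2.0 feed standing in for a downloaded newspaper feed.
RSS = """<rss version="2.0"><channel>
  <title>Demo feed</title>
  <item>
    <title>Article 1</title>
    <link>http://example.com/1</link>
    <description>First article body.</description>
  </item>
  <item>
    <title>Article 2</title>
    <link>http://example.com/2</link>
    <description>Second article body.</description>
  </item>
</channel></rss>"""

def parse_feed(xml_text):
    """RSS -> Item: extract (title, link, description) from each <item>."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"), i.findtext("description"))
            for i in root.iter("item")]

items = parse_feed(RSS)
```

In a full crawler, each item's link would then be fetched to obtain the article content (the last step of the pipeline).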
DEMO :: Sample
Data preprocessing

Input: raw data  →  Output: numerical format

𝒟 = { x^(1), …, x^(n) },  e.g.  x^(n) = [−0.0920, 3.4931, −1.8493, −0.2010, −1.3079]^T
Fundamentals :: Data “rawness”

Completeness
Each collected sample should have complete information for the necessary attribute fields.

Integrity
The source of collection is authentic, ensuring that the collected sample contains the correct real-world value.
 Jan. 1 as everyone’s birthday? – intentional (systematic) noise

Homogeneity
 Rating “1, 2, 3” vs. “A, B, C”; or Age = “42” while Birthday = “03/07/2010”  inconsistency

Structure
 Heterogeneous data sources / schemas

Techniques
 Cleaning
 Integrating
 Transforming
Techniques :: Cleaning
Completeness + Integrity
• Data samples should be collected from reliable sources and reflect the problem to be solved.
• Eliminate noise (outliers): remove data samples that differ significantly from the other samples.
• A data sample may be empty (missing, incomplete); an appropriate strategy is needed:
  • Ignore it, and do not include it in the analysis?
  • Fill in the missing fields of the sample?
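One way to eliminate outliers, as described above, is a score-based filter. The sketch below uses a median-based (modified z-score) rule rather than mean/stdev; the 3.5 threshold and the data are illustrative, and other cut-offs are equally common.

```python
import statistics

def remove_outliers(values, z_max=3.5):
    """Keep values whose modified z-score (median/MAD based) is <= z_max.
    Median-based scores are robust: one huge value cannot inflate the
    scale estimate and thereby hide itself, unlike a mean/stdev z-score."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= z_max]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]   # 55.0 is an obvious outlier
clean = remove_outliers(data)
```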
Techniques :: Cleaning
Fill missing values
 Fill in the value manually
 Assign a special or out-of-range label value
 Assign the mean value of the attribute
 Assign the mean value of other samples in the same class
 Find the most probable value to fill in the missing space (regression, Bayesian inference, …)

  A1     A2     A3   A4      A5   A6     A7   A8    y
  ?      3.683  ?    -0.634  1    0.409  7    30    5
  ?      ?      60   1.573   0    0.639  7    30    5
  ?      3.096  67   0.249   0    0.089  ?    80    3
  2.887  3.870  68   -1.347  ?    1.276  ?    60    5
  2.731  3.945  79   1.967   1    2.487  ?    100   4
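The simplest of these strategies, mean imputation, can be sketched in a few lines (using column A3 from the table, with the “?” entries represented as None):

```python
# Column A3 from the table above; '?' entries become None.
col = [None, 60, 67, 68, 79]

observed = [v for v in col if v is not None]
mean = sum(observed) / len(observed)          # (60 + 67 + 68 + 79) / 4 = 68.5
filled = [mean if v is None else v for v in col]
```

A class-conditional variant would compute the mean only over the samples that share the missing sample’s label y.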
Techniques :: Cleaning (cont.)
Homogeneity
Data samples need to be uniform in representation and notation.

Non-uniform examples:
 Rating “1, 2, 3” vs. “A, B, C”
 Age = 42 while Birthday = 03/08/2020
Techniques :: Integrating (with some Transforming)

Un-structured data: texts in websites, emails, articles, tweets; 2D/3D images, videos + metadata; spectrograms, DNA sequences, …

Image credits: Wikipedia, Shutterstock, CNN
Techniques :: Transforming
Semantics?
Extract semantic features, then normalize.
Semantics example: visual data

Low-level semantics (raw pixels) vs. mid-/high-level semantics (e.g. human-interpretable features):
 object scores: cat 0.28, human 0.17, car 0.08, ground 0.25, building 0.22
 relations: cat not on car; people behind building; car is red

Minimum semantic levels for understanding:
• Text classification
• Sentiment analysis
• AI chatbot (multiple semantic levels)

Image credits: CS231n, Stanford University; Lee et al., 2009; Socher et al., 2011
Techniques :: Transforming (cont.)
 Goal: extract semantic features …
• Each specific field and each type of data uses different semantic feature extraction techniques (text data, images, …).

… and standardize:
• Feature discretization: some attributes are more effective when their values are grouped.
• Feature normalization: standardizes attribute values to the same value domain, making them easier to compute with. Examples: one-hot encoding; z-score standardization z = (x − x̄) / s.
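As a minimal sketch of the two normalization ideas just mentioned (the function names are my own, and this z-score uses the population standard deviation):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == v else 0 for v in vocabulary]

def standardize(xs):
    """Z-score standardization: z = (x - mean) / s, putting all
    attributes onto a comparable scale."""
    mean = sum(xs) / len(xs)
    s = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / s for x in xs]

print(one_hot("B", ["A", "B", "C"]))  # [0, 1, 0]
```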
Techniques :: Transforming (cont.)
Dimensionality reduction:
 Helps reduce the size of the data while retaining its core semantics.
 Helps speed up the learning or knowledge discovery process.

Some strategies:
 Feature selection: irrelevant or redundant attributes can be deleted or eliminated.
 Dimension reduction: use algorithms (e.g. PCA, ICA, LDA, …) to transform the original data into a space with fewer dimensions.
 Abstraction: raw data values are replaced with abstract concepts.
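A minimal PCA sketch, using the SVD of the centered data matrix (one of several equivalent ways to implement it; the toy data is random):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Steps: center the data, take the SVD of the centered matrix,
    and keep the first k right singular vectors as the new axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # shape: (n_samples, k)

X = np.random.RandomState(0).randn(50, 5)  # toy data
Z = pca(X, 2)
```

By construction, the first projected coordinate captures at least as much variance as the second.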

Techniques :: Transforming – example & demo

Transforming text data
DEMO

Input: raw JSON text  →  Output: numerical representation  →  ML/AI model(s)
DEMO :: Steps

Data input → Tokenize → Dictionary → tf-idf vector
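The Tokenize → Dictionary → tf-idf pipeline can be sketched in plain Python. Whitespace splitting stands in for a real word segmenter, and this idf variant (log N/df, no smoothing) is just one of several common definitions:

```python
import math

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy corpus

# 1) Tokenize (whitespace split stands in for real word segmentation).
tokenized = [d.split() for d in docs]

# 2) Build the dictionary (vocabulary) of all observed terms.
vocab = sorted({t for doc in tokenized for t in doc})

# 3) tf-idf: term frequency times log(N / document frequency).
N = len(tokenized)
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf(doc):
    return [doc.count(t) / len(doc) * math.log(N / df[t]) for t in vocab]

vectors = [tfidf(doc) for doc in tokenized]
```

Note that under this idf, a term appearing in every document (like "the" here) gets weight 0, which is exactly the down-weighting of uninformative words that tf-idf is meant to provide.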
DEMO :: Exercise
 Problem: compute vector representations of texts with a small data set.
 Data: 2 articles from the Dan Tri page.
 Requirements:
• Use the word segmentation module.
• Build a dictionary from the 2 documents.
• Use a stopword list to filter out stopwords.
• Convert the 2 documents into 2 tf-idf vectors.
Summary (take-home messages)

 Before entering the machine learning system, data in a field must be collected and represented in a structured form with some characteristics: complete, low in noise, consistent, and with a defined structure.
 The data collected for the learning process is a small set, but it needs to fully reflect the aspects of the problem to be solved.
 After collection and pre-processing, the data must retain the full set of semantic features – the features that affect the ability to solve the problem.
