L2 Data Crawling Preprocessing

The document provides an introduction to machine learning and data mining, covering topics such as data crawling, pre-processing, supervised and unsupervised learning. It emphasizes the importance of structured data for effective machine learning and outlines techniques for data collection, cleaning, and transformation. Key takeaways include the necessity for data to be complete, consistent, and reflective of the problem domain before being used in machine learning models.


Introduction to Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)

School of Information and Communication Technology
Hanoi University of Science and Technology
2025
Content
 Introduction to Machine Learning & Data Mining
 Data crawling and pre-processing
 Supervised learning
 Unsupervised learning
 Practical advice
Time resources
 How much time is spent on data analysis?
  Data collection: 19%
  Data organization and cleaning: 60%
  Training dataset creation: 3%
  Mining: 9%
  Algorithm improvement: 4%
  Other: 5%
Why?
What is preprocessing for?
 Convenient for storage and querying
 Machine learning models often work with structured data: matrices, vectors, strings, etc.
 Machine learning often works effectively if there is a suitable data representation

Input: original problem  →  Output: numerical format (vector, matrix, …)

The dataset is a collection of feature vectors, 𝒟 = { x^(1), x^(2), …, x^(n) }, where each sample is a numeric vector, e.g. x^(n) = [−0.0920, 3.4931, −1.8493, −0.2010, −1.3079]^T.
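This numerical representation can be made concrete with a tiny sketch: each sample becomes a fixed-length vector, and the vectors are stacked row-wise into a matrix. The first row uses the values from the slide; the second row is made up for illustration.

```python
import numpy as np

# A dataset D of n samples, each a 5-dimensional feature vector,
# stacked row-wise into an (n x 5) matrix.
x1 = np.array([-0.0920, 3.4931, -1.8493, -0.2010, -1.3079])
x2 = np.array([0.5000, -1.2000, 0.0000, 2.1000, -0.3000])  # made-up sample
D = np.vstack([x1, x2])

print(D.shape)  # (2, 5)
```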
How?
 Data collection
  Sampling
  Techniques: crawling, logging, scraping
 Data processing
  Noise filtering, cleaning, digitizing, …
Data collection

Input: original problem  →  Output: data samples
Fundamentals :: Sampling
WHAT – take a small, representative sample set to stand for the area to be studied.
WHY – we cannot study the whole thing; we are limited by time and computing power.
HOW – collect samples from reality, or from sources containing data: the web, databases, …

“One or more small spoon(s) can be enough to assess whether the soup is good or not.”

https://www.coursera.org/learn/inferential-statistics-intro
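The simplest sampling scheme, a simple random sample, can be sketched with Python's standard library (the population, sample size, and seed below are illustrative):

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k items uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(population, k)

# "Taste a few spoonfuls" of a population of 1000 items.
population = list(range(1000))
sample = simple_random_sample(population, 10)
```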
Fundamentals :: Sampling :: How
Variety – the sample set obtained should be diverse enough to cover all contexts of the domain.
Bias – the data needs to be general, not biased towards a small part of the domain.

“One or more small spoon(s) can be enough to assess whether the soup is good or not. Remember to stir to avoid tasting biases.”

https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
Variety – are the samples diverse enough to reflect objectivity?

(Figure: 2016 US election forecasts vs. actual results)
https://projects.fivethirtyeight.com/2016-election-forecast/
http://edition.cnn.com/election/results/president
https://www.coursera.org/learn/inferential-statistics-intro
Image credit: Wikipedia, FiveThirtyEight
Techniques
 Crowd-sourcing: conduct surveys
 Logging: save user interaction history, product access, …
 Scraping: search for data sources on websites; download, extract, filter, …

Techniques :: Scraping :: DEMO
 Objective: data for a text classification problem in the newspaper domain.
 DEMO: newspaper data crawling system
DEMO

Input: problem – classify documents  →  Output: data samples – press articles and their corresponding labels
DEMO :: Steps

RSS → Item → Content
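The RSS → Item → Content pipeline can be sketched as follows. A real crawler would first download the feed (e.g. with urllib.request) and then fetch each item's link; here the feed XML is hard-coded so the example stays self-contained, following the standard RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Hard-coded RSS 2.0 feed standing in for a downloaded newspaper feed.
RSS = """<rss version="2.0"><channel>
  <title>Demo feed</title>
  <item>
    <title>Article 1</title>
    <link>http://example.com/1</link>
    <description>First article body.</description>
  </item>
  <item>
    <title>Article 2</title>
    <link>http://example.com/2</link>
    <description>Second article body.</description>
  </item>
</channel></rss>"""

def parse_feed(xml_text):
    """RSS -> Item: extract (title, link, description) from each <item>."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"), i.findtext("description"))
            for i in root.iter("item")]

items = parse_feed(RSS)
```

In a full crawler, each item's link would then be fetched to obtain the article content (the last step of the pipeline).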
DEMO :: Sample
Data preprocessing

Input: raw data  →  Output: numerical format

𝒟 = { x^(1), …, x^(n) },  e.g.  x^(n) = [−0.0920, 3.4931, −1.8493, −0.2010, −1.3079]^T
Fundamentals :: Data “rawness”

Completeness
Each collected sample should have complete information for the necessary attribute fields.

Integrity
The source of collection is authentic, ensuring that the collected sample contains the correct real-world value.
 Jan. 1 as everyone’s birthday? – intentional (systematic) noise

Homogeneity
 Rating “1, 2, 3” vs. “A, B, C”; or Age = “42” while Birthday = “03/07/2010”  inconsistency

Structure
 Heterogeneous data sources / schemas

Techniques
 Cleaning
 Integrating
 Transforming
Techniques :: Cleaning
Completeness + Integrity
• Data samples should be collected from reliable sources and reflect the problem to be solved.
• Eliminate noise (outliers): remove data samples that differ significantly from the other samples.
• A data sample may be empty (missing, incomplete); an appropriate strategy is needed:
  • Ignore it, and do not include it in the analysis?
  • Fill in the missing fields of the sample?
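One way to eliminate outliers, as described above, is a score-based filter. The sketch below uses a median-based (modified z-score) rule rather than mean/stdev; the 3.5 threshold and the data are illustrative, and other cut-offs are equally common.

```python
import statistics

def remove_outliers(values, z_max=3.5):
    """Keep values whose modified z-score (median/MAD based) is <= z_max.
    Median-based scores are robust: one huge value cannot inflate the
    scale estimate and thereby hide itself, unlike a mean/stdev z-score."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= z_max]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]   # 55.0 is an obvious outlier
clean = remove_outliers(data)
```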
Techniques :: Cleaning
Fill missing values
 Fill in the value manually
 Assign a special or out-of-range label value
 Assign the mean value of the attribute
 Assign the mean value of other samples in the same class
 Find the most probable value to fill in the missing space (regression, Bayesian inference, …)

  A1     A2     A3   A4      A5   A6     A7   A8    y
  ?      3.683  ?    -0.634  1    0.409  7    30    5
  ?      ?      60   1.573   0    0.639  7    30    5
  ?      3.096  67   0.249   0    0.089  ?    80    3
  2.887  3.870  68   -1.347  ?    1.276  ?    60    5
  2.731  3.945  79   1.967   1    2.487  ?    100   4
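The simplest of these strategies, mean imputation, can be sketched in a few lines (using column A3 from the table, with the “?” entries represented as None):

```python
# Column A3 from the table above; '?' entries become None.
col = [None, 60, 67, 68, 79]

observed = [v for v in col if v is not None]
mean = sum(observed) / len(observed)          # (60 + 67 + 68 + 79) / 4 = 68.5
filled = [mean if v is None else v for v in col]
```

A class-conditional variant would compute the mean only over the samples that share the missing sample’s label y.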
Techniques :: Cleaning (cont.)
Homogeneity
Data samples need to be uniform in representation and notation.

Non-uniform examples:
 Rating “1, 2, 3” vs. “A, B, C”
 Age = 42 while Birthday = 03/08/2020
Techniques :: Integrating (with some Transforming)

Un-structured data: texts in websites, emails, articles, tweets; 2D/3D images, videos + metadata; spectrograms, DNA sequences, …

Image credits: Wikipedia, Shutterstock, CNN
Techniques :: Transforming
Semantics?
Extract semantic features, then normalize.
Semantics example: visual data

Low-level semantics (raw pixels) vs. mid-/high-level semantics (e.g. human-interpretable features):
 object scores: cat 0.28, human 0.17, car 0.08, ground 0.25, building 0.22
 relations: cat not on car; people behind building; car is red

Minimum semantic levels for understanding:
• Text classification
• Sentiment analysis
• AI chatbot (multiple semantic levels)

Image credits: CS231n, Stanford University; Lee et al., 2009; Socher et al., 2011
Techniques :: Transforming (cont.)
 Goal: extract semantic features …
• Each specific field and each type of data uses different semantic feature extraction techniques (text data, images, …).

… and standardize:
• Feature discretization: some attributes are more effective when their values are grouped.
• Feature normalization: standardizes attribute values to the same value domain, making them easier to compute with. Examples: one-hot encoding; z-score standardization z = (x − x̄) / s.
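As a minimal sketch of the two normalization ideas just mentioned (the function names are my own, and this z-score uses the population standard deviation):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == v else 0 for v in vocabulary]

def standardize(xs):
    """Z-score standardization: z = (x - mean) / s, putting all
    attributes onto a comparable scale."""
    mean = sum(xs) / len(xs)
    s = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / s for x in xs]

print(one_hot("B", ["A", "B", "C"]))  # [0, 1, 0]
```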
Techniques :: Transforming (cont.)
Dimensionality reduction:
 Helps reduce the size of the data while retaining its core semantics.
 Helps speed up the learning or knowledge discovery process.

Some strategies:
 Feature selection: irrelevant or redundant attributes can be deleted or eliminated.
 Dimension reduction: use algorithms (e.g. PCA, ICA, LDA, …) to transform the original data into a space with fewer dimensions.
 Abstraction: raw data values are replaced with abstract concepts.
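A minimal PCA sketch, using the SVD of the centered data matrix (one of several equivalent ways to implement it; the toy data is random):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Steps: center the data, take the SVD of the centered matrix,
    and keep the first k right singular vectors as the new axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # shape: (n_samples, k)

X = np.random.RandomState(0).randn(50, 5)  # toy data
Z = pca(X, 2)
```

By construction, the first projected coordinate captures at least as much variance as the second.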

Techniques :: Transforming – example & demo

Transforming text data
DEMO

Input: raw JSON text  →  Output: numerical representation  →  ML/AI model(s)
DEMO :: Steps

Data input → Tokenize → Dictionary → tf-idf vector
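The Tokenize → Dictionary → tf-idf pipeline can be sketched in plain Python. Whitespace splitting stands in for a real word segmenter, and this idf variant (log N/df, no smoothing) is just one of several common definitions:

```python
import math

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy corpus

# 1) Tokenize (whitespace split stands in for real word segmentation).
tokenized = [d.split() for d in docs]

# 2) Build the dictionary (vocabulary) of all observed terms.
vocab = sorted({t for doc in tokenized for t in doc})

# 3) tf-idf: term frequency times log(N / document frequency).
N = len(tokenized)
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf(doc):
    return [doc.count(t) / len(doc) * math.log(N / df[t]) for t in vocab]

vectors = [tfidf(doc) for doc in tokenized]
```

Note that under this idf, a term appearing in every document (like "the" here) gets weight 0, which is exactly the down-weighting of uninformative words that tf-idf is meant to provide.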
DEMO :: Exercise
 Problem: compute vector representations of texts with a small data set.
 Data: 2 articles from the Dan Tri page.
 Requirements:
• Use the word segmentation module.
• Build a dictionary from the 2 documents.
• Use a stopword list to filter out stopwords.
• Convert the 2 documents into 2 tf-idf vectors.
Summary (take-home messages)

 Before entering the machine learning system, data in a field must be collected and represented in a structured form with some characteristics: complete, low in noise, consistent, and with a defined structure.
 The data collected for the learning process is a small set, but it needs to fully reflect the aspects of the problem to be solved.
 After collection and pre-processing, the data must retain the full set of semantic features – the features that affect the ability to solve the problem.
