Data Mining

The document discusses data mining and knowledge discovery from data. It defines data mining as extracting knowledge from large amounts of data and notes it is the core of the knowledge discovery process. The goals of data mining include prediction, identification, classification, and optimization. Some key techniques discussed are classification, prediction, data cleaning and preprocessing methods like handling missing values, binning, smoothing, regression, and clustering. Data transformation methods covered include aggregation, generalization, normalization, and attribute construction.

Uploaded by

Lôny Nêz


Université d’Alger 1

Benyouçef benkhedda

Data Mining
Dr. BOUFENAR Chaouki

Master 1
Ingénierie des Systèmes
Informatiques Intelligents
2018/2019

18/05/2019 Cours de Data Mining 1

What Motivated Data Mining?
• Natural evolution of information technology
• Wide availability of huge amounts of data
• Imminent need for turning data into useful information

What is Data Mining?

“Data mining refers to extracting or ‘mining’ knowledge from large amounts of data” [1]

The term “data mining” is a misnomer; a more appropriate term is “knowledge mining”.

Data mining = Knowledge Discovery from Data (KDD)

Data mining is the core of the KDD process:

Cleaning → Preprocessing → Transformation → Data mining → Evaluation

Data mining as a confluence of multiple disciplines

[Figure: data mining at the intersection of several disciplines; surviving labels: Algorithms, Visualisation]

Goals of Data Mining

• Predictions: earthquakes, sales volumes
• Identification: security and crime detection, mining gene expression data for drug discovery
• Classification
• Optimisation: time optimisation, space optimisation, sales maximisation

Data Mining Techniques


Classification Vs Prediction

Target attributes:
• Categorical/Discrete → Classification
• Numerical/Continuous → Prediction

Classification examples:
• Learn which loan applicants are “safe” and which are “risky” for the bank
• Analyze breast cancer data in order to predict which one of three specific treatments a patient should receive

Prediction example:
• A marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics

Classification process
Learning

Training data → Classification algorithm → Classification rules

• IF age = youth THEN loan_decision = risky
• IF income = high THEN loan_decision = safe
• IF age = middle_aged AND income = low THEN loan_decision = risky

Classification process
Classification

Test data → Classification rules

New data: (Dnnnn, Middle-age, Low), loan_decision = ?
Result: loan_decision = risky
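The way learned rules are applied to a new tuple can be sketched in a few lines of Python. This is only an illustration: the function name `classify_loan`, the first-match rule order, the lowercase value encoding and the `None` fallback when no rule fires are assumptions, not part of the course material.

```python
from typing import Optional


def classify_loan(age: str, income: str) -> Optional[str]:
    """Apply the three classification rules in the order they are listed."""
    if age == "youth":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle_aged" and income == "low":
        return "risky"
    return None  # assumption: no rule fires, so no decision is made


# New tuple from the slide: (Dnnnn, Middle-age, Low)
decision = classify_loan("middle_aged", "low")
print(decision)  # risky
```

With these rules the new applicant matches the third rule, which gives the decision shown on the slide.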

Data Cleaning and Preprocessing

Real-world data: missing values, noisy, inconsistent

Handling missing values:
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant (“Unknown” or −∞)
• Use the attribute mean
• Use the most probable value
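Three of the missing-value strategies listed above can be sketched as follows. The `incomes` list is hypothetical data and missing entries are assumed to be encoded as `None`. The "most probable value" strategy needs a predictive model (for example regression or a decision tree), so it is not shown here.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value (assumption).
incomes = [30_000, None, 45_000, 60_000, None]

# Strategy 1: ignore the tuple (drop entries with a missing value).
kept = [v for v in incomes if v is not None]

# Strategy 2: use a global constant.
with_constant = ["Unknown" if v is None else v for v in incomes]

# Strategy 3: use the attribute mean, computed on the known values only.
attribute_mean = mean(kept)
with_mean = [attribute_mean if v is None else v for v in incomes]

print(kept)
print(with_constant)
print(with_mean)
```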

Data Cleaning and Preprocessing

Real-world data: missing values, noisy, inconsistent

Noise is a random error or variance in a measured variable.

Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Irregular data → Smoothing (remove irregularities) → Regular data

Smoothing techniques: binning, regression, clustering

Data Cleaning and Preprocessing
Binning

Sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
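The binning example can be reproduced with a minimal Python sketch. The function names are illustrative. The sketch assumes that the number of values is divisible by the number of bins, and that a value equidistant from both boundaries goes to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins  # assumption: divides evenly
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value of a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max of the bin)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```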

Data Cleaning and Preprocessing
Regression

Regression is a set of statistical methods for estimating the relationships among variables.

Linear regression quantifies the relationship between one or more predictor variables (independent or explanatory variables) and one outcome variable (dependent variable).
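As an illustration of smoothing by regression, the sketch below fits a straight line with ordinary least squares on one predictor and replaces each noisy value with the fitted one. The data points and the name `fit_line` are hypothetical, and simple (one-variable) linear regression is an assumption made to keep the example short.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values lying on the fitted line
print(slope, intercept)
print(smoothed)
```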

Data Cleaning and Preprocessing
Clustering

Similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.

Data Transformation
Data are transformed or consolidated into forms appropriate for mining.

Techniques: smoothing, aggregation, generalisation, normalisation, attribute construction

Data Transformation
Aggregation

• The daily sales data may be aggregated so as to compute monthly and annual total amounts.
• Data cubes store multidimensional aggregated information.
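A minimal sketch of the aggregation described above, using only the Python standard library. The `daily_sales` records are hypothetical, and the choice of a (year, month) key for monthly totals is an assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, amount).
daily_sales = [
    (date(2019, 1, 5), 120.0),
    (date(2019, 1, 20), 80.0),
    (date(2019, 2, 3), 200.0),
    (date(2020, 1, 7), 50.0),
]

monthly = defaultdict(float)  # key: (year, month)
annual = defaultdict(float)   # key: year
for day, amount in daily_sales:
    monthly[(day.year, day.month)] += amount
    annual[day.year] += amount

print(dict(monthly))
print(dict(annual))
```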

Data Transformation
Generalisation

Low-level data → Generalisation → Higher-level concepts

• Categorical attributes, like street, can be generalized to higher-level concepts, like city or country
• Values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior
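Generalisation amounts to mapping each low-level value to a higher-level concept. In the sketch below the age cut-offs (30 and 60) and the street-to-city table are assumptions chosen for illustration; the slides do not specify them.

```python
def generalize_age(age: int) -> str:
    """Map a numerical age to a higher-level concept (cut-offs are assumptions)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


# Hypothetical concept hierarchy for a categorical attribute: street -> city.
STREET_TO_CITY = {
    "Main Street": "Springfield",
    "Station Road": "Springfield",
    "Harbour Lane": "Portville",
}

print([generalize_age(a) for a in (25, 45, 70)])
print(STREET_TO_CITY["Station Road"])
```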
Data Transformation
Normalisation

The attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

• Min-max normalisation
Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]

• z-score normalisation
Let Ā = 54,000 and σA = 16,000 for the attribute income. A value of $73,600 for income is transformed to:
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
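Both normalisation formulas can be checked with a short Python sketch. The function and parameter names are illustrative assumptions. The income value $73,600 is used for both methods, since it is the value that yields 0.716 under min-max normalisation.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalisation of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """z-score normalisation: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```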
Data Transformation
Attribute construction

New attributes are constructed from the given set of attributes and added to help the mining process.

For example, we may wish to add the attribute area based on the attributes height and width.

Attribute construction can help discover missing information about the relationships between attributes.
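The area example can be written as a one-line derivation per record. The `records` list is hypothetical data, and representing each record as a dictionary is an assumption.

```python
# Hypothetical records with the original attributes height and width.
records = [
    {"height": 2.0, "width": 3.0},
    {"height": 1.5, "width": 4.0},
]

# Construct the new attribute "area" from the existing attributes.
for record in records:
    record["area"] = record["height"] * record["width"]

print(records)
```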

Attributes Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.

For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music_taste.

• Speeds up the mining process by removing irrelevant or redundant attributes
• Reduces the number of attributes appearing in the discovered patterns
• Makes the patterns easier to understand

Data Mining Techniques

Supervised: Classification
• Data are labeled with pre-defined classes
• Test data are classified into these classes

Unsupervised: Clustering
• Class labels are unknown
• Establish the existence of classes (clusters) in the data

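Performance Measure
Confusion Matrix

With TP, FP, TN and FN the numbers of true positives, false positives, true negatives and false negatives:

\[ \text{Error Rate} = \frac{FP + FN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{FP Rate} = \frac{FP}{FP + TN} \]
\[ \text{Specificity} = 1 - \text{FP Rate} = \frac{TN}{TN + FP} \]
\[ \text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} \]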
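The measures derived from a binary confusion matrix can be computed as in the sketch below, which uses the standard definitions. The counts passed to `confusion_metrics` are hypothetical, and the function name is an assumption.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard measures derived from a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "fp_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),  # equals 1 - fp_rate
        "recall": tp / (tp + fn),       # also called sensitivity
    }


# Hypothetical counts.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```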
Splitting Dataset

Data set → Train | Validation | Test

Training Dataset: The sample of data used to fit the model.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyper-parameters.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
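A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the name `split_dataset` are assumptions; the slides do not prescribe any ratio.

```python
import random


def split_dataset(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle the samples, then cut them into train / validation / test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder is the test set
    )


train_set, val_set, test_set = split_dataset(range(10))
print(len(train_set), len(val_set), len(test_set))  # 6 2 2
```

Shuffling before cutting avoids a split that follows the original ordering of the data, which matters when the samples are sorted by class or by date.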

Overfitting

Overfitting: a model that is too specialized to the training set data and that will not generalize well.

Training data contain:
• Properties and generalisable correlations
• Fluctuations, random variations, noise and outliers

Learning the latter leads to overfitting: bad predictions on test data.

Overfitting

• Blue line : a prediction function

• Green points : Training Set data
• Red points : Testing Set data


Overfitting

Underfitting (high bias) | Appropriate fitting | Overfitting (high variance)

How to avoid Overfitting?

Gather more data

Data augmentation

Simplify the model

Early termination

L1 and L2 regularization

For Deep Learning : Dropout and Dropconnect


How to avoid Overfitting?

Gather more data
The more data you get, the less likely the model is to overfit. As more data are added, the model becomes unable to overfit all the samples and is forced to generalize.


How to avoid Overfitting?

Data augmentation
Collecting more data is a tedious and expensive process. New samples can instead be generated from the existing ones:
• Dilation, rotation, adding noise, …

How to avoid Overfitting?

Simplify the model
Reduce its complexity, for example:
• the number of estimators in a random forest
• the number of parameters in a neural network
The model becomes lighter, trains faster and runs faster.

How to avoid Overfitting?

Early termination
When the testing error (in practice, the error measured on the validation set) starts to increase, it’s time to stop!
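The stopping rule can be sketched as a loop over per-epoch error values. The error list is hypothetical, and the `patience` parameter (how many epochs without improvement are tolerated) is a common refinement that is assumed here, not something the slides define.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the best epoch seen before training is stopped."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical errors measured after each epoch.
errors = [0.50, 0.40, 0.35, 0.37, 0.41]
print(early_stopping_epoch(errors))  # 2
```

In a real training loop the model weights saved at the returned epoch would be the ones kept.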

How to avoid Overfitting?

L1 and L2 regularization
Add a penalty to the loss function.
• The L1 penalty: minimize the absolute magnitude of the weights
• The L2 penalty: minimize the squared magnitude of the weights

• The model is forced to make compromises on its weights, as it can no longer make them arbitrarily large.
• This makes the model more general, which helps combat overfitting.
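The two penalties can be written out explicitly. In this sketch the regularization strength `lam`, the weight values and the data loss are hypothetical, and the penalties are the usual sum of absolute values (L1) and sum of squares (L2) of the weights.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, penalty=l2_penalty):
    """Total loss = loss on the data + penalty on the weights."""
    return data_loss + penalty(weights, lam)


weights = [0.5, -2.0, 1.0]  # hypothetical model weights
print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
print(regularized_loss(1.0, weights, lam=0.1))
```

Because large weights increase the total loss, the optimizer has to trade a slightly worse fit on the training data for smaller weights.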

How to avoid Overfitting?

For Deep Learning: Dropout and Dropconnect
Randomly deactivate either neurons (dropout) or connections (dropconnect) during training.
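A minimal sketch of dropout on a list of activations. It uses the "inverted dropout" variant, in which the surviving activations are rescaled so that their expected value is unchanged; that variant, the function name and the sample values are assumptions made for illustration.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the others."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # seeded generator so that runs are repeatable
layer_output = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations
print(dropout(layer_output, drop_prob=0.5, rng=rng))
```

Dropconnect follows the same idea, except that the random mask is applied to individual weights (connections) rather than to the outputs of neurons.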

Data Mining
Supervised

Step 01: Learning
Training data → Learning algorithm → Model (the more generalisable, the better)

Step 02: Testing
Test data → Model → Evaluation

Data Mining
Supervised Learning


Data Mining
Unsupervised
• The data have no target attribute
• We want to explore the data to find some intrinsic structures in them

Methods
• Hierarchical: Hierarchical Cluster Analysis (agglomerative hierarchical clustering)
• Non-hierarchical: K-Means clustering (moving centers), self-organizing maps (Kohonen topological maps)

Data Mining
Unsupervised

Hierarchical method
• The optimal number of classes is determined by reading the tree
• Very expensive in computation time

Non-hierarchical method
• Allows the classification of huge data sets
• The number of classes must be imposed in advance

Data Mining
Hierarchical method


Data Mining
Hierarchical method

Principle
• Repeatedly combine the two nearest objects (or groups of objects)

Data structures
• Data matrix (object-by-variable): n objects (persons) × p variables (age, height, weight, gender, …)
• Dissimilarity matrix (object-by-object)

Data Mining
Hierarchical method

Classification criteria

• Similarity measurement between objects: Euclidean distance
\[ d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \]

• Similarity measurement between groups of objects
  • Single linkage (smallest distance between two objects, one from each group)
  • Complete linkage (largest distance between two objects, one from each group)
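The distance and linkage criteria, together with the "repeatedly combine the two nearest" principle, can be sketched as a naive agglomerative procedure. The sample points, the function names and the `n_clusters` stopping condition are assumptions. The sketch recomputes all pairwise linkages at every step, so it is only suitable for very small data sets.

```python
from itertools import product
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two objects described by p variables."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(g1, g2):
    """Smallest distance between an object of g1 and an object of g2."""
    return min(euclidean(a, b) for a, b in product(g1, g2))


def complete_linkage(g1, g2):
    """Largest distance between an object of g1 and an object of g2."""
    return max(euclidean(a, b) for a, b in product(g1, g2))


def agglomerate(points, linkage=single_linkage, n_clusters=1):
    """Repeatedly merge the two nearest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]  # hypothetical objects
print(agglomerate(points, n_clusters=2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```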

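Data Mining
Hierarchical method

How good is a partition?

Two criteria:
• The intra-cluster distance of each cluster is minimal
• The inter-cluster distance between clusters is maximal

Intra-cluster distance = SSE (sum of squared errors) within clusters = intra inertia:
\[ I_{\text{intra}} = \sum_{q=1}^{K} \sum_{i \in C_q} \lVert x_i - \bar{x}_q \rVert^2 \]
where \bar{x}_q is the mean of cluster C_q and K is the number of clusters.

Inter-cluster distance = SSE (sum of squared errors) between clusters = inter inertia:
\[ I_{\text{inter}} = \sum_{q=1}^{K} n_q \lVert \bar{x}_q - \bar{x} \rVert^2 \]
where \bar{x} is the mean of all objects and n_q is the number of objects in cluster C_q.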

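Data Mining
Hierarchical method

How good is a partition?

Huygens theorem: total inertia = intra inertia + inter inertia
\[ \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 = \sum_{q=1}^{K} \sum_{i \in C_q} \lVert x_i - \bar{x}_q \rVert^2 + \sum_{q=1}^{K} n_q \lVert \bar{x}_q - \bar{x} \rVert^2 \]
where \bar{x} is the mean of all n objects, \bar{x}_q is the mean of cluster C_q and n_q is its number of objects.

A single criterion: the total inertia is fixed by the data, so minimizing the intra inertia is equivalent to maximizing the inter inertia.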
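The decomposition of the total inertia into an intra-cluster part and an inter-cluster part can be checked numerically. In the sketch below the two clusters are hypothetical data, the helper names are illustrative, and the inertias are taken as unnormalised sums of squared Euclidean distances, which is an assumption about the convention used.

```python
def mean_point(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(clusters):
    """Return (total, intra, inter) inertia of a partition."""
    all_points = [p for cluster in clusters for p in cluster]
    g = mean_point(all_points)  # mean of all objects
    total = sum(sq_dist(p, g) for p in all_points)
    intra = sum(sq_dist(p, mean_point(c)) for c in clusters for p in c)
    inter = sum(len(c) * sq_dist(mean_point(c), g) for c in clusters)
    return total, intra, inter


# Hypothetical partition into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
total, intra, inter = inertias(clusters)
print(total, intra, inter)  # 20.0 4.0 16.0
```

For this partition the total inertia (20) equals the intra inertia (4) plus the inter inertia (16), as the theorem states.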

References

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc. (2006).
