Université d’Alger 1
Benyoucef Benkhedda
Data Mining
Dr. BOUFENAR Chaouki
Master 1
Ingénierie des Systèmes
Informatiques Intelligents
2018/2019
18/05/2019 Cours de Data Mining 1
What Motivated Data Mining?
Natural evolution of information technology
Wide availability of huge amounts of data
Imminent need for turning data into useful information
What is Data Mining?
“Data mining refers to extracting or “mining” knowledge from large amounts of data” [1]
The term “data mining” is a misnomer: a more appropriate term is “knowledge mining”.
Data mining is often used as a synonym for Knowledge Discovery from Data (KDD).
Data mining is the core step of the KDD process:
Cleaning → Preprocessing → Transformation → Data mining → Evaluation
Data mining as a confluence of multiple disciplines
[Figure: Data Mining at the centre of several contributing disciplines, including Algorithms and Visualisation]
Goals of Data Mining
• Prediction: earthquakes, sales volumes
• Identification: security and crime detection, mining gene expression data for drug discovery
• Classification
• Optimisation: time optimisation, space optimisation, sales maximisation
Data Mining Techniques
Classification vs. Prediction
Target attribute:
• Categorical/discrete → Classification
• Numerical/continuous → Prediction
Classification examples:
• learn which loan applicants are “safe” and which are “risky” for the bank
• analyze breast cancer data in order to predict which one of three specific treatments a patient should receive
Prediction example:
• a marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics
Classification process
Learning step
Training data → Classification algorithm → Classification rules
Example classification rules:
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky
Classification process
Classification step
Test data → Classification rules
New data: (Dnnnn, middle_aged, low) → loan_decision = ? → risky
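The classification step can be sketched in a few lines of Python. This is only an illustration of applying the three example rules to a new tuple. The function name `classify_loan` and the "first matching rule wins" ordering are assumptions, since the slides do not say how overlapping rules are resolved.

```python
def classify_loan(age, income):
    """Apply the three example rules in the order listed (first match wins).

    The rule order is an assumption: the slides do not specify a priority.
    """
    if age == "youth":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle_aged" and income == "low":
        return "risky"
    return None  # no rule covers this tuple


# New data tuple from the slide: (Dnnnn, middle_aged, low)
decision = classify_loan("middle_aged", "low")
print(decision)  # risky
```

A tuple that matches no rule returns `None` here; a real rule-based classifier would usually fall back to a default class.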
Data Cleaning and Preprocessing
Real-world data tend to have missing values and to be noisy and inconsistent.
Handling missing values:
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant (“Unknown” or −∞)
• Use the attribute mean
• Use the most probable value
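As an illustration of the "use the attribute mean" strategy, here is a minimal sketch. It assumes missing values are represented by `None`. The function name `fill_with_mean` and the income figures are made up for the example.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


# Hypothetical income attribute with two missing values
incomes = [30000, None, 50000, 40000, None]
filled = fill_with_mean(incomes)
print(filled)
```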
Noisy data
Noise is a random error or variance in a measured variable.
Example
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Smoothing removes irregularities, turning irregular data into regular data.
Smoothing techniques: binning, regression, clustering
Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
Smoothing by bin means:
  Bin 1: 9, 9, 9
  Bin 2: 22, 22, 22
  Bin 3: 29, 29, 29
Smoothing by bin boundaries:
  Bin 1: 4, 4, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 25, 34
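The binning example can be reproduced with a short sketch. The helper names are illustrative. The code assumes the number of values is divisible by the number of bins, and that a value equally distant from both boundaries goes to the lower one (the slides do not cover ties).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (length must be divisible)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max of the bin)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```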
Regression
Regression is a set of statistical methods for estimating the relationships among variables
Linear regression quantifies the relationship between one or more predictor variables
(independent or explanatory variables) and one outcome variable (dependent variable)
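To illustrate smoothing by regression, the sketch below fits a straight line with ordinary least squares for a single predictor and replaces each observation by its fitted value. The function name `fit_line` and the data points are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)

# Smoothing: each observed y is replaced by the value on the fitted line
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
```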
Clustering
Similar values are organized into groups, or clusters. Values that fall outside of the set of clusters may be considered outliers.
Data Transformation
Data are transformed or consolidated into forms appropriate for mining.
• Smoothing
• Aggregation
• Generalisation
• Normalisation
• Attribute construction
Aggregation
The daily sales data may be aggregated so as to compute monthly and annual total amounts.
Data cubes store multidimensional aggregated information.
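A minimal sketch of aggregating daily sales into monthly and annual totals follows. The sales records, the ISO date strings and the function name `aggregate` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical daily sales: (ISO date, amount)
daily_sales = [
    ("2019-01-03", 120.0),
    ("2019-01-17", 80.0),
    ("2019-02-05", 200.0),
    ("2019-02-20", 50.0),
]


def aggregate(sales, key_length):
    """Sum amounts per date prefix: 7 characters give 'YYYY-MM', 4 give 'YYYY'."""
    totals = defaultdict(float)
    for date, amount in sales:
        totals[date[:key_length]] += amount
    return dict(totals)


monthly = aggregate(daily_sales, 7)
annual = aggregate(daily_sales, 4)
print(monthly)  # {'2019-01': 200.0, '2019-02': 250.0}
print(annual)   # {'2019': 450.0}
```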
Generalisation
Low-level data are replaced by higher-level concepts.
• Categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
• Values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
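Generalising a numerical attribute amounts to mapping value ranges to concept labels. In the sketch below the cut-off ages (30 and 60) are purely illustrative assumptions, since the slides give no thresholds.

```python
def generalize_age(age):
    """Map a numerical age to a higher-level concept.

    The cut-offs 30 and 60 are illustrative assumptions, not values from the course.
    """
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


ages = [22, 45, 67]
print([generalize_age(a) for a in ages])  # ['youth', 'middle_aged', 'senior']
```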
Normalisation
The attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
Min-max normalisation
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
• Then $73,600 is mapped to:
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
z-score normalization
\[ v' = \frac{v - \bar{A}}{\sigma_A} \]
Let Ā = 54,000 and σA = 16,000 for the attribute income.
• A value of $73,600 for income is transformed to:
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
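Both normalisations are one-line formulas, so they are easy to check numerically. The sketch below evaluates them for an income of $73,600 with the range, mean and standard deviation given above. The function names are illustrative.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalisation of v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """z-score normalisation: distance from the mean in standard deviations."""
    return (v - mean) / std


print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
```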
Attribute construction
New attributes are constructed from the given set of attributes and added to help the mining process.
For example, we may wish to add the attribute area based on the attributes height and width.
Attribute construction can also help discover missing information about the relationships between attributes.
Attributes Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music_taste.
Removing the irrelevant or redundant attributes:
• Speeds up the mining process
• Reduces the number of attributes appearing in the discovered patterns
• Makes the patterns easier to understand
Data Mining Techniques
Supervised: Classification
o Data are labeled with pre-defined classes
o Test data are classified into these classes
Unsupervised: Clustering
o Class labels are unknown
o Establish the existence of classes (clusters) in the data
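Performance Measure
Confusion Matrix
With TP, FP, TN and FN the numbers of true positives, false positives, true negatives and false negatives:
\[ \text{Error Rate} = \frac{FP + FN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{FP Rate} = \frac{FP}{FP + TN} \]
\[ \text{Specificity} = 1 - \text{FP Rate} = \frac{TN}{TN + FP} \]
\[ \text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} \]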
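The measures listed on this slide follow directly from the four cells of a binary confusion matrix. The sketch below uses their standard definitions. The function name and the counts are made up for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures derived from a binary confusion matrix."""
    total = tp + fp + tn + fn
    fp_rate = fp / (fp + tn)
    return {
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "fp_rate": fp_rate,
        "specificity": 1 - fp_rate,
        "recall": tp / (tp + fn),  # also called sensitivity
    }


# Hypothetical counts
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```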
Splitting the Dataset
Data set → Train / Validation / Test
Training Dataset : The sample of data used to fit the model.
Validation Dataset: The sample of data used to provide an unbiased evaluation of
a model fit on the training dataset while tuning model hyper-parameters.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
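A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions chosen for the example, since the slides prescribe no ratios.

```python
import random


def split_dataset(samples, train_ratio=0.6, val_ratio=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test parts.

    The 60/20/20 default is an illustrative assumption; the test part
    receives whatever remains after the first two cuts.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before cutting matters when the data are stored in a meaningful order (for example sorted by class), otherwise the three parts would not be comparable.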
Overfitting
Overfitting: a model that is too specialized to the training set data and that will not generalize well.
The training data contain:
• Generalisable properties and correlations
• Fluctuations, random variations, noise and outliers
Learning the second kind leads to overfitting, and so to bad predictions on test data.
• Blue line : a prediction function
• Green points : Training Set data
• Red points : Testing Set data
[Figure: three fits, from left to right: underfitting (high bias), appropriate fitting, overfitting (high variance)]
How to avoid Overfitting?
• Gather more data: the more data you get, the less likely the model is to overfit.
• Data augmentation: collecting more data is a tedious and expensive process, so new samples are derived from the existing ones instead (dilation, rotation, adding noise, …).
• Simplify the model: reduce its complexity (number of estimators in a random forest, number of parameters in a neural network) so that the model becomes unable to overfit all the samples. The model is also lighter, trains faster and runs faster.
• Early termination: stopping the training early forces the model to generalize. When the testing error starts to increase, it’s time to stop!
• L1 and L2 regularization: add a penalty to the loss function. The L1 penalty minimizes the absolute magnitude of the weights; the L2 penalty minimizes their squared magnitude. The model is forced to make compromises on its weights, as it can no longer make them arbitrarily large. This makes the model more general, which helps combat overfitting.
• For Deep Learning, Dropout and DropConnect: randomly deactivate either neurons (dropout) or connections (DropConnect) during training.
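To make the regularization idea concrete, here is a minimal sketch of a penalized loss. It assumes a linear model without bias term and a mean squared error loss. The function names and the strength `lam` are illustrative choices, not part of the course material.

```python
def mse(weights, xs, ys):
    """Mean squared error of a linear model without bias: prediction = sum(w_j * x_j)."""
    errors = [sum(w * x for w, x in zip(weights, row)) - y for row, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def l1_penalty(weights):
    """Sum of absolute weight values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weight values."""
    return sum(w * w for w in weights)


def regularized_loss(weights, xs, ys, lam=0.1, kind="l2"):
    """Data loss plus lam times the chosen penalty (lam is a hyper-parameter)."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return mse(weights, xs, ys) + lam * penalty


# Hypothetical data fitted perfectly by large weights: only the penalty remains
xs = [[1, 0], [0, 1]]
ys = [3, -4]
weights = [3, -4]
print(regularized_loss(weights, xs, ys, kind="l1"))
print(regularized_loss(weights, xs, ys, kind="l2"))
```

Even with zero data error the loss stays positive, which is what pushes an optimizer towards smaller weights.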
Data Mining
Supervised
Step 01: Learning
Training data → Learning algorithm → Model (better generalisable)
Step 02: Testing
Test data → Model → Evaluation
[Figure: Supervised Learning]
Data Mining
Unsupervised
The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
Clustering methods
• Hierarchical: Hierarchical Cluster Analysis (agglomerative hierarchical clustering)
• Non-hierarchical: K-Means (moving centres), self-organizing maps (Kohonen topological maps)
Hierarchical method
• The optimal number of classes is determined by reading the tree
• Very expensive in computation time
Non-hierarchical method
• Allows the classification of very large data sets
• The number of classes must be fixed in advance
Data Mining
Hierarchical method
Principle
Repeatedly combine the two nearest objects (or groups of objects).
Data structures
• Data matrix (object-by-variable): n objects (persons) × p variables (age, height, weight, gender, …)
• Dissimilarity matrix (object-by-object): n × n
Classification criteria
Similarity measurement between objects: Euclidean distance
\[ d(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \]
Similarity measurement between groups of objects
• Single linkage (smallest distance between an object of one group and an object of the other)
• Complete linkage (largest distance between an object of one group and an object of the other)
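The criteria above are enough for a naive agglomerative procedure: start with one cluster per object and repeatedly merge the two closest clusters. The sketch below is a brute-force illustration under that reading. The function names, the sample points and the stopping rule (a target number of clusters) are assumptions for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two objects described by p numerical variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(euclidean(a, b) for a in c1 for b in c2)


def complete_linkage(c1, c2):
    """Largest distance between a member of c1 and a member of c2."""
    return max(euclidean(a, b) for a in c1 for b in c2)


def agglomerate(points, n_clusters, linkage=single_linkage):
    """Merge the two nearest clusters until n_clusters remain (brute force)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters


group_a = [(0.0, 0.0), (0.0, 1.0)]
group_b = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(group_a, group_b))    # 3.0
print(complete_linkage(group_a, group_b))  # sqrt(17), about 4.123
print(agglomerate(group_a + group_b, 2))
```

Every merge recomputes all pairwise linkages, which reflects why hierarchical methods are expensive in computation time on large data sets.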
How good is a partition?
Two criteria:
• The intra-cluster distance of each cluster is minimal
• The inter-cluster distance between clusters is maximal
Intra-cluster distance = intra-class SSE (sum of squared errors) = intra inertia
\[ I_{\text{intra}} = \sum_{q} \sum_{x_i \in C_q} \lVert x_i - g_q \rVert^2 \]
\(g_q\): the mean in cluster \(C_q\)
Inter-cluster distance = inter-class SSE (sum of squared errors) = inter inertia
\[ I_{\text{inter}} = \sum_{q} n_q \lVert g_q - g \rVert^2 \]
\(g\): the mean of all objects, \(n_q\): the number of objects in cluster \(C_q\)
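Huygens' theorem
\[ \sum_{i} \lVert x_i - g \rVert^2 = \sum_{q} \sum_{x_i \in C_q} \lVert x_i - g_q \rVert^2 + \sum_{q} n_q \lVert g_q - g \rVert^2 \]
where \(g\) is the mean of all objects, \(g_q\) the mean of cluster \(C_q\) and \(n_q\) its number of objects: total inertia = intra inertia + inter inertia.
One criterion: the total inertia does not depend on the partition, so minimising the intra inertia is equivalent to maximising the inter inertia.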
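The decomposition of the total inertia into intra and inter inertia can be checked numerically. The sketch below does so on a small made-up partition. The function names and the points are illustrative only.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(clusters):
    """Return (total, intra, inter) inertia of a partition given as a list of clusters."""
    all_points = [p for c in clusters for p in c]
    g = mean_point(all_points)
    total = sum(sq_dist(p, g) for p in all_points)
    intra = sum(sq_dist(p, mean_point(c)) for c in clusters for p in c)
    inter = sum(len(c) * sq_dist(mean_point(c), g) for c in clusters)
    return total, intra, inter


# Hypothetical partition of four 2-D objects into two clusters
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (6.0, 0.0)]]
total, intra, inter = inertias(clusters)
print(total, intra, inter)  # 30.0 4.0 26.0
```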
References
[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc. (2006).