Analysis and Management
of Production System
Lesson 17: Data analysis
Prof. Giulia Bruno
Department of Management and
Production Engineering
[email protected]
Dataset
A dataset is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
A collection of attributes describes an object (rows = objects, columns = attributes):

ID | COLOR  | AGE | WEIGHT | ON
 1 | Orange | 24  | 70.24  | Yes
 2 | Blue   | 12  | 56.45  | Yes
 3 | Red    | 58  | 67.23  | Yes
 4 | Orange | 43  | 62.50  | Yes
 5 | Orange | 18  | 37.47  | No
 6 | Blue   | 19  | 81.35  | No
 7 | Green  | 62  | 44.45  | Yes
 8 | Orange | 33  | 23.34  | No
 9 | Green  | 20  | 26.35  | No
10 | Red    | 47  | 57.89  | Yes
11 | Orange | 39  | 52.98  | No
12 | Green  | 30  | 87.43  | Yes
13 | Blue   | 29  | 77.79  | No
Types of data

TYPE     | Examples                                    | Properties satisfied
NOMINAL  | ID numbers, eye color, gender               | 1 (distinctness: =, ≠)
ORDINAL  | rankings, hardness of minerals, grades      | 1, 2 (+ order: <, >)
INTERVAL | calendar dates, temperatures in Celsius, …  | 1, 2, 3 (+ differences: +, −)
RATIO    | temperature in Kelvin, length, counts, …    | 1, 2, 3, 4 (+ ratios: ∗, /)
Record Data
● Data matrix
  Same fixed set of numeric attributes; data objects can be thought of as points in
  a multi-dimensional space, where each dimension represents a distinct attribute.

  Projection of x Load | Projection of y Load | Distance | Load | Thickness
  10.23                | 5.27                 | 15.22    | 2.7  | 1.2
  12.65                | 6.25                 | 16.22    | 2.2  | 1.1

● Document data
  Each term is a component (attribute) of the vector; the value of each component
  is the number of times the corresponding term occurs in the document.

             | team | coach | play | ball | score | game | win | lost | timeout | season
  Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
  Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
  Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0

● Transaction data
  Each transaction involves a set of items (e.g., the set of products purchased by
  a customer during a shopping session).

  TID | Items
  1   | Bread, Coke, Milk
  2   | Beer, Bread
  3   | Beer, Coke, Diaper, Milk
  4   | Beer, Bread, Diaper, Milk
  5   | Coke, Diaper, Milk
Graph Data
Examples: generic graphs, molecules, webpages
(Figures: a small labeled graph and the benzene molecule, C6H6.)
Ordered Data
● Sequences of data (e.g., a genomic sequence):
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
● Sequences of transactions
● Spatio-temporal data (e.g., the average monthly temperature of land and ocean)
Important characteristics of data
● Dimensionality (number of attributes)
High dimensional data brings a number of challenges
● Sparsity
Only presence counts
● Resolution
Patterns depend on the scale
● Size
Type of analysis may depend on size of data
From data to knowledge (KDD)
Data preprocessing/cleaning
The representation and quality of the data must be addressed before running any
analysis: if there is much irrelevant or redundant information, or the data are
noisy and unreliable, knowledge discovery becomes more difficult.
EXAMPLES OF DATA QUALITY PROBLEMS
● Noise and outliers
● Missing values
● Duplicate data
● Wrong data
Noise and outliers
Outliers are data objects with characteristics that are considerably different from
those of most other data objects in the data set.
● Case 1: outliers are noise that interferes with data analysis
● Case 2: outliers are the goal of the analysis
A box plot is a useful graphical method for describing the behavior of the data and
identifying outliers.
Examples of applications: detection of credit card fraud (anomalous sets of
purchases), detection of unusual patterns in medical diagnosis.
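A box plot flags as outliers the points that fall outside the whiskers, conventionally 1.5 × IQR beyond the quartiles. A minimal Python sketch of this rule (the values are loosely based on the WEIGHT column above, with an artificial outlier appended):

```python
import numpy as np

def iqr_outliers(values):
    """Flag values outside the 1.5*IQR whiskers of a box plot."""
    q1, q3 = np.percentile(values, [25, 75])    # first and third quartiles
    iqr = q3 - q1                               # interquartile range
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
    return [v for v in values if v < low or v > high]

weights = [70.24, 56.45, 67.23, 62.50, 37.47, 81.35, 44.45, 23.34, 200.0]
print(iqr_outliers(weights))  # -> [200.0]
```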
Missing values
Reasons for missing values
● Information is not collected
● Attributes may not be applicable to all cases
Handling missing values
● Eliminate data objects or attributes
● Estimate missing values
● Ignore the missing values during analysis
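The three strategies above map directly onto standard pandas operations; a minimal sketch (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [24, np.nan, 58], "weight": [70.24, 56.45, np.nan]})

dropped = df.dropna()             # eliminate objects with missing values
estimated = df.fillna(df.mean())  # estimate: impute with the column mean
mean_age = df["age"].mean()       # ignore: pandas skips NaN in aggregations
```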
Duplicate data
A data set may contain data objects that are duplicates, or almost duplicates, of
one another (for example, the same person appearing with multiple email addresses).
This can happen when data from heterogeneous sources are merged.
When is a data object considered a duplicate?
When its similarity to another object is equal to 1.
Similarity
● Numerical measure of how alike two data objects are
● It is higher when objects are more alike
● Often falls in the range [0, 1]
Different distance definitions can be used to express similarity, depending on the
attribute type:

ATTRIBUTE TYPE    | DISTANCE (DISSIMILARITY)
Nominal           | d(x, y) = 0 if x = y, 1 if x ≠ y
Ordinal           | d(x, y) = |x − y| / (n − 1)
                  | (values mapped to the integers 0 to n−1, where n is the number of values)
Interval or Ratio | d(x, y) = |x − y| for a single attribute,
                  | d(x, y) = √( Σₖ (xₖ − yₖ)² ) (Euclidean distance) over n attributes

Similarity: s = 1 − d
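A minimal Python sketch of these distance and similarity definitions (the function names are my own):

```python
import math

def d_nominal(x, y):
    """0 if equal, 1 otherwise."""
    return 0 if x == y else 1

def d_ordinal(x, y, n):
    """x, y already mapped to integers 0..n-1."""
    return abs(x - y) / (n - 1)

def d_euclidean(x, y):
    """Euclidean distance between two points with the same attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Similarity from a distance in [0, 1]: s = 1 - d
print(d_nominal("Orange", "Blue"))                # 1
print(d_ordinal(0, 2, n=4))                       # 0.666...
print(d_euclidean((10.23, 5.27), (12.65, 6.25)))  # ~2.61
```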
Wrong data
The value of a certain attribute can depend on the values of some others, so it is
possible to check whether the dependence is satisfied.
Example: the Italian fiscal code, which depends on the holder's personal data.
Using the date of birth (encoded as the last two digits of the birth year, a letter
representing the month, and two digits for the day), we can easily find out which of
these three records are correct:
CSTGPP73A25I452U 25/01/1973
NDDPRI82E30A859Z 31/05/1982
CSTGPP00A01G732I 01/01/2000
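A minimal sketch of such a dependency check on the date-of-birth part of the code (simplified: it uses the standard month-letter table but ignores the +40 day offset used in codes issued to women):

```python
# Month letters used in the Italian fiscal code
MONTHS = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "H": 6,
          "L": 7, "M": 8, "P": 9, "R": 10, "S": 11, "T": 12}

def birth_date_matches(code, day, month, year):
    """Check characters 7-11 of the code against the declared date of birth."""
    return (int(code[6:8]) == year % 100
            and MONTHS.get(code[8]) == month
            and int(code[9:11]) == day)

print(birth_date_matches("CSTGPP73A25I452U", 25, 1, 1973))  # True
print(birth_date_matches("NDDPRI82E30A859Z", 31, 5, 1982))  # False: code says day 30
print(birth_date_matches("CSTGPP00A01G732I", 1, 1, 2000))   # True
```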
Data Transformation
Techniques for data transformation
● Aggregation
● Sampling
● Dimensionality Reduction
  • Feature selection
  • Feature creation
  • Mapping data to new space
● Discretization and Binarization
● Attribute Transformation
Aggregation
● Combining two or more attributes (or objects) into a
single attribute (or object)
● Purpose
Data reduction
• Reduce the number of attributes or objects
Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
More “stable” data
• Aggregated data tends to have less variability
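A minimal pandas sketch of a change-of-scale aggregation, rolling daily records up into months (the column names and values are invented):

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Days aggregated into months: fewer objects, more stable values
monthly = daily.resample("MS", on="date")["sales"].agg(["sum", "mean"])
print(monthly)
```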
Sampling
● Main technique used for data reduction
considering the entire set of data of interest can be too
expensive or time consuming
● Key principle: using a sample will work almost as well
as using the entire data set, if the sample is
representative
a sample is representative if it has approximately the same
properties as the original set of data
(Figure: the same point set drawn with 8000, 2000, and 500 sampled points.)
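A minimal pandas sketch of simple random sampling (the data are invented; a fixed seed makes the sample reproducible):

```python
import pandas as pd

df = pd.DataFrame({"x": range(8000)})

# Simple random sample without replacement
sample = df.sample(n=500, random_state=42)
print(len(sample))  # 500
```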
Dimensionality Reduction
Reduction of the dimensionality of the dataset, i.e., the number of
attributes
reduce amount of time and memory required by algorithms
allow data to be more easily visualized
may help to eliminate irrelevant features or reduce noise
● Techniques
Feature selection: remove redundant features (e.g., the purchase price of a product
carries the same information as the amount of sales tax paid) or irrelevant features
(e.g., a student's ID is irrelevant to the task of predicting the student's marks)
Feature creation: create new attributes that can capture the
important information in a data set much more efficiently than the
original attributes
• Feature extraction (e.g., extracting edges from images)
• Feature construction (e.g., dividing mass by volume to get density)
• Mapping data to new space (e.g., Fourier and wavelet analysis)
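As one concrete example of feature extraction by mapping data to a new space, here is a minimal principal component analysis (PCA) sketch with scikit-learn; PCA is a standard technique, though it is not named on the slide (synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 objects, 5 numeric attributes

# Project onto the 2 directions of maximum variance
Z = PCA(n_components=2).fit_transform(X)
print(Z.shape)                  # (100, 2)
```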
Feature selection vs. Feature extraction
FEATURE SELECTION
(X1, …, Xp) → (Xk1, …, Xkm): keep a subset of m of the original p features
FEATURE EXTRACTION
(X1, …, Xp) → (Z1, …, Zm), where Z1 = f1(X1, …, Xp), …, Zm = fm(X1, …, Xp):
compute m new features as functions of the original ones
Mapping data to a new space
(Figure: the same signal shown in the time domain and in the frequency domain.)
Discretization
● Process of converting a continuous attribute into an ordinal
attribute
A potentially infinite number of values are mapped into a
small number of categories
Discretization is commonly used in classification
Many classification algorithms work best if both the
independent and dependent variables have only a few values
Binarization
● Binarization maps a continuous or categorical attribute into
one or more binary variables
● Typically used for association analysis
● A continuous attribute is often first converted to a categorical attribute, and
  the categorical attribute is then converted to a set of binary attributes
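A minimal pandas sketch of both steps, first discretizing a continuous age attribute into categories and then binarizing the categories (the bin edges and labels are arbitrary):

```python
import pandas as pd

ages = pd.Series([24, 12, 58, 43, 18])

# Discretization: continuous -> ordinal categories
age_group = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])

# Binarization: one binary attribute per category (one-hot encoding)
binary = pd.get_dummies(age_group)
print(binary)
```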
Attribute Transformation
● An attribute transformation is a function that maps
the entire set of values of a given attribute to a new set
of replacement values such that each old value can be
identified with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Normalization
• Refers to various techniques to adjust for differences among attributes in terms
  of frequency of occurrence, mean, variance, range
• Takes out unwanted, common signal, e.g., seasonality
In statistics, standardization refers to subtracting the mean and dividing by the
standard deviation
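A minimal sketch of standardization, z = (x − mean) / std, using scikit-learn (the same result can be computed with plain NumPy):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[70.24], [56.45], [67.23], [37.47]])

# Standardization: zero mean, unit standard deviation per attribute
Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0), Z.std(axis=0))  # ~[0.] [1.]
```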
Data mining
Predictive modeling – supervised learning
● Regression: Predict a value of a given continuous
valued variable based on the values of other
variables, assuming a linear or nonlinear model of
dependency
● Classification: Find a model for class attribute as a
function of the values of other attributes
Clustering – unsupervised learning
● Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups
(Figure: well-separated groups of points. Intra-cluster distances are minimized;
inter-cluster distances are maximized.)
Association Rule
● Given a set of records, each of which contains some number of items from a given
  collection, produce dependency rules that predict the occurrence of an item based
  on occurrences of other items.

TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
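A minimal Python sketch computing the support and confidence of the two discovered rules over these five transactions (the rule-mining algorithm itself, e.g. Apriori, is not shown):

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """How often the rule lhs -> rhs holds when lhs is present."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.666...
```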
Deviation / Anomaly / Change Detection
• Detect significant deviations from normal
behavior
• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
• Identify anomalous behavior from sensor
networks for monitoring and surveillance
• Detecting changes in the global forest cover
Data mining vs. Machine learning
In the 1960s, statisticians and economists used terms like data fishing or data
dredging to refer to what they considered the bad practice of analyzing data
without an a-priori hypothesis. The term "data mining" was used in a similarly
critical way by economist Michael Lovell in an article published in The Review of
Economics and Statistics in 1983.
Lovell, Michael C., Data Mining (1983). The Review of Economics and Statistics.
Data mining, also called knowledge discovery in databases, in computer science,
is the process of discovering interesting and useful patterns and relationships in
large volumes of data. The field combines tools from statistics and artificial
intelligence (such as neural networks and machine learning) with database
management to analyze large digital collections, known as data sets.
Christopher Clifton, Data mining (2019). Encyclopædia Britannica, inc.
Data mining vs. Machine learning
Machine learning algorithms build a mathematical model based on
sample data, known as "training data", in order to make predictions or
decisions without being explicitly programmed to perform the task.
Bishop, C. M., Pattern Recognition and Machine Learning (2006). Springer.
The definition "without being explicitly programmed" is often attributed
to Arthur Samuel, who coined the term "machine learning" in 1959.
Data mining vs. Machine learning
1. Meaning
   ● Data mining: extracting knowledge from a large amount of data
   ● Machine learning: extracting new algorithms from data as well as experience
2. History
   ● Data mining: introduced in 1930, initially referred to as knowledge discovery in databases
   ● Machine learning: introduced around 1950; the first program was Samuel's checker-playing program
3. Responsibility
   ● Data mining: used to get the rules from the existing data
   ● Machine learning: teaches the computer to learn and understand the given rules
4. Origin
   ● Data mining: traditional databases with unstructured data
   ● Machine learning: existing data as well as algorithms
Data mining vs. Machine learning
5. Implementation
   ● Data mining: we can develop our own models and apply data mining techniques in them
   ● Machine learning: we can use machine learning algorithms in decision trees, neural networks, and other areas of artificial intelligence
6. Nature
   ● Data mining: involves more human interference; largely manual
   ● Machine learning: automated; once designed, it is self-implementing, with no human effort
7. Techniques involved
   ● Data mining: more of a research activity, using methods such as machine learning
   ● Machine learning: self-learned; trains the system to perform the intelligent task
8. Scope
   ● Data mining: applied in a limited area
   ● Machine learning: can be used in a vast area
Machine learning
Focus on supervised learning
● Regression: Predict a value of a given continuous
valued variable based on the values of other
variables, assuming a linear or nonlinear model of
dependency
● Classification: Find a model for class attribute as a
function of the values of other attributes
Regression
Given a set of predictors X = (X1, …, X|X|) referring to the values of the variables
under analysis (|X| is the number of features), the goal is to choose a model f for
the relation Y = f(X) + ε and use it to estimate unknown values Ŷ = f̂(X).
Y can itself be a vector, Y = (Y1, …, Y|Y|).
Types of regressions
LINEAR REGRESSION
Y = β0 + β1 X1 + ⋯ + β|X| X|X| + ε
REGRESSION WITH INTERACTIONS
Y = β0 + β1 X1 + β2 X2 + β12 X1 X2 + ⋯ + ε
POLYNOMIAL REGRESSION
Y = β0 + β1 X1 + β12 X1² + ⋯ + β2 X2 + β22 X2² + ⋯ + ε
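A minimal scikit-learn sketch fitting a linear regression on synthetic data (the generating coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))     # two predictors X1, X2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)  # estimates beta0, beta1, beta2
print(model.intercept_, model.coef_)  # ~3.0, [~1.5, ~-2.0]
```

For the other two model types, scikit-learn's PolynomialFeatures can generate the interaction and power terms before fitting the same LinearRegression.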
Performance indicators for regression
MAE (Mean Absolute Error): the average absolute distance between each real value
and the predicted one
MSE (Mean Squared Error): the average squared difference between the estimated
values and the actual values
The lower the values of MAE and MSE, the better the model.
R² (R squared): the proportion of the variance in the dependent variable that is
predictable from the independent variables
R² close to 1 means most predicted values are close to the original ones
R² negative or close to 0 means wrong predictions
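A minimal sketch computing the three indicators with scikit-learn (the values are invented for illustration):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.8, 5.4, 7.0, 9.3]

print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of |y - y_hat|
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of (y - y_hat)^2
print("R2: ", r2_score(y_true, y_pred))             # 1 = perfect prediction
```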
Classification
• Given a collection of records
  Each record is characterized by a tuple (x, y), where x is the attribute set and
  y is the class label
  x: attribute, predictor, independent variable, input
  y: class, response, dependent variable, output
• Task:
  Learn a model that maps each attribute set x into one of the predefined class
  labels y
Examples of Classification Task

Task                        | Attribute set, x                               | Class label, y
Categorizing email messages | Features extracted from email message header   | spam or non-spam
                            | and content                                    |
Identifying tumor cells     | Features extracted from x-rays or MRI scans    | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images       | elliptical, spiral, or
                            |                                                | irregular-shaped galaxies
Regression vs. Classification
REGRESSION
● The predictive model produces as output a numerical estimate
● For example: the model analyzes some characteristics of houses and gives an
  estimate of their market price
CLASSIFICATION
● The predictive model produces as output a class or a category
● For example: the model analyzes some characteristics of flowers and labels them
  with their species
Approach for Building Classification Model

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Deduction: the learned model is applied to the test set to predict the unknown
class labels.
Example of a Decision Tree

Training data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

Data model: Decision Tree (splitting attributes Home Owner, MarSt, Income):
- Home Owner = Yes -> NO
- Home Owner = No:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced:
    - Income < 80K -> NO
    - Income >= 80K -> YES
Another Example of Decision Tree

Using the same training data, a different tree that splits on MarSt first:
- MarSt = Married -> NO
- MarSt = Single or Divorced:
  - Home Owner = Yes -> NO
  - Home Owner = No:
    - Income < 80K -> NO
    - Income >= 80K -> YES

There could be more than one tree that fits the same data!
Apply Model to Test Data
Start from the root of the tree.

Test data:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Traversing the first tree: Home Owner = No, so follow the No branch to the MarSt
node; MarSt = Married, so follow the Married branch to a leaf labeled NO (the
Income test is never reached for this record).

Assign Defaulted Borrower = "No".
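A minimal scikit-learn sketch of the same induction and deduction steps on the borrower table (the 0/1 encoding of the categorical attributes is my own; the induced tree may differ from the one drawn above, but it predicts the same label for this test record):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "home_owner": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "married":    [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],  # Single/Divorced -> 0
    "income":     [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
})
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(random_state=0).fit(train, defaulted)  # induction
test = pd.DataFrame({"home_owner": [0], "married": [1], "income": [80]})
print(tree.predict(test))                                            # deduction -> ['No']
```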
Decision Tree Classification Task
The same induction/deduction scheme as before, with a tree induction algorithm as
the learning algorithm and a decision tree as the learned model:
● Induction: the tree induction algorithm is applied to the training set (Tid 1-10,
  as above) to learn a decision tree.
● Deduction: the decision tree is applied to the test set (Tid 11-15, as above) to
  predict the unknown class labels.
Decision tree
• Two categories
trees used for regression problems
trees used for classification problems
• This means that this algorithm can be used both when the
dependent variable is continuous and when it is categorical
• Many Algorithms
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Robust to noise
Can easily handle redundant or irrelevant attributes (unless the
attributes are interacting)
Disadvantages:
Space of possible decision trees is exponentially large. Greedy
approaches are often unable to find the best tree.
Does not take into account interactions between attributes
Each decision boundary involves only a single attribute
Other classification techniques
• Logistic regression
  Uses a logistic function to model a binary dependent variable
• Random forest
  Ensemble method that constructs a multitude of decision trees at training time
  and outputs the class that is the mode of the classes of the individual trees
• Support vector machine (SVM)
  Finds a hyperplane in an N-dimensional space that separates data belonging to
  the different classes
• Nearest-Neighbor (K-NN)
  Uses the class labels of the K nearest neighbors to determine the class label of
  an unknown record (e.g., by taking a majority vote)
• Neural networks
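All of these techniques are available in scikit-learn behind the same fit/predict interface; a minimal sketch on synthetic data with default hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for clf in (LogisticRegression(), RandomForestClassifier(),
            SVC(), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))  # training accuracy
```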
Performance indicators for classification

ACCURACY = correct values / all values = (TP + TN) / (TP + FP + FN + TN)

SENSITIVITY = correct positives / real positives = TP / (TP + FN)

PRECISION = correct positives / predicted positives = TP / (TP + FP)
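A minimal sketch computing the three indicators from predicted labels with scikit-learn, which calls sensitivity "recall" (the labels are invented):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print("Accuracy:   ", accuracy_score(y_true, y_pred))  # (TP+TN)/all = 6/8
print("Sensitivity:", recall_score(y_true, y_pred))    # TP/(TP+FN) = 3/4
print("Precision:  ", precision_score(y_true, y_pred)) # TP/(TP+FP) = 3/4
```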
Prediction evaluation
(Figure: a train-test split divides the data into a Train set and a Test set.)
A training dataset is a dataset of examples used for learning, that is, to fit the
parameters of the model.
Most approaches that search through training data for empirical relationships tend
to overfit the data, meaning that they can identify apparent rules in the training
data that do not hold in general.
Prediction evaluation
A test dataset is independent of the training dataset, but follows the same
probability distribution as the training dataset.
In theory, a model fit to the training dataset should also fit the test dataset
well; otherwise it is a case of overfitting.
A test set is used only to assess the performance of a model.
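A minimal scikit-learn sketch of a train-test split (synthetic data; the 70/30 proportion is a common but arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)          # 70% train, 30% test

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.0: trees can overfit
print("test accuracy: ", model.score(X_test, y_test))    # the honest estimate
```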
Prediction evaluation
The validation dataset is used to tune the hyperparameters (i.e., the architecture
parameters) of a model. It is also used to avoid overfitting.
In this sense the data are divided into three parts: the training dataset is used
to train the candidate algorithms, the validation dataset is used to compare their
performances and decide which one to take, and, finally, the test dataset is used
to obtain the final performance.
(Figure: the train-test split, followed by a further split of the training portion
into Train and Validation parts.)

Cross validation: instead of a single validation split, the training portion is
divided into several folds; each fold V in turn serves as the validation set while
the remaining folds are used for training.
(Figure: the validation fold V rotating across the training portion of each split,
with the Test set always held out.)
Model underfitting and overfitting
● Underfitting occurs
when a model can’t
capture the
dependencies among
data, usually as a
consequence of its own
simplicity
● Overfitting happens
when a model learns both
dependencies among
data and random
fluctuations (i.e., learns
the existing data too well)
Complex models, which
have many features or
terms, are often prone to
overfitting
Which tree is better?
(Figures: a decision tree with 4 nodes and a decision tree with 50 nodes, each
shown with its decision boundaries on the training data.)
● Training error does not provide a good estimate of how well the tree will
  perform on previously unseen records
● Need ways for estimating generalization errors
Model Overfitting
• As the model becomes more and more complex, test errors can
start increasing even though training error may be decreasing
Underfitting: when model is too simple, both training and test errors are
large
Overfitting: when model is too complex, training error is small but test
error is large
Model Overfitting
(Figure: the same error curves computed using twice the number of data instances.)
• Increasing the size of the training data reduces the difference between training
  and testing errors at a given size of model