Analysis and Management
of Production System
Lesson 17: Data analysis
Prof. Giulia Bruno
Department of Management and
Production Engineering
[email protected]
Dataset
A dataset is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
A collection of attributes describes an object (rows = objects, columns = attributes):

ID | COLOR  | AGE | WEIGHT | ON
 1 | Orange | 24  | 70.24  | Yes
 2 | Blue   | 12  | 56.45  | Yes
 3 | Red    | 58  | 67.23  | Yes
 4 | Orange | 43  | 62.50  | Yes
 5 | Orange | 18  | 37.47  | No
 6 | Blue   | 19  | 81.35  | No
 7 | Green  | 62  | 44.45  | Yes
 8 | Orange | 33  | 23.34  | No
 9 | Green  | 20  | 26.35  | No
10 | Red    | 47  | 57.89  | Yes
11 | Orange | 39  | 52.98  | No
12 | Green  | 30  | 87.43  | Yes
13 | Blue   | 29  | 77.79  | No
Types of data

TYPE     | Examples                                    | Properties satisfied
NOMINAL  | ID numbers, eye color, gender               | 1 (distinctness: =, ≠)
ORDINAL  | rankings, hardness of minerals, grades      | 1, 2 (+ order: <, >)
INTERVAL | calendar dates, temperatures in Celsius, …  | 1, 2, 3 (+ differences: +, −)
RATIO    | temperature in Kelvin, length, counts, …    | 1, 2, 3, 4 (+ ratios: ∗, /)
Record Data
● Data matrix
  Same fixed set of numeric attributes; data objects can be thought of as points in
  a multi-dimensional space, where each dimension represents a distinct attribute.

  Projection of x Load | Projection of y Load | Distance | Load | Thickness
  10.23                | 5.27                 | 15.22    | 2.7  | 1.2
  12.65                | 6.25                 | 16.22    | 2.2  | 1.1

● Document data
  Each term is a component (attribute) of the vector; the value of each component
  is the number of times the corresponding term occurs in the document.

             | team | coach | play | ball | score | game | win | lost | timeout | season
  Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
  Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
  Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0

● Transaction data
  Each transaction involves a set of items (e.g., the set of products purchased by
  a customer during a shopping session).

  TID | Items
  1   | Bread, Coke, Milk
  2   | Beer, Bread
  3   | Beer, Coke, Diaper, Milk
  4   | Beer, Bread, Diaper, Milk
  5   | Coke, Diaper, Milk
Graph Data
Examples: generic graphs, molecules, webpages
(Figures: a small labeled graph and the benzene molecule, C6H6.)
Ordered Data
● Sequences of data (e.g., a genomic sequence):
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
● Sequences of transactions
● Spatio-temporal data (e.g., the average monthly temperature of land and ocean)
Important characteristics of data
● Dimensionality (number of attributes)
High dimensional data brings a number of challenges
● Sparsity
Only presence counts
● Resolution
Patterns depend on the scale
● Size
Type of analysis may depend on size of data
From data to knowledge (KDD)
Data preprocessing/cleaning
The representation and quality of the data must be addressed before running any
analysis: if there is much irrelevant or redundant information, or the data are
noisy and unreliable, knowledge discovery becomes more difficult.
EXAMPLES OF DATA QUALITY PROBLEMS
● Noise and outliers
● Missing values
● Duplicate data
● Wrong data
Noise and outliers
Outliers are data objects with characteristics that are considerably different from
those of most other data objects in the data set.
● Case 1: outliers are noise that interferes with data analysis
● Case 2: outliers are the goal of the analysis
A box plot is a useful graphical method for describing the behavior of the data and
identifying outliers.
Examples of applications: detection of credit card fraud (anomalous sets of
purchases), detection of unusual patterns in medical diagnosis.
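A box plot flags as outliers the points that fall outside the whiskers, conventionally 1.5 × IQR beyond the quartiles. A minimal Python sketch of this rule (the values are loosely based on the WEIGHT column above, with an artificial outlier appended):

```python
import numpy as np

def iqr_outliers(values):
    """Flag values outside the 1.5*IQR whiskers of a box plot."""
    q1, q3 = np.percentile(values, [25, 75])    # first and third quartiles
    iqr = q3 - q1                               # interquartile range
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits
    return [v for v in values if v < low or v > high]

weights = [70.24, 56.45, 67.23, 62.50, 37.47, 81.35, 44.45, 23.34, 200.0]
print(iqr_outliers(weights))  # -> [200.0]
```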
Missing values
Reasons for missing values
● Information is not collected
● Attributes may not be applicable to all cases
Handling missing values
● Eliminate data objects or attributes
● Estimate missing values
● Ignore the missing values during analysis
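The three strategies above map directly onto standard pandas operations; a minimal sketch (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [24, np.nan, 58], "weight": [70.24, 56.45, np.nan]})

dropped = df.dropna()             # eliminate objects with missing values
estimated = df.fillna(df.mean())  # estimate: impute with the column mean
mean_age = df["age"].mean()       # ignore: pandas skips NaN in aggregations
```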
Duplicate data
A data set may contain data objects that are duplicates, or almost duplicates, of
one another (for example, the same person appearing with multiple email addresses).
This can happen when data from heterogeneous sources are merged.
When is a data object considered a duplicate?
When its similarity to another object is equal to 1.
Similarity
● Numerical measure of how alike two data objects are
● It is higher when objects are more alike
● Often falls in the range [0, 1]
Different distance definitions can be used to express similarity, depending on the
attribute type:

ATTRIBUTE TYPE    | DISTANCE (DISSIMILARITY)
Nominal           | d(x, y) = 0 if x = y, 1 if x ≠ y
Ordinal           | d(x, y) = |x − y| / (n − 1)
                  | (values mapped to the integers 0 to n−1, where n is the number of values)
Interval or Ratio | d(x, y) = |x − y| for a single attribute,
                  | d(x, y) = √( Σₖ (xₖ − yₖ)² ) (Euclidean distance) over n attributes

Similarity: s = 1 − d
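A minimal Python sketch of these distance and similarity definitions (the function names are my own):

```python
import math

def d_nominal(x, y):
    """0 if equal, 1 otherwise."""
    return 0 if x == y else 1

def d_ordinal(x, y, n):
    """x, y already mapped to integers 0..n-1."""
    return abs(x - y) / (n - 1)

def d_euclidean(x, y):
    """Euclidean distance between two points with the same attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Similarity from a distance in [0, 1]: s = 1 - d
print(d_nominal("Orange", "Blue"))                # 1
print(d_ordinal(0, 2, n=4))                       # 0.666...
print(d_euclidean((10.23, 5.27), (12.65, 6.25)))  # ~2.61
```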
Wrong data
The value of a certain attribute can depend on the values of some others, so it is
possible to check whether the dependence is satisfied.
Example: the Italian fiscal code, which depends on the holder's personal data.
Using the date of birth (encoded as the last two digits of the birth year, a letter
representing the month, and two digits for the day), we can easily find out which of
these three records are correct:
CSTGPP73A25I452U 25/01/1973
NDDPRI82E30A859Z 31/05/1982
CSTGPP00A01G732I 01/01/2000
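A minimal sketch of such a dependency check on the date-of-birth part of the code (simplified: it uses the standard month-letter table but ignores the +40 day offset used in codes issued to women):

```python
# Month letters used in the Italian fiscal code
MONTHS = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "H": 6,
          "L": 7, "M": 8, "P": 9, "R": 10, "S": 11, "T": 12}

def birth_date_matches(code, day, month, year):
    """Check characters 7-11 of the code against the declared date of birth."""
    return (int(code[6:8]) == year % 100
            and MONTHS.get(code[8]) == month
            and int(code[9:11]) == day)

print(birth_date_matches("CSTGPP73A25I452U", 25, 1, 1973))  # True
print(birth_date_matches("NDDPRI82E30A859Z", 31, 5, 1982))  # False: code says day 30
print(birth_date_matches("CSTGPP00A01G732I", 1, 1, 2000))   # True
```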
Data Transformation
Techniques for data transformation
● Aggregation
● Sampling
● Dimensionality Reduction
  • Feature selection
  • Feature creation
  • Mapping data to new space
● Discretization and Binarization
● Attribute Transformation
Aggregation
● Combining two or more attributes (or objects) into a
single attribute (or object)
● Purpose
Data reduction
• Reduce the number of attributes or objects
Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
More “stable” data
• Aggregated data tends to have less variability
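A minimal pandas sketch of a change-of-scale aggregation, rolling daily records up into months (the column names and values are invented):

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Days aggregated into months: fewer objects, more stable values
monthly = daily.resample("MS", on="date")["sales"].agg(["sum", "mean"])
print(monthly)
```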
Sampling
● Main technique used for data reduction
considering the entire set of data of interest can be too
expensive or time consuming
● Key principle: using a sample will work almost as well
as using the entire data set, if the sample is
representative
a sample is representative if it has approximately the same
properties as the original set of data
(Figure: the same point set drawn with 8000, 2000, and 500 sampled points.)
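A minimal pandas sketch of simple random sampling (the data are invented; a fixed seed makes the sample reproducible):

```python
import pandas as pd

df = pd.DataFrame({"x": range(8000)})

# Simple random sample without replacement
sample = df.sample(n=500, random_state=42)
print(len(sample))  # 500
```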
Dimensionality Reduction
Reduction of the dimensionality of the dataset, i.e., the number of
attributes
reduce amount of time and memory required by algorithms
allow data to be more easily visualized
may help to eliminate irrelevant features or reduce noise
● Techniques
Feature selection: remove redundant features (e.g., the purchase price of a product
carries the same information as the amount of sales tax paid) or irrelevant features
(e.g., a student's ID is irrelevant to the task of predicting the student's marks)
Feature creation: create new attributes that can capture the
important information in a data set much more efficiently than the
original attributes
• Feature extraction (e.g., extracting edges from images)
• Feature construction (e.g., dividing mass by volume to get density)
• Mapping data to new space (e.g., Fourier and wavelet analysis)
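As one concrete example of feature extraction by mapping data to a new space, here is a minimal principal component analysis (PCA) sketch with scikit-learn; PCA is a standard technique, though it is not named on the slide (synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 objects, 5 numeric attributes

# Project onto the 2 directions of maximum variance
Z = PCA(n_components=2).fit_transform(X)
print(Z.shape)                  # (100, 2)
```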
Feature selection vs. Feature extraction
FEATURE SELECTION
(X1, …, Xp) → (Xk1, …, Xkm): keep a subset of m of the original p features
FEATURE EXTRACTION
(X1, …, Xp) → (Z1, …, Zm), where Z1 = f1(X1, …, Xp), …, Zm = fm(X1, …, Xp):
compute m new features as functions of the original ones
Mapping data to a new space
(Figure: the same signal shown in the time domain and in the frequency domain.)
Discretization
● Process of converting a continuous attribute into an ordinal
attribute
A potentially infinite number of values are mapped into a
small number of categories
Discretization is commonly used in classification
Many classification algorithms work best if both the
independent and dependent variables have only a few values
Binarization
● Binarization maps a continuous or categorical attribute into
one or more binary variables
● Typically used for association analysis
● A continuous attribute is often first converted to a categorical attribute, and
  the categorical attribute is then converted to a set of binary attributes
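A minimal pandas sketch of both steps, first discretizing a continuous age attribute into categories and then binarizing the categories (the bin edges and labels are arbitrary):

```python
import pandas as pd

ages = pd.Series([24, 12, 58, 43, 18])

# Discretization: continuous -> ordinal categories
age_group = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])

# Binarization: one binary attribute per category (one-hot encoding)
binary = pd.get_dummies(age_group)
print(binary)
```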
Attribute Transformation
● An attribute transformation is a function that maps
the entire set of values of a given attribute to a new set
of replacement values such that each old value can be
identified with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Normalization
• Refers to various techniques to adjust for differences among attributes in terms
  of frequency of occurrence, mean, variance, range
• Takes out unwanted, common signal, e.g., seasonality
In statistics, standardization refers to subtracting the mean and dividing by the
standard deviation
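A minimal sketch of standardization, z = (x − mean) / std, using scikit-learn (the same result can be computed with plain NumPy):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[70.24], [56.45], [67.23], [37.47]])

# Standardization: zero mean, unit standard deviation per attribute
Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0), Z.std(axis=0))  # ~[0.] [1.]
```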
Data mining
Predictive modeling – supervised learning
● Regression: Predict a value of a given continuous
valued variable based on the values of other
variables, assuming a linear or nonlinear model of
dependency
● Classification: Find a model for class attribute as a
function of the values of other attributes
Clustering – unsupervised learning
● Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups
(Figure: well-separated groups of points. Intra-cluster distances are minimized;
inter-cluster distances are maximized.)
Association Rule
● Given a set of records, each of which contains some number of items from a given
  collection, produce dependency rules that predict the occurrence of an item based
  on occurrences of other items.

TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
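A minimal Python sketch computing the support and confidence of the two discovered rules over these five transactions (the rule-mining algorithm itself, e.g. Apriori, is not shown):

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """How often the rule lhs -> rhs holds when lhs is present."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.666...
```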
Deviation / Anomaly / Change Detection
• Detect significant deviations from normal
behavior
• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
• Identify anomalous behavior from sensor
networks for monitoring and surveillance
• Detecting changes in the global forest cover
Data mining vs. Machine learning
In the 1960s, statisticians and economists used terms like data fishing or data
dredging to refer to what they considered the bad practice of analyzing data
without an a-priori hypothesis. The term "data mining" was used in a similarly
critical way by economist Michael Lovell in an article published in The Review of
Economics and Statistics in 1983.
Lovell, Michael C., Data Mining (1983). The Review of Economics and Statistics.
Data mining, also called knowledge discovery in databases, in computer science,
is the process of discovering interesting and useful patterns and relationships in
large volumes of data. The field combines tools from statistics and artificial
intelligence (such as neural networks and machine learning) with database
management to analyze large digital collections, known as data sets.
Christopher Clifton, Data mining (2019). Encyclopædia Britannica, inc.
Data mining vs. Machine learning
Machine learning algorithms build a mathematical model based on
sample data, known as "training data", in order to make predictions or
decisions without being explicitly programmed to perform the task.
Bishop, C. M., Pattern Recognition and Machine Learning (2006). Springer.
The definition "without being explicitly programmed" is often attributed
to Arthur Samuel, who coined the term "machine learning" in 1959.
Data mining vs. Machine learning
1. Meaning
   ● Data mining: extracting knowledge from a large amount of data
   ● Machine learning: extracting new algorithms from data as well as experience
2. History
   ● Data mining: introduced in 1930, initially referred to as knowledge discovery in databases
   ● Machine learning: introduced around 1950; the first program was Samuel's checker-playing program
3. Responsibility
   ● Data mining: used to get the rules from the existing data
   ● Machine learning: teaches the computer to learn and understand the given rules
4. Origin
   ● Data mining: traditional databases with unstructured data
   ● Machine learning: existing data as well as algorithms
Data mining vs. Machine learning
5. Implementation
   ● Data mining: we can develop our own models and apply data mining techniques in them
   ● Machine learning: we can use machine learning algorithms in decision trees, neural networks, and other areas of artificial intelligence
6. Nature
   ● Data mining: involves more human interference; largely manual
   ● Machine learning: automated; once designed, it is self-implementing, with no human effort
7. Techniques involved
   ● Data mining: more of a research activity, using methods such as machine learning
   ● Machine learning: self-learned; trains the system to perform the intelligent task
8. Scope
   ● Data mining: applied in a limited area
   ● Machine learning: can be used in a vast area
Machine learning
Focus on supervised learning
● Regression: Predict a value of a given continuous
valued variable based on the values of other
variables, assuming a linear or nonlinear model of
dependency
● Classification: Find a model for class attribute as a
function of the values of other attributes
Regression
Given a set of predictors X = (X1, …, X|X|) referring to the values of the variables
under analysis (|X| is the number of features), the goal is to choose a model f for
the relation Y = f(X) + ε and use it to estimate unknown values Ŷ = f̂(X).
Y can itself be a vector, Y = (Y1, …, Y|Y|).
Types of regressions
LINEAR REGRESSION
Y = β0 + β1 X1 + ⋯ + β|X| X|X| + ε
REGRESSION WITH INTERACTIONS
Y = β0 + β1 X1 + β2 X2 + β12 X1 X2 + ⋯ + ε
POLYNOMIAL REGRESSION
Y = β0 + β1 X1 + β12 X1² + ⋯ + β2 X2 + β22 X2² + ⋯ + ε
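A minimal scikit-learn sketch fitting a linear regression on synthetic data (the generating coefficients are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))     # two predictors X1, X2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)  # estimates beta0, beta1, beta2
print(model.intercept_, model.coef_)  # ~3.0, [~1.5, ~-2.0]
```

For the other two model types, scikit-learn's PolynomialFeatures can generate the interaction and power terms before fitting the same LinearRegression.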
Performance indicators for regression
MAE (Mean Absolute Error): the average absolute distance between each real value
and the predicted one
MSE (Mean Squared Error): the average squared difference between the estimated
values and the actual values
The lower the values of MAE and MSE, the better the model.
R² (R squared): the proportion of the variance in the dependent variable that is
predictable from the independent variables
R² close to 1 means most predicted values are close to the original ones
R² negative or close to 0 means wrong predictions
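A minimal sketch computing the three indicators with scikit-learn (the values are invented for illustration):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.5, 9.0]
y_pred = [2.8, 5.4, 7.0, 9.3]

print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of |y - y_hat|
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of (y - y_hat)^2
print("R2: ", r2_score(y_true, y_pred))             # 1 = perfect prediction
```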
Classification
• Given a collection of records
  Each record is characterized by a tuple (x, y), where x is the attribute set and
  y is the class label
  x: attribute, predictor, independent variable, input
  y: class, response, dependent variable, output
• Task:
  Learn a model that maps each attribute set x into one of the predefined class
  labels y
Examples of Classification Task

Task                        | Attribute set, x                               | Class label, y
Categorizing email messages | Features extracted from email message header   | spam or non-spam
                            | and content                                    |
Identifying tumor cells     | Features extracted from x-rays or MRI scans    | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images       | elliptical, spiral, or
                            |                                                | irregular-shaped galaxies
Regression vs. Classification
REGRESSION
● The predictive model produces as output a numerical estimate
● For example: the model analyzes some characteristics of houses and gives an
  estimate of their market price
CLASSIFICATION
● The predictive model produces as output a class or a category
● For example: the model analyzes some characteristics of flowers and labels them
  with their species
Approach for Building Classification Model

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Deduction: the learned model is applied to the test set to predict the unknown
class labels.
Example of a Decision Tree

Training data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

Data model: Decision Tree (splitting attributes Home Owner, MarSt, Income):
- Home Owner = Yes -> NO
- Home Owner = No:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced:
    - Income < 80K -> NO
    - Income >= 80K -> YES
Another Example of Decision Tree

Using the same training data, a different tree that splits on MarSt first:
- MarSt = Married -> NO
- MarSt = Single or Divorced:
  - Home Owner = Yes -> NO
  - Home Owner = No:
    - Income < 80K -> NO
    - Income >= 80K -> YES

There could be more than one tree that fits the same data!
Apply Model to Test Data
Start from the root of the tree.

Test data:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Traversing the first tree: Home Owner = No, so follow the No branch to the MarSt
node; MarSt = Married, so follow the Married branch to a leaf labeled NO (the
Income test is never reached for this record).

Assign Defaulted Borrower = "No".
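A minimal scikit-learn sketch of the same induction and deduction steps on the borrower table (the 0/1 encoding of the categorical attributes is my own; the induced tree may differ from the one drawn above, but it predicts the same label for this test record):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "home_owner": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "married":    [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],  # Single/Divorced -> 0
    "income":     [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
})
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(random_state=0).fit(train, defaulted)  # induction
test = pd.DataFrame({"home_owner": [0], "married": [1], "income": [80]})
print(tree.predict(test))                                            # deduction -> ['No']
```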
Decision Tree Classification Task
The same induction/deduction scheme as before, with a tree induction algorithm as
the learning algorithm and a decision tree as the learned model:
● Induction: the tree induction algorithm is applied to the training set (Tid 1-10,
  as above) to learn a decision tree.
● Deduction: the decision tree is applied to the test set (Tid 11-15, as above) to
  predict the unknown class labels.
Decision tree
• Two categories
trees used for regression problems
trees used for classification problems
• This means that this algorithm can be used both when the
dependent variable is continuous and when it is categorical
• Many Algorithms
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Robust to noise
Can easily handle redundant or irrelevant attributes (unless the
attributes are interacting)
Disadvantages:
Space of possible decision trees is exponentially large. Greedy
approaches are often unable to find the best tree.
Does not take into account interactions between attributes
Each decision boundary involves only a single attribute
Other classification techniques
• Logistic regression
  Uses a logistic function to model a binary dependent variable
• Random forest
  Ensemble method that constructs a multitude of decision trees at training time
  and outputs the class that is the mode of the classes of the individual trees
• Support vector machine (SVM)
  Finds a hyperplane in an N-dimensional space that separates data belonging to
  the different classes
• Nearest-Neighbor (K-NN)
  Uses the class labels of the K nearest neighbors to determine the class label of
  an unknown record (e.g., by taking a majority vote)
• Neural networks
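All of these techniques are available in scikit-learn behind the same fit/predict interface; a minimal sketch on synthetic data with default hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for clf in (LogisticRegression(), RandomForestClassifier(),
            SVC(), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))  # training accuracy
```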
Performance indicators for classification

ACCURACY = correct values / all values = (TP + TN) / (TP + FP + FN + TN)

SENSITIVITY = correct positives / real positives = TP / (TP + FN)

PRECISION = correct positives / predicted positives = TP / (TP + FP)
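A minimal sketch computing the three indicators from predicted labels with scikit-learn, which calls sensitivity "recall" (the labels are invented):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print("Accuracy:   ", accuracy_score(y_true, y_pred))  # (TP+TN)/all = 6/8
print("Sensitivity:", recall_score(y_true, y_pred))    # TP/(TP+FN) = 3/4
print("Precision:  ", precision_score(y_true, y_pred)) # TP/(TP+FP) = 3/4
```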
Prediction evaluation
(Figure: a train-test split divides the data into a Train set and a Test set.)
A training dataset is a dataset of examples used for learning, that is, to fit the
parameters of the model.
Most approaches that search through training data for empirical relationships tend
to overfit the data, meaning that they can identify apparent rules in the training
data that do not hold in general.
Prediction evaluation
A test dataset is independent of the training dataset, but follows the same
probability distribution as the training dataset.
In theory, a model fit to the training dataset should also fit the test dataset
well; otherwise it is a case of overfitting.
A test set is used only to assess the performance of a model.
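A minimal scikit-learn sketch of a train-test split (synthetic data; the 70/30 proportion is a common but arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)          # 70% train, 30% test

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.0: trees can overfit
print("test accuracy: ", model.score(X_test, y_test))    # the honest estimate
```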
Prediction evaluation
The validation dataset is used to tune the hyperparameters (i.e., the architecture
parameters) of a model. It is also used to avoid overfitting.
In this sense the data are divided into three parts: the training dataset is used
to train the candidate algorithms, the validation dataset is used to compare their
performances and decide which one to take, and, finally, the test dataset is used
to obtain the final performance.
(Figure: the train-test split, followed by a further split of the training portion
into Train and Validation parts.)

Cross validation: instead of a single validation split, the training portion is
divided into several folds; each fold V in turn serves as the validation set while
the remaining folds are used for training.
(Figure: the validation fold V rotating across the training portion of each split,
with the Test set always held out.)
Model underfitting and overfitting
● Underfitting occurs
when a model can’t
capture the
dependencies among
data, usually as a
consequence of its own
simplicity
● Overfitting happens
when a model learns both
dependencies among
data and random
fluctuations (i.e., learns
the existing data too well)
Complex models, which
have many features or
terms, are often prone to
overfitting
Which tree is better?
(Figures: a decision tree with 4 nodes and a decision tree with 50 nodes, each
shown with its decision boundaries on the training data.)
● Training error does not provide a good estimate of how well the tree will
  perform on previously unseen records
● Need ways for estimating generalization errors
Model Overfitting
• As the model becomes more and more complex, test errors can
start increasing even though training error may be decreasing
Underfitting: when model is too simple, both training and test errors are
large
Overfitting: when model is too complex, training error is small but test
error is large
Model Overfitting
(Figure: the same error curves computed using twice the number of data instances.)
• Increasing the size of the training data reduces the difference between training
  and testing errors at a given size of model