Week 1B - Data

Uploaded by

Hafidz Nur shafwan

1604C331 Data Mining

Week 1B:
Data

Odd Semester 2024-2025

20102620240829
Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Types of Data

What is Data/Dataset?

• Data/Dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples of attributes:
• eye color of a person
• temperature
– An attribute is also known as a variable, field, characteristic, dimension, or
feature.
• An object is described by a collection of attributes (attribute
vector or feature vector).
– Examples of objects:
• in a sales database: customer, store item, sales
• in a medical database: patient
• in a university database: student, professor, course
– An object is also known as a record, point, case, sample, entity, or instance.

Example dataset (each row is an object, each column is an attribute):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• The distribution of data involving 1 attribute is called univariate.
A bivariate distribution involves 2 attributes, …
A sample dataset (student info)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object.
• The same attribute can be mapped to different attribute values.
– Example: height can be measured in feet or meters.
• Different attributes can be mapped to the same set of values.
– Example: attribute values for ID and age are both integers.
• The properties of an attribute can differ from the properties of the values
used to represent it.
Measurement of Length
• The way an attribute is measured may not match the attribute's
properties.
Properties of Attribute Values
• A useful (and simple) way to specify the type of
an attribute is to identify the properties of
numbers that correspond to underlying
properties of the attribute.
• Example:
– An attribute such as length has many of the
properties of numbers.
– It makes sense to compare and order objects by
length, as well as to talk about the differences and
ratios of length.
Attribute Types

• Each attribute type possesses all the properties and operations of
the attribute types that precede it (nominal, ordinal, interval, ratio).
• The definition of the attribute types is cumulative: any property
or operation that is valid for nominal, ordinal, and interval
attributes is also valid for ratio attributes.
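
The cumulative definition can be sketched in a few lines of Python. This is only an illustration: the names `LEVELS`, `NEW_OPERATION`, and `valid_operations` are made up for this sketch, and the operation labels follow the usual textbook convention (distinctness, order, differences, ratios) rather than anything printed on the slide.

```python
# Attribute types in increasing order of "richness".
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The operation each type adds on top of the types before it
# (labels are illustrative, following the usual convention).
NEW_OPERATION = {
    "nominal": "distinctness (=, !=)",
    "ordinal": "order (<, >)",
    "interval": "differences (+, -)",
    "ratio": "ratios (*, /)",
}


def valid_operations(attribute_type):
    """Return every operation valid for a type, including inherited ones."""
    position = LEVELS.index(attribute_type)
    return [NEW_OPERATION[level] for level in LEVELS[: position + 1]]


for level in LEVELS:
    print(level, "->", valid_operations(level))
```

Length, for example, is a ratio attribute, so all four kinds of operation are meaningful for it.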
Attributes by the number of values
• DISCRETE attribute (typically, nominal and ordinal attributes)
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes and assume only
2 values (true/false, yes/no, male/female, 0/1)
• CONTINUOUS attribute (typically, interval and ratio attributes)
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
• The outcomes of the states are not equally important. One state is
interpreted as more informative than the other state.

• Only presence (a non-zero attribute value) is regarded as important

– Words present in document
– Items present in customer transactions

• If we met a friend in the grocery store would we ever say the following?
“I see your purchases are very similar since we didn’t buy most of the same
things.”
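
The grocery-store remark can be made concrete with a small sketch. It compares two ways of scoring similarity between binary purchase vectors: one that counts shared absences (the simple matching coefficient) and one that ignores them (the Jaccard coefficient). Both measures and the two baskets below are brought in here purely for illustration; they are not defined on this slide.

```python
def simple_matching(a, b):
    """Fraction of positions where both vectors agree (0-0 matches count)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """Shared presences divided by positions where either item is present."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical baskets over a 10-item catalogue: 1 = bought, 0 = not bought.
basket_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

print(simple_matching(basket_a, basket_b))  # inflated by shared absences
print(jaccard(basket_a, basket_b))          # only presences matter
```

The two shoppers have no item in common, yet counting the items that neither bought makes them look fairly similar, which is why only presence is treated as informative for asymmetric attributes.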
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Benzene Molecule: C6H6
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
Statistics of Data

Basic Statistical Descriptions of Data
• For data preprocessing to be successful, understand the data.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
std, outlier.
• Describe relations among multiple variables
– Numerical data: covariance and correlation coefficient
– Nominal data: χ² correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
Measuring the central tendency (1)
• Mean: n is the sample size and N is the population size.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

– Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

– Trimmed mean: chopping extreme values
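
The three means can be tried out with a minimal Python sketch. The income values are the Taxable Income column (in thousands) from the sample table earlier in this deck; the function names, the trimming fraction, and the scores and weights are assumptions chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Drop the lowest and highest `fraction` of the values, then average."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    kept = ordered[k: len(ordered) - k]
    return sum(kept) / len(kept)


# Taxable income (in K) from the sample table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

print(mean(incomes))               # pulled upward by the 220K value
print(trimmed_mean(incomes, 0.1))  # drops 60 and 220 before averaging

# Hypothetical scores and weights, for the weighted mean only.
print(weighted_mean([80, 90], [1, 3]))
```

Trimming one value from each end removes the 220K income, so the trimmed mean sits closer to where most of the values fall.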

Measuring the central tendency (2)
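• Median: middle value if odd number of values, or average of the middle
2 values otherwise.
– Estimated by interpolation (for grouped data):

\[ \text{median} \approx L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

where L_1 is the low limit of the median interval, (Σ freq)_l is the sum of the
frequencies before the median interval, freq_median is the frequency of the
median interval, and width is the interval width (L_2 – L_1).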
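
The interpolation formula can be sketched as a short Python function. The function name, the argument layout, and the example intervals and frequencies are hypothetical; they are not the data of any exercise in this deck.

```python
def grouped_median(intervals, frequencies):
    """Approximate the median of grouped data by linear interpolation.

    intervals   -- list of (low, high) limits, in increasing order
    frequencies -- number of values falling in each interval
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the frequencies must sum to a positive number")
    cumulative = 0  # sum of frequencies before the current interval
    for (low, high), freq in zip(intervals, frequencies):
        if cumulative + freq >= n / 2:
            # This is the median interval: apply L1 + ((n/2 - sum)/freq) * width.
            return low + (n / 2 - cumulative) / freq * (high - low)
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical grouped data: 20 values in three intervals of width 10.
example_intervals = [(0, 10), (10, 20), (20, 30)]
example_frequencies = [5, 10, 5]
print(grouped_median(example_intervals, example_frequencies))
```

Here n/2 = 10 falls in the second interval, which starts after 5 values, so the estimate lies halfway through that interval.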
Measuring the central tendency (3)
• Mode: value that occurs most frequently in the data
– Unimodal
• Empirical formula (for moderately skewed data): mean − mode ≈ 3 × (mean − median)

– Multi-modal: bimodal, trimodal
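
As a quick illustration, the standard-library function `statistics.multimode` returns every most-frequent value, which makes it easy to tell unimodal from multi-modal data. The two lists are the Marital Status and Refund columns of the sample table earlier in this deck; the helper `describe_modes` is a name invented for this sketch.

```python
from statistics import multimode

# Columns copied from the sample table (Tid 1 to 10).
marital_status = ["Single", "Married", "Single", "Married", "Divorced",
                  "Married", "Divorced", "Single", "Married", "Single"]
refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]


def describe_modes(values):
    """Return the modes and a label for how many modes there are."""
    modes = multimode(values)  # all values tied for the highest count
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multi-modal")


print(describe_modes(marital_status))
print(describe_modes(refund))
```

The mode is defined for nominal attributes too, which is why it can be computed on these columns while the mean and median cannot.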

Symmetric vs Skewed Data

[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data]
Measuring the dispersion of data (1)
Quartiles, outliers, and boxplots

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)

• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles, median is marked, add
whiskers, and plot outliers individually.
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
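
A minimal sketch of these quantities using Python's standard library follows. Quartile conventions differ between tools; this sketch relies on the default ("exclusive") method of `statistics.quantiles`, so Excel or another package may give slightly different Q1 and Q3. The income values are the Taxable Income column (in thousands) from the sample table; the function names are illustrative.

```python
from statistics import quantiles

# Taxable income (in K) from the sample table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, median, q3 = quantiles(values, n=4)  # default method: "exclusive"
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - factor * iqr
    high_fence = q3 + factor * iqr
    return [v for v in values if v < low_fence or v > high_fence]


print(five_number_summary(incomes))
print(iqr_outliers(incomes))
```

With these ten values, the 220K income is the only one beyond the upper fence.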
Measuring the dispersion of data (2)
Variance and standard deviation (sample: s, population: σ)
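• Variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

• Standard deviation: s (or σ) is the square root of the variance s² (or σ²)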

Note: The subtle difference of formulae for
sample vs. population
• n : the size of the sample
• N : the size of the population
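
The difference between the sample and population formulas can be checked numerically with a short sketch. The function names and the eight data values are made up for illustration; the standard library offers the same computations as `statistics.variance` and `statistics.pvariance`.

```python
import math


def sample_variance(values):
    """s^2: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Same s^2 via the sum of squares minus (sum)^2 / n."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


def population_variance(values):
    """sigma^2: squared deviations from the mean, divided by N."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

print(sample_variance(data), sample_variance_shortcut(data))
print(population_variance(data), math.sqrt(population_variance(data)))
```

Dividing by n − 1 instead of N makes the sample variance slightly larger than the population variance computed from the same numbers.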
Boxplot Analysis
• Five-number summary of a distribution:
– minimum, Q1, median, Q3, maximum.
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles
– The height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to minimum and maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
Properties of Normal Distribution Curve
[Figure: normal distribution curve; the horizontal extent of the curve represents
data dispersion (spread) and its center represents central tendency]
Graphic Displays of Basic Statistical Descriptions

• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value xi is paired with fi, indicating that approximately 100 fi% of the
data are ≤ xi
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding
quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent.

[Figure: example histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
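
Tabulating frequencies over adjacent, non-overlapping intervals can be sketched as follows. The bin boundaries (starting at 50, width 50, four bins) are arbitrary choices for illustration, and the values are the Taxable Income column (in thousands) from the sample table earlier in this deck.

```python
def histogram_counts(values, low, width, bins):
    """Count values in adjacent intervals [low + k*width, low + (k+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - low) // width)  # which interval the value falls into
        if 0 <= index < bins:
            counts[index] += 1
    return counts


# Taxable income (in K) from the sample table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Bins: [50, 100), [100, 150), [150, 200), [200, 250)
counts = histogram_counts(incomes, low=50, width=50, bins=4)
for k, count in enumerate(counts):
    print(f"[{50 + 50 * k}, {100 + 50 * k}): " + "#" * count)
```

Because every interval has the same width here, bar height and bar area carry the same information; with unequal widths the counts would need to be divided by the interval width.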
Histogram Often Tells More than Boxplot
• Two histograms shown on the right may have the same boxplot representation:
– the same values for: min, Q1, median, Q3, and max.
• But they have rather different data distributions
Quantile Plot
• Display all the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order
– fi indicates that approximately 100 fi% of the data are below or equal to
the value xi.
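
The pairs plotted in a quantile plot can be computed with a few lines of Python. The slide does not say how fi is obtained; this sketch assumes the common convention fi = (i − 0.5) / n for the i-th smallest value, and the function name and data are made up for illustration.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([30, 10, 20, 40])  # hypothetical data
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f shows all the data at once, so both the overall shape and any unusual values remain visible.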
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
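
The three situations (positive, negative, and no correlation) can be distinguished numerically with the Pearson correlation coefficient. The sketch below computes it from first principles; the function name and all data points are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # y grows with x
falling = [10, 8, 6, 4, 2]    # y shrinks as x grows
unrelated = [1, -1, 0, -1, 1]  # no linear trend

print(pearson_r(xs, rising))
print(pearson_r(xs, falling))
print(pearson_r(xs, unrelated))
```

A value near +1 or −1 corresponds to points lying close to a rising or falling line in the scatter plot, while a value near 0 corresponds to no linear pattern.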
Exercises

Homework
• Do all the exercises.
• You can write the solution on paper, or you can use tools like Excel
or Python and explain your work in detail, step by step, until you reach
the solution.
• Create a PDF file for your solution and submit it to ULS.
• You can upload one more file that you used for the computation
(.xlsx or .ipynb) along with your .pdf file. Upload the files
separately.
• Note: do not forget to put your Student ID and name on the first page
of the PDF file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:

Compute an approximate median value for the data.

Basic Statistics Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:

a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
Question?

