1604C331 Data Mining
Week 1B:
Data
Odd Semester 2024-2025
20102620240829
Informatics Engineering
Faculty of Engineering | Universitas Surabaya
Types of Data
What is Data/Dataset?
Example dataset (each column is an attribute, each row is an object):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
• Data/Dataset is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples of attribute:
• eye color of a person
• temperature
– Attribute is also known as variable, field, characteristic, dimension, feature.
• An object is described by a collection of attributes (attribute vector or feature vector).
– Examples of objects:
• in a sales database: customer, store item, sales
• in a medical database: patient
• in a university database: student, professor, course
– Object is also known as record, point, case, sample, entity, instance.
• The distribution of data involving 1 attribute is called univariate.
A bivariate distribution involves 2 attributes, …
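As a rough illustration of the definitions above, the sketch below stores a few objects as Python dictionaries, one key per attribute. The rows are taken from the example table; reading "125K" as 125000 and the variable names (`dataset`, `income_values`, `status_income_pairs`) are assumptions made for this example.

```python
# Each object (record) is a dict; each key is an attribute.
# Assumption: "125K" in the table is read as 125000.
dataset = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single", "Taxable Income": 125_000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married", "Taxable Income": 100_000, "Cheat": "No"},
    {"Tid": 5, "Refund": "No", "Marital Status": "Divorced", "Taxable Income": 95_000, "Cheat": "Yes"},
]

# The attribute (feature) vector of the first object.
feature_vector = list(dataset[0].values())

# Univariate view: the values of one attribute across all objects.
income_values = [obj["Taxable Income"] for obj in dataset]

# Bivariate view: pairs of values taken from two attributes.
status_income_pairs = [(obj["Marital Status"], obj["Taxable Income"]) for obj in dataset]

print(feature_vector)
print(income_values)
print(status_income_pairs)
```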
A sample dataset (student info)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object
• The same attribute can be mapped to different attribute values
– Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
– Example: attribute values for ID and age are both integers.
• The properties of an attribute can differ from the properties of the values used
to represent it.
Measurement of Length
• The way an attribute is measured may not match the properties of the
attribute.
Properties of Attribute Values
• A useful (and simple) way to specify the type of
an attribute is to identify the properties of
numbers that correspond to underlying
properties of the attribute.
• Example:
– An attribute such as length has many of the
properties of numbers.
– It makes sense to compare and order objects by
length, as well as to talk about the differences and
ratios of length.
Attribute Types
• Each attribute type possesses
all the properties and
operations of the attribute
types that precede it.
• The definition of the
attribute types is
cumulative: any property
or operation that is valid
for nominal, ordinal, and
interval attributes is also
valid for ratio attributes.
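The cumulative definition can be sketched in a few lines of Python. The operations listed for each level follow the usual textbook taxonomy (distinctness, order, differences, ratios); that mapping and the names `NEW_OPERATIONS`, `TYPE_ORDER`, and `valid_operations` are assumptions for illustration.

```python
# Operations introduced at each level, assuming the usual order
# nominal -> ordinal -> interval -> ratio.
NEW_OPERATIONS = {
    "nominal": ["==", "!="],   # distinctness
    "ordinal": ["<", ">"],     # order
    "interval": ["+", "-"],    # differences are meaningful
    "ratio": ["*", "/"],       # ratios are meaningful
}
TYPE_ORDER = ["nominal", "ordinal", "interval", "ratio"]


def valid_operations(attribute_type):
    """Return every operation valid for the type, including inherited ones."""
    level = TYPE_ORDER.index(attribute_type)
    ops = []
    for name in TYPE_ORDER[: level + 1]:
        ops.extend(NEW_OPERATIONS[name])
    return ops


print(valid_operations("ordinal"))
print(valid_operations("ratio"))
```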
Attributes by the number of values
• DISCRETE attribute (typically, nominal and ordinal attributes)
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes and assume only
2 values (true/false, yes/no, male/female, 0/1)
• CONTINUOUS attribute (typically, interval and ratio attributes)
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes
• The outcomes of the states are not equally important. One state is
interpreted as more informative than the other state.
• Only presence (a non-zero attribute value) is regarded as important
– Words present in document
– Items present in customer transactions
• If we met a friend in the grocery store, would we ever say the following?
“I see our purchases are very similar since we didn’t buy most of the same
things.”
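One way to make the grocery example concrete is to compare a measure that counts shared absences with one that ignores them. The sketch below uses the simple matching coefficient and the Jaccard coefficient as illustrations; the slides do not name these measures, and the two baskets are made-up values.

```python
def simple_matching(a, b):
    """Fraction of positions where two binary vectors agree (0-0 matches count)."""
    agree = sum(1 for x, y in zip(a, b) if x == y)
    return agree / len(a)


def jaccard(a, b):
    """Shared presences divided by positions where at least one vector is 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical baskets over 10 store items: 1 = bought, 0 = not bought.
basket_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
basket_b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

smc = simple_matching(basket_a, basket_b)
jac = jaccard(basket_a, basket_b)
print(smc, jac)
```

The shared absences push the simple matching score up, while the Jaccard score depends only on the items that were actually bought.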
Types of Dataset
• Record
– Relational records
– Data matrix: numerical matrix, crosstabs
– Document data: text document, term-frequency vector
– Transaction data
• Graph and Network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
– Spatial data: maps
Benzene Molecule: C6H6
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
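As a small sketch, the transaction table above can be held as a mapping from TID to a set of items and then converted into a binary record per transaction, where 1 marks an item that is present. The names `transactions`, `binary_rows`, and `item_counts` are illustrative.

```python
from collections import Counter

# Transaction data copied from the table: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# All distinct items, in alphabetical order, become the attributes.
items = sorted(set().union(*transactions.values()))

# One binary row per transaction: 1 = item present, 0 = absent.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# How many transactions contain each item.
item_counts = Counter(item for basket in transactions.values() for item in basket)

print(items)
print(binary_rows[1])
print(item_counts["Milk"])
```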
General Characteristics of Datasets
• Dimensionality
– Curse of dimensionality
• Distribution
– Centrality and dispersion
• Resolution
– Pattern depends on the scale
Statistics of Data
Basic Statistical Descriptions of Data
• For data preprocessing to be successful, understand the data.
• Measures of central tendency: measure the location of the middle or center of
a data distribution.
– Given an attribute, where do most of its values fall?
– Mean, median, mode, …
• Dispersion of data
– How are the data spread out?
– Range, quartiles, interquartile range, five-number summary and boxplot, variance,
std, outlier.
• Describe relations among multiple variables
– Numerical data: co-variance and correlation coefficient
– Nominal data: χ² (chi-square) correlation test
• Visually inspect data using graphic displays
– Bar charts, pie charts, line graphs, histogram, scatter plots
Measuring the central tendency (1)
• Mean: n is sample size and N is population size.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
– Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
– Trimmed mean: chopping extreme values
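The three variants of the mean can be sketched in plain Python as follows. The data values are made up, and the parameter `k` (how many values to chop from each end) is a hypothetical choice, since the slide does not say how much to trim.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    kept = ordered[k: len(ordered) - k]
    return sum(kept) / len(kept)


data = [2, 4, 4, 5, 100]  # made-up values; 100 is an extreme value

print(mean(data))                        # pulled upward by the extreme value
print(trimmed_mean(data, 1))             # extreme values chopped
print(weighted_mean([80, 90], [1, 3]))   # second value counts three times
```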
Measuring the central tendency (2)
• Median: middle value if odd number of values, or average of the middle
2 values otherwise.
– Estimated by interpolation (for grouped data)
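Approximate median:
$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
where $L_1$ is the lower limit of the median interval, $\left(\sum \text{freq}\right)_l$ is the sum of the frequencies of the intervals before the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width ($L_2 - L_1$).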
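A minimal sketch of the interpolation formula for grouped data is shown below. The function name `grouped_median`, the `(low, high, freq)` tuple layout, and the example intervals are assumptions for illustration; they are not the intervals from the exercise.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (low, high, freq) tuples, sorted by low.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")
    cumulative = 0  # sum of frequencies before the current interval
    for low, high, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return low + (n / 2 - cumulative) / freq * (high - low)
        cumulative += freq


# Made-up grouped data: 20 observations in three intervals.
example = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
approx_median = grouped_median(example)
print(approx_median)
```

Here n/2 = 10, the median interval is 10 to 20, the sum of the frequencies before it is 5, and its own frequency is 10, so the result is 10 + (10 − 5)/10 × 10 = 15.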
Measuring the central tendency (3)
• Mode: value that occurs most frequently in the data
– Unimodal
• Empirical formula (moderately skewed unimodal data): mean − mode ≈ 3 × (mean − median)
– Multi-modal: bimodal, trimodal
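A short sketch using the standard library shows the difference between unimodal and multi-modal data. It assumes Python 3.8 or later for `statistics.multimode`; the data values are made up.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3]        # a single most frequent value
bimodal = [1, 2, 2, 3, 3]      # two values tie for most frequent

unimodal_modes = multimode(unimodal)
bimodal_modes = multimode(bimodal)

print(unimodal_modes)
print(bimodal_modes)
```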
Symmetric vs Skewed Data
[Figure: three distribution curves: symmetric, positively skewed, and negatively skewed]
Measuring the dispersion of data (1)
Quartiles, outliers, and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles, median is marked, add
whiskers, and plot outliers individually.
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
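The quantities on this slide can be computed with the standard library as sketched below. Quartile conventions differ between tools, so the choice of `method="inclusive"` in `statistics.quantiles` (Python 3.8+) is an assumption, and the data values are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # made-up values; 50 is suspicious

# Quartiles: n=4 splits the data into four groups and returns the 3 cut points.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Usual rule of thumb: flag values beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, median, q3, max(data))

print(five_number_summary)
print(iqr, lower_fence, upper_fence)
print(outliers)
```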
Measuring the dispersion of data (2)
Variance and standard deviation (sample: s, population: σ)
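• Variance (sample):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
• Variance (population):
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$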
• Standard deviation: s (or σ) is the square root of the variance s² (or σ²)
Note: The subtle difference of formulae for
sample vs. population
• n : the size of the sample
• N : the size of the population
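The sketch below checks numerically that the definition and the shortcut form of each variance formula agree, and shows the n − 1 versus N difference. The data values are made up, and the same list is treated once as a sample and once as a whole population purely for comparison.

```python
from math import sqrt
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
n = len(data)
mean = sum(data) / n

# Sample variance: divide by n - 1.
sample_var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_var_shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Population variance: divide by N (here the same count, n).
population_var_definition = sum((x - mean) ** 2 for x in data) / n
population_var_shortcut = sum(x * x for x in data) / n - mean ** 2

sample_std = sqrt(sample_var_definition)
population_std = sqrt(population_var_definition)

print(sample_var_definition, variance(data))       # library sample variance
print(population_var_definition, pvariance(data))  # library population variance
print(sample_std, population_std)
```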
Boxplot Analysis
• Five-number summary of a distribution:
– minimum, Q1, median, Q3, maximum.
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles
– The height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to minimum and maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
Properties of Normal Distribution Curve
[Figure: normal distribution curve. The width of the curve represents data dispersion (spread); the center represents central tendency.]
Graphic Displays of Basic Statistical Descriptions
• Boxplot
– graphic display of five-number summary
• Histogram
– x-axis represents values
– y-axis represents frequencies
• Quantile plot
– each value xi is paired with fi indicating that approximately 100 fi% of data are ≤
xi
• Quantile-quantile (q-q) plot
– graphs the quantiles of one univariate distribution against the corresponding
quantiles of another.
• Scatter plot
– each pair of values is a pair of coordinates and plotted as points in the plane.
Histogram Analysis
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
• Histogram: graph display of tabulated
frequencies, shown as bars
• It shows what proportion of cases fall into
each of several categories
• Differs from a bar chart in that it is the area of
the bar that denotes the value, not the height
as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as
non-overlapping intervals of some variables. The
categories (bars) must be adjacent.
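The tabulation behind a histogram can be sketched as counting values in adjacent, non-overlapping intervals. The sketch assumes equal-width bins, so bar height and bar area are proportional; the function name `histogram_counts` and the income values are made up for illustration.

```python
from collections import Counter


def histogram_counts(values, start, width):
    """Count values per bin [low, low + width); keys are the lower bin edges."""
    counts = Counter()
    for v in values:
        bin_index = int((v - start) // width)
        counts[start + bin_index * width] += 1
    return dict(sorted(counts.items()))


incomes = [12000, 18000, 35000, 52000, 58000, 91000]  # made-up values
counts = histogram_counts(incomes, start=0, width=20000)
print(counts)
```

Bins with no observations (here 60000 to 80000) simply do not appear in the result; a plot would still draw them as an empty bar so that the bars stay adjacent.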
Histogram Often Tells More than Boxplot
• Two histograms shown
on the right may have the
same boxplot
representation:
– the same values for: min,
Q1, median, Q3, and max.
• But, they have rather
different data distributions
Quantile Plot
• Display all the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order
– fi indicates that approximately 100 fi% of the data are below or equal to
the value xi.
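The points of a quantile plot can be sketched as follows. The slide does not give a formula for fi; the sketch assumes the common convention fi = (i − 0.5)/n for the i-th smallest value, and the function name `quantile_points` and the data are illustrative.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumption: f_i = (i - 0.5) / n for the i-th smallest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_points([40, 10, 30, 20])  # made-up values
for f, x in points:
    print(f"about {100 * f:.1f}% of the data are <= {x}")
```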
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as
points in the plane.
Positively and Negatively Correlated Data
• The left half fragment is positively
correlated
• The right half is negatively correlated
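Positive, negative, and absent correlation can be checked numerically with the sample covariance and the Pearson correlation coefficient. The sketch below is a plain-Python version with made-up data; the function names are illustrative.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with x
falling = [10, 8, 6, 4, 2]   # decreases as x increases
v_shaped = [2, 1, 0, 1, 2]   # no linear relation with x

print(covariance(x, rising), pearson_r(x, rising))
print(covariance(x, falling), pearson_r(x, falling))
print(pearson_r(x, v_shaped))
```

The last case shows that a coefficient near zero only rules out a linear relation; the values still follow a clear pattern.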
Uncorrelated Data
Exercises
Homework
• Do all the exercises.
• You can write the solutions on paper, or use tools such as Excel
or Python. Explain your work in detail, step by step, until you reach
the solution.
• Create a pdf file for your solution and submit it to ULS
• You can upload one more file that you use to do the computation
(.xlsx or .ipynb) along with your .pdf file. Upload those files
separately.
• Note: do not forget to put your Student ID and name on the first page
of the pdf file.
Median Exercise
Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows:
Compute an approximate median value for the data.
Basic Statistics Exercise
Suppose that a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:
a. Calculate the mean, median, and standard deviation of age and %fat.
b. Draw the boxplots for age and %fat.
c. Draw a scatter plot (and optional: q-q plot) based on these two variables
Questions?