Advanced Database and Data Mining
CS-513
Faculty: Dr. Aruna Malik
Know your Data
Chapter 2: Getting to Know Your Data
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents, each represented as a term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia
– Spatial data: maps
– Image data
– Video data
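As a rough illustration of the "term-frequency vector" representation of document data listed above, the sketch below turns a few short documents into count vectors over a shared vocabulary. The function name `term_frequency_vectors`, the whitespace tokenization, and the two sample sentences are assumptions made for this example only.

```python
from collections import Counter

def term_frequency_vectors(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    # Naive tokenization (lowercase + whitespace split); an assumption for this sketch.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct terms across documents.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical documents, invented for illustration.
docs = ["data mining finds patterns in data", "graph data links objects"]
vocabulary, vectors = term_frequency_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each document becomes one row of a data matrix whose columns are terms, so most entries are zero once the vocabulary grows. This is the sparsity characteristic of such data.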
Important Characteristics of Structured Data
• Dimensionality
– Curse of dimensionality
• Sparsity
– Only presence counts
• Resolution
– Patterns depend on the scale
• Distribution
– Centrality and dispersion
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Binary
– Ordinal
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to the more important (usually rarer) outcome, e.g., HIV positive
• Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
– Measured on a scale of equal-sized units
– Values have order
– E.g., temperature in °C or °F, calendar dates
– No true zero-point
• Ratio
– Inherent zero-point
– We can speak of one value as being a multiple of another (10 K is twice as high as 5 K)
– E.g., temperature in Kelvin, length, counts, monetary quantities
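The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below (helper name `celsius_to_kelvin` and the two temperatures are chosen only for illustration) shows that differences are meaningful on both scales, while a ratio computed on the Celsius scale is not.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return c + 273.15

c_low, c_high = 5.0, 10.0

# Differences are meaningful on an interval scale: 10 °C is 5 degrees warmer than 5 °C.
difference = c_high - c_low

# A ratio on the Celsius scale is arithmetic only; 0 °C is not a true zero-point.
naive_ratio = c_high / c_low

# On the Kelvin scale the zero-point is inherent, so this ratio is meaningful.
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(true_ratio, 4))
```

The Celsius ratio of 2 disappears once the values are expressed in kelvin, where 10 °C is only about 1.8% "hotter" than 5 °C.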
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a
collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and
represented using a finite number of digits
– Continuous attributes are typically represented as
floating-point variables
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation and
spread
• Data dispersion characteristics
– Median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
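• Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: the mean computed after chopping off extreme values at both ends
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all lower intervals, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is its width
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula (for moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$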
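The measures of central tendency listed above can be sketched with Python's standard `statistics` module. The helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`), the salary values, and the grouped-data intervals are assumptions made for this example. The grouped median follows the usual interpolation rule: lower boundary of the median interval plus the fraction of that interval needed to reach n/2.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)          # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

def grouped_median(lower_bounds, width, freqs):
    """Interpolated median for equal-width grouped data."""
    n = sum(freqs)
    cumulative = 0                            # total frequency of all lower intervals
    for lower, freq in zip(lower_bounds, freqs):
        if cumulative + freq >= n / 2:        # this is the median interval
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

# Illustrative salary values (in thousands), not taken from the slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean        ", statistics.mean(salaries))
print("median      ", statistics.median(salaries))
print("modes       ", statistics.multimode(salaries))
print("trimmed 10% ", trimmed_mean(salaries, 0.10))
print("weighted    ", weighted_mean([70, 80, 90], [1, 2, 1]))
print("grouped med ", grouped_median([0, 10, 20], 10, [2, 5, 3]))
```

The single large value 110 pulls the mean above the median, which is why the trimmed mean and the median are often preferred for skewed data.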
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed and negatively skewed data
[Figure: panels labeled "positively skewed" and "negatively skewed"]
* Data Mining: Concepts and Techniques
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population)
– Standard deviation s (or σ) is the square root of variance s² (or σ²)
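A minimal sketch of these dispersion measures, using only the standard library. Quartile conventions differ between textbooks and tools; the sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3. The function names and the salary values are illustrative assumptions.

```python
import math
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartile interpolation."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Illustrative salary values (in thousands), not taken from the slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

sample_var = statistics.variance(salaries)        # divides by n - 1
population_var = statistics.pvariance(salaries)   # divides by N
sample_std = statistics.stdev(salaries)           # square root of the sample variance

print("five-number summary:", five_number_summary(salaries))
print("outliers:", iqr_outliers(salaries))
print("s^2 =", round(sample_var, 2), " sigma^2 =", round(population_var, 2))
```

With these values the upper fence is Q3 + 1.5 × IQR, and only the largest salary falls beyond it.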
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
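– Whiskers: two lines outside the box, extended to the minimum and maximum (or, when outliers are plotted separately, to the most extreme values within the outlier threshold)
– Outliers: points beyond a specified outlier threshold, plotted individually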
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: the x-axis shows values, the y-axis shows frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
Histogram Analysis
• Histogram: graphical display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
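The point about area versus height can be made concrete with bins of unequal width. In the sketch below (function names, bin edges, and data are assumptions for illustration), each bar's height is the density count / (total × width), so the bar areas, not the heights, add up to 1.

```python
def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return counts

def densities(counts, edges):
    """Bar heights chosen so that each bar's area equals the proportion of cases in its bin."""
    total = sum(counts)
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

# Hypothetical data and adjacent, non-overlapping bins of unequal width.
values = [1, 2, 2, 3, 5, 8, 9, 12, 15, 20]
edges = [0, 5, 10, 20]

counts = histogram(values, edges)
heights = densities(counts, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]

print("counts :", counts)
print("heights:", heights)
print("areas  :", areas)
```

The last two bins hold the same number of values, but the wider bin gets a bar half as tall, which is exactly the distinction from a bar chart.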
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
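The slide does not say how f_i is computed. One common convention, assumed here, is f_i = (i − 0.5) / n for the i-th smallest of n values. The sketch below produces the (f_i, x_i) pairs that a quantile plot would draw; the function name and sample values are hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, with f_i = (i - 0.5) / n for the i-th smallest value."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unit prices, unsorted on purpose.
points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f"about {100 * f:.1f}% of the data are <= {x}")
```

Plotting x_i against f_i shows every data value, so both the overall shape and unusual values remain visible.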
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
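A q-q comparison can be sketched by computing the same quantiles for both samples and pairing them. The branch prices below are invented for illustration (the slide's actual data are not available), and the "inclusive" method of `statistics.quantiles` is one of several possible quantile conventions.

```python
import statistics

def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair up corresponding quantiles of two univariate samples."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices at two branches.
branch_1 = [40, 45, 52, 60, 68, 75, 80, 90, 100]
branch_2 = [48, 55, 60, 70, 78, 82, 95, 105, 120]

points = qq_points(branch_1, branch_2)
# If every point lies above the line y = x, Branch 2 prices are shifted upward.
shifted_up = all(b > a for a, b in points)

print(points)
print("Branch 2 higher at every quantile:", shifted_up)
```

Points above the diagonal y = x mean the second distribution has larger values at that quantile, which is how the shift between branches is read off the plot.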
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in
the plane
Positively and Negatively Correlated Data
• The left half of the figure is positively correlated
• The right half is negatively correlated
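The slides show correlation only graphically. One common way to quantify it, used here as an assumption rather than as the slides' method, is Pearson's correlation coefficient: positive when y tends to rise with x, negative when it tends to fall. The data below are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bivariate samples.
xs = [1, 2, 3, 4, 5]
rising = [2, 3, 5, 4, 6]     # tends to increase with x
falling = [6, 4, 5, 3, 2]    # tends to decrease with x

r_pos = pearson_r(xs, rising)
r_neg = pearson_r(xs, falling)
print(round(r_pos, 3), round(r_neg, 3))
```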
Uncorrelated Data
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
• Data matrix
– n data points with p dimensions
– Two modes
• Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode
Dissimilarity between Binary Variables
• Example
– Gender is a symmetric attribute
– The remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N be 0
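The example's table did not survive extraction, so the sketch below uses two hypothetical patient records, already coded as 1 (Y or P) and 0 (N). It assumes the standard contingency-count definitions: with q = both 1, r = first only, s = second only, and t = both 0, the symmetric dissimilarity is (r + s) / (q + r + s + t) and the asymmetric one drops t, since shared negatives carry little information.

```python
def binary_dissimilarity(a, b, symmetric=True):
    """Dissimilarity between two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both 0
    # Asymmetric attributes ignore the negative matches t.
    denom = q + r + s + (t if symmetric else 0)
    return (r + s) / denom if denom else 0.0

# Hypothetical asymmetric binary attributes (e.g., symptoms and test results).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

d_asym = binary_dissimilarity(patient_a, patient_b, symmetric=False)
d_sym = binary_dissimilarity(patient_a, patient_b, symmetric=True)
print(round(d_asym, 2), round(d_sym, 2))
```

The asymmetric value is larger because the three shared negatives are not counted as evidence of similarity.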
Example: Data Matrix and Dissimilarity Matrix
[Figure: a data matrix and the corresponding dissimilarity matrix (with Euclidean distance)]
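Since the figure is not available, here is a minimal sketch of how a dissimilarity matrix could be derived from a data matrix with Euclidean distance. The four 2-D points and the function name are illustrative assumptions. Only the lower triangle is stored, because d(i, j) = d(j, i) and d(i, i) = 0.

```python
import math

def dissimilarity_matrix(points):
    """Lower-triangular matrix of pairwise Euclidean distances (row i has i + 1 entries)."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]

# Illustrative data matrix: n = 4 objects, p = 2 attributes.
data = [(1, 2), (3, 5), (2, 0), (4, 5)]

matrix = dissimilarity_matrix(data)
for row in matrix:
    print("  ".join(f"{d:.2f}" for d in row))
```

The data matrix is n × p (objects by attributes, two modes), while the result is n × n over objects only (single mode).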