Data Types
• Data sets are made up of data objects
• Data objects are also referred to as
• Samples
• Examples
• Instances
• Data points
• Objects
• Tuples
• Data objects are described by attributes
• Rows in a database correspond to data objects
• Columns correspond to attributes
• What is an attribute?
• A data field
• Represents a characteristic/feature of a data object
• Synonyms
• Dimension (used in data warehousing)
• Feature (used in machine learning)
• Variable (used by statisticians)
• Data mining and database professionals use the term attribute
• Eg: attributes describing a customer (Customer_ID, name, address)
• The set of attributes that describes a given object is called
• An attribute vector or feature vector
• If the distribution of data involves only one attribute (e.g., height), it is univariate
• If the distribution of data involves two attributes (e.g., height, weight), it is bivariate
• Types of attributes
• Nominal
• Binary
• Ordinal
• Numeric
Nominal
• Means ‘relating to names’
• The values are symbols or names of things
• Each value represents
• Category
• Code
• State
• Also called ‘categorical’
• In computer science, it is called ‘enumerations’
Example 2.1. Nominal attributes
• Attribute name: hair_color
• Attribute values: black, brown, blond, red, gray, white
• Attribute name: marital_status
• Attribute values: single, married, divorced, widowed
• Attribute name: occupation
• Attribute values: teacher, dentist, programmer, farmer, etc.
• Symbols/names can also be represented with numbers
• Attribute name: hair_color
• Attribute values: black(0), brown(1), blond(2), red(3), gray(4), white(5)
• Attribute name: customer_id
• Attribute values: 1, 2, 3, etc.
• The numbers are not to be used quantitatively
• Mathematical operations on values of nominal attributes are not meaningful
• It makes no sense to subtract one customer_id from another
• A nominal attribute may have integers as values
• It is still not considered a numeric attribute
• This is because nominal values have no meaningful order and are not quantitative
• It makes no sense to find the mean or median of a nominal attribute
• What is of interest is the most commonly occurring value, that is, the mode
• The mode is a measure of central tendency
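As a minimal sketch of the point above, the following Python snippet treats integer codes for hair_color as labels and reports the mode. The observed codes are illustrative assumptions, not data from the slides.

```python
from collections import Counter

# Coding taken from the slide: black(0), brown(1), blond(2), red(3), gray(4), white(5)
HAIR_COLOR = {0: "black", 1: "brown", 2: "blond", 3: "red", 4: "gray", 5: "white"}

# Hypothetical observations (assumed for illustration only)
observed_codes = [0, 1, 1, 2, 0, 1, 5, 3, 1, 0]

# Counting occurrences is meaningful for nominal data; averaging the codes is not.
counts = Counter(observed_codes)
mode_code, mode_count = counts.most_common(1)[0]
mode_label = HAIR_COLOR[mode_code]
print(mode_label, mode_count)
```

The codes are only used as dictionary keys and counted; no arithmetic is performed on them.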
Binary Attributes
• A nominal attribute with only two categories or states: 0 or 1
• 0 typically means that the attribute is absent
• 1 means that it is present
• Binary attributes are referred to as ‘Boolean’ if the two states correspond to
• True/false
Example 2.2: Binary Attributes
• Attribute name: smoker
• Attribute value: 1 (indicates that the patient smokes)
• Attribute value: 0 (indicates that the patient does not)
• Attribute name: medical_test
• Attribute value: 1 (result of the test is positive)
• Attribute value: 0 (result of the test is negative)
• A binary attribute is symmetric if
• Both states are equally important
• Both states carry the same weight
• Eg:
• Attribute name: gender
• Attribute values: 1 (male), 0 (female)
• A binary attribute is asymmetric if
• The two states are not equally important
• One state is usually rarer or more significant than the other
• By convention, the rarer or more significant state is coded as 1, the other as 0
• Eg:
• Attribute name: HIV
• Attribute values: 1 (HIV positive), 0 (HIV negative)
Ordinal Attributes
• Have a meaningful order/ranking among the values
• The magnitude between successive values is not known
• Example 2.3: ordinal attributes
• Attribute Name: dress_size
• Attribute values: S, M, L, XL, XXL
• Attribute Name: grade
• Attribute values: S, A+, A, B+, B, etc.
• Attribute Name: professional_rank
• Attribute values: assistant, associate, professor
• Attribute name: sat_survey
• Attribute values: very dissatisfied, dissatisfied, neutral, satisfied, and very satisfied
• The values have a meaningful sequence
• We cannot tell from the values “how much bigger” one value is than another
• Ordinal attributes may also be obtained by discretizing numeric quantities
• By splitting the value range into a finite number of ordered categories
• Central tendency of an ordinal attribute can be represented by
• Mode
• Median
• Nominal, binary and ordinal attributes are qualitative
• Numeric Attributes
• Quantitative
• Represented in integer or real values
• Numeric attributes can be interval-scaled or ratio-scaled
• Interval-scaled attributes
• Measured on a scale of equal-size units
• Values have an order
• Can be positive, 0, or negative
• Allow us to compare and quantify the difference between values
• Example 2.4. Interval-scaled attributes
• The value ‘zero’ does not mean ‘nothing’; it is only a reference point
• Can calculate differences between values
• Can compute mean, median and mode
• Temperature in °C: 0°C does not mean "no temperature"; it is just a reference point
• The difference between 10°C and 20°C (10°C) is meaningful
• But you cannot say 20°C is "twice as hot" as 10°C
• Clock time: the difference between 3:00 and 4:00 is 1 hour
• But saying "4:00 is twice as much as 2:00" does not make sense
• Ratio-scaled attributes
• The zero value means ‘nothing’ or ‘absence of the quantity’ (a true zero point)
• Can calculate both differences & ratios
• Can compute mean, median, mode & multiplicative measures like ratios
• Example
• 0 K (absolute zero) is a true zero point for temperature
• 0 kg means no weight
• 100 kg is twice as heavy as 50 kg
• $0 means no money; $100 is 100 times as much as $1
• Discrete vs. continuous attributes
• Discrete attribute
• Has a finite or countably infinite number of values
• The values may or may not be represented as integers
• Eg: hair_color has a finite number of values and is thus discrete
Statistics of Data
• Basic statistical methods
• Identify properties of the data
• Highlight which data values should be treated as noise/outliers
• Measures of central tendency
• Mean
• Median
• Mode
• Midrange
• Dispersion of data
• Range
• Quartiles
• Interquartile range
• Five number summary & boxplots
• Variance
• Standard deviation
• Description of relations among multiple variables
• For numerical data
• Co-variance
• Correlation coefficient
• For nominal data
• χ² (chi-square) correlation test
• Statistical/graphical data presentation software packages include
• Bar charts
• Pie charts
• Line graphs
• Graphic displays of data summaries and distributions include
• Quantile plots
• Quantile-quantile plots
• Histograms
• Scatter plots
Measuring the central tendency
Mean
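• Most common measure of the ‘center’ of a set of data
• $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$
• Where,
• $x_1, x_2, \ldots, x_N$ is the set of N values/observations
• Corresponds to the built-in aggregate function average (avg() in SQL)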
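Weighted arithmetic mean or the weighted average
• Each value $x_i$ is associated with a weight $w_i$
• $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$
• Eg:
• Numerator: $(50 \times 0.10)+(60 \times 0.05)+(70 \times 0.15)+(80 \times 0.20)+(90 \times 0.25)+(85 \times 0.10)+(75 \times 0.05)+(95 \times 0.05)+(65 \times 0.03)+(55 \times 0.02) = 77.05$
• Denominator: $0.10+0.05+0.15+0.20+0.25+0.10+0.05+0.05+0.03+0.02 = 1.00$
• $\bar{x} = \frac{77.05}{1.00} = 77.05$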
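The weighted-average arithmetic can be checked with a short Python sketch. The values and weights are the ones from the example; the function name weighted_mean is an illustrative choice.

```python
# Values and weights taken from the worked example
values = [50, 60, 70, 80, 90, 85, 75, 95, 65, 55]
weights = [0.10, 0.05, 0.15, 0.20, 0.25, 0.10, 0.05, 0.05, 0.03, 0.02]

def weighted_mean(xs, ws):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

result = weighted_mean(values, weights)
print(round(result, 2))
```

Dividing by the sum of the weights keeps the function correct even when the weights do not add up to exactly 1.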
• Problem with mean
• Sensitive to outliers: a few extreme values can distort the mean
• Salary drawn by ten employees
Trimmed mean
• The mean obtained after chopping off values at the high and low extremes
• Should avoid trimming too large a portion (such as 20%) at both ends
• This may result in the loss of valuable information
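A minimal sketch of a trimmed mean, assuming the same proportion is cut from each end of the sorted data. The salary figures are hypothetical (the slide's salary table is not available), and trimmed_mean is an illustrative helper name.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each end of the sorted data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical salaries in thousands; the last value is an extreme outlier
salaries = [30, 31, 32, 33, 34, 35, 36, 37, 38, 400]

plain = sum(salaries) / len(salaries)   # pulled upward by the outlier
trimmed = trimmed_mean(salaries, 0.10)  # drops one value at each end
print(plain, trimmed)
```

With ten values and a 10% trim, one value is removed from each end, so the outlier no longer affects the result.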
Median
• Middle value in a set of ordered data values
• Separates the higher half from lower half
• For skewed (asymmetric) data,
• The median is a better measure of the center of the data
• Median is applied to
• Numerical data
• Ordinal data
Suppose,
• A given dataset of N values for an attribute X is sorted in ascending order: $x_1 \le x_2 \le \cdots \le x_N$
• If N is odd
• The median is the middle value, $x_{(N+1)/2}$
• If N is even
• The median is
• The average of the two middlemost values, $\frac{x_{N/2} + x_{N/2+1}}{2}$
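The odd/even rule for the median can be sketched in Python as follows. The input lists are illustrative assumptions; the function sorts its input, so callers do not need to pre-sort.

```python
def median(values):
    """Middle value for odd N; average of the two middlemost values for even N."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([7, 1, 5])       # sorted: 1, 5, 7
even_case = median([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
print(odd_case, even_case)
```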
Median
• Expensive to compute for a large number of observations
• If the data are grouped into intervals with known frequencies,
• The median can be approximated by interpolation
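The slides do not state the interpolation formula. A common choice, assumed here, is median ≈ L1 + ((N/2 − freq_below) / freq_median) × width, where L1 is the lower boundary of the interval containing the median, freq_below is the total frequency of all lower intervals, and freq_median and width describe the median interval. The intervals below are hypothetical.

```python
def grouped_median(intervals):
    """Approximate the median from (lower_bound, width, frequency) triples.

    The triples must be sorted by lower_bound. Uses linear interpolation
    inside the interval that contains the N/2-th observation.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0
    for lower, width, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

# Hypothetical grouped data: (lower bound, interval width, frequency)
groups = [(0, 10, 5), (10, 10, 10), (20, 10, 20), (30, 10, 15)]
approx_median = grouped_median(groups)
print(approx_median)
```

Here N = 50, so the median falls in the interval starting at 20, which has 15 observations below it.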
Mode
• Another measure of central tendency
• The value that occurs most frequently in the data set
• Can be used for qualitative & quantitative attributes
• Data sets with one, two, or three modes are called, respectively,
• Unimodal
• Bimodal
• Trimodal
• In general, a data set with two or more modes is multimodal
• Example 2.8. Mode
• The modes are 52 and 70; hence, the data are bimodal
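The data for this example did not survive in the slides. The sketch below uses an assumed data set, chosen so that its modes are 52 and 70, and classifies it by the number of modes using statistics.multimode (Python 3.8+).

```python
from statistics import multimode

# Assumed data (not from the slides), chosen so that 52 and 70 each occur twice
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

modes = multimode(data)  # every value that shares the highest frequency
kind = {1: "unimodal", 2: "bimodal", 3: "trimodal"}.get(len(modes), "multimodal")
print(modes, kind)
```

Note that multimode returns every value when all values occur exactly once, so a data set with no repeated values needs separate handling.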
• For unimodal numeric data that are moderately skewed, we have the following empirical relation
• $mean - mode \approx 3 \times (mean - median)$
Midrange
• Also used to assess the central tendency of a numeric data set
• Is the average of the largest and smallest values in the set
• Easy to compute using SQL aggregate functions min(), max()
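A minimal Python sketch of the midrange, mirroring the min()/max() computation; the data values are illustrative assumptions.

```python
def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

# Assumed data for illustration
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
result = midrange(data)
print(result)
```

Because only the two extreme values are used, a single outlier shifts the midrange considerably.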
[Figure: mean, median and mode of (a) symmetric data, (b) positively skewed (right-skewed) asymmetric data, (c) negatively skewed (left-skewed) asymmetric data]
Measuring the dispersion of data
• Measures the dispersion or spread of numeric data
• Measures include
• Range
• Quantiles
• Percentiles
• Interquartile range
• Variance
• Standard deviation
• The five-number summary can be displayed as
• Boxplot
• Useful in identifying outliers
Range, quartiles and interquartile range
• Range
• Refers to the difference between the largest and smallest values in the dataset
• Quantiles
• Split the data distribution into essentially equal-size consecutive sets
• Quantiles are points taken at regular intervals of a data distribution
• The kth q-quantile is the value x such that
• At most k/q of the data values are less than x
• At most (q-k)/q of the data values are more than x
• Where k is an integer such that 0<k<q
• There are q-1 q-quantiles
• The 2-quantile is the data point
• That divides the lower and upper halves of the data distribution; it corresponds to the median
• The 4-quantiles (commonly referred to as quartiles)
• Are the three data points that split the data distribution into four equal parts
• The 100-quantiles (commonly referred to as percentiles)
• Divide the data distribution into 100 equal-size consecutive sets
• Median, quartiles, percentiles are the most widely used forms of quantiles
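As a rough illustration, Python's statistics.quantiles returns the q-1 cut points for a chosen q. The data are assumed for illustration, and the exact cut points depend on the interpolation convention (the library default, method="exclusive", is used here).

```python
import statistics

# Assumed data for illustration
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

quartiles = statistics.quantiles(data, n=4)      # 3 cut points: Q1, Q2, Q3
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

# The second quartile and the 50th percentile both coincide with the median.
second_quartile = quartiles[1]
fiftieth_percentile = percentiles[49]
print(quartiles, second_quartile, fiftieth_percentile)
```

The lengths of the two results (3 and 99) match the statement that there are q-1 q-quantiles.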
Interquartile Range (IQR)
• Distance between first and third quartiles
• Measure of spread
• Gives the range covered by the middle half of the data
• Can be defined as
• $IQR = Q_3 - Q_1$
• Common rule of thumb for identifying suspected outliers
• Any data point smaller than Q1-1.5*IQR is considered a suspected outlier
• Any data point larger than Q3+1.5*IQR is considered a suspected outlier
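The 1.5×IQR rule of thumb can be sketched as follows. The data are assumed for illustration, and quartiles are computed with statistics.quantiles using method="inclusive"; other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values):
    """Return (IQR, suspected outliers) using the 1.5 * IQR rule of thumb."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    suspects = [v for v in values if v < low_fence or v > high_fence]
    return iqr, suspects

# Assumed data for illustration
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
iqr, suspects = iqr_outliers(data)
print(iqr, suspects)
```

For this data Q1 = 49.25 and Q3 = 64.75 under the inclusive convention, so the fences are 26.0 and 88.0 and only 110 is flagged.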
Five-number summary, boxplots and outliers
• Q1, median and Q3 together contain no information about end points (e.g. tails)
• Full summary of shape is obtained by providing lowest & highest data values as well
• Known as five-number summary
• The five-number summary of a distribution consists of, in order,
• Minimum
• Q1
• Median (Q2)
• Q3
• Maximum
• Boxplots
• Popular way of visualizing a distribution
• Incorporates five-number summary
• Ends of the box are at the quartiles, so that the box length is the IQR
• The median is marked by a line within the box
• Two lines (called whiskers) outside the box extend to
• The smallest & largest observations
• Whiskers are extended to the extreme low and high observations
• Only if
• These values are less than 1.5*IQR beyond the quartiles
• Otherwise,
• Whiskers terminate at the most extreme observations occurring within 1.5*IQR of the quartiles
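A small sketch of the five-number summary and the whisker end points described above. The data and helper names are assumptions for illustration; quartiles again use the "inclusive" convention of statistics.quantiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def whisker_ends(values):
    """Most extreme observations lying within 1.5 * IQR of the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    reach = 1.5 * (q3 - q1)
    inside = [v for v in values if q1 - reach <= v <= q3 + reach]
    return min(inside), max(inside)

# Assumed data for illustration
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(data)
whiskers = whisker_ends(data)
print(summary, whiskers)
```

The upper whisker stops at 70 rather than at the maximum, because 110 lies more than 1.5×IQR above Q3 and would be plotted individually.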
Variance and Standard Deviation
• Variance and standard deviation are measures of data dispersion
• Indicate how spread out a data distribution is
• A low standard deviation means
• that the data observations tend to be very close to mean
• A high standard deviation indicates
• That the data are spread out over a large range of values
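• The variance of N observations $x_1, x_2, \ldots, x_N$ for a numeric attribute X is
• $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$
• Where,
• $\bar{x}$ is the mean value of the observations
• The standard deviation, $\sigma$, is the square root of the variance, $\sigma^2$
• Basic properties of the standard deviation, $\sigma$
• Measures the spread about the mean
• Should be considered only when the mean is chosen as the measure of center
• $\sigma = 0$ only when there is no spread, that is, when all observations have the same value
• $\sigma > 0$ otherwise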
• A small standard deviation means most data points are close to the mean
• A large standard deviation means data points are spread out
• An observation is unlikely to be more than several standard deviations away from mean
• Chebyshev’s inequality shows that
• At least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data points are within k standard deviations of the mean
• Where k is any number greater than 1
• Example
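As a minimal sketch, the following Python code computes the variance with a 1/N divisor, takes its square root, and compares the observed share of points within k standard deviations against the Chebyshev lower bound. The data set is an illustrative assumption.

```python
import math

def variance(values):
    """Variance with a 1/N divisor: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def chebyshev_bound(k):
    """Minimum fraction of points within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of points within k standard deviations of the mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(variance(values))
    return sum(1 for x in values if abs(x - mean) <= k * sd) / len(values)

# Assumed data for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(data)
sd = math.sqrt(var)
bound = chebyshev_bound(2)
observed = fraction_within(data, 2)
print(var, sd, bound, observed)
```

The bound is a guarantee for any distribution, so the observed fraction is usually well above it.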
Covariance and Correlation Analysis
• In probability theory and statistics
• Correlation & covariance are two similar measures for assessing
• How much two attributes change together
Covariance of numeric data
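• Consider two numeric attributes A and B
• And a set of n real-valued observations $\{(a_1, b_1), \ldots, (a_n, b_n)\}$
• The mean values of A & B, $\bar{A}$ and $\bar{B}$,
• Are also known as the expected values of A and B: $E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$ and $E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$
• Covariance between A and B is defined as
• $Cov(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$
• Can also be shown as
• $Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$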
• If, whenever a value $a_i$ of A is larger than $\bar{A}$,
• The value $b_i$ of attribute B is likely to be larger than $\bar{B}$,
• Then the attributes A and B tend to change together
• Therefore, the covariance between A and B is positive
• If one of the attributes tends to be above its expected value
• When the other attribute is below its expected value,
• Then the covariance of A and B is negative
• If A and B are independent (they do not have correlation)
• Their values do not influence each other
• Which can be mathematically denoted as
• $E(A \cdot B) = E(A) \cdot E(B)$, which means $Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B} = E(A) \cdot E(B) - \bar{A}\bar{B} = 0$
• Thus, when A and B are independent, their covariance is always zero
• The converse does not hold: A and B may have 0 covariance and still show some dependence
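• Variance is a special case of covariance
• Where the two attributes are identical
• If two attributes A and B are the same, $Cov(A,A) = E\big((A - \bar{A})(A - \bar{A})\big)$
• This simplifies to $Cov(A,A) = E\big((A - \bar{A})^2\big) = \sigma_A^2$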
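The covariance definition, the E(A·B) shortcut, and the variance special case can be checked numerically with a short sketch. The two lists of observations are illustrative assumptions, and the function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(a, b):
    """Cov(A, B) as the average product of deviations from the means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)

def covariance_shortcut(a, b):
    """Cov(A, B) computed as E(A * B) - mean(A) * mean(B)."""
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)

# Assumed paired observations for illustration
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]

cov_ab = covariance(a, b)
cov_ab_short = covariance_shortcut(a, b)
cov_aa = covariance(a, a)  # equals the variance of A
print(cov_ab, cov_ab_short, cov_aa)
```

The positive result reflects that the two attributes tend to rise and fall together in this sample.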
Correlation coefficient for numeric data
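• Evaluates the correlation between two numeric attributes, A and B
• Also known as Pearson’s product moment coefficient
• $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
• Where $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B, and $-1 \le r_{A,B} \le +1$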
• The standard deviation, $\sigma$, is the square root of the variance, $\sigma^2$
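A minimal sketch of Pearson's coefficient as covariance divided by the product of the two standard deviations, all with a 1/n divisor. The observations are assumed, chosen so that the relationships are perfectly linear.

```python
import math

def pearson(a, b):
    """Correlation coefficient: Cov(A, B) / (sigma_A * sigma_B)."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - a_bar) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - b_bar) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)

# Assumed observations for illustration
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # rises exactly with a
c = [10, 8, 6, 4, 2]   # falls exactly as a rises

r_pos = pearson(a, b)
r_neg = pearson(a, c)
print(r_pos, r_neg)
```

The function is undefined when either attribute is constant, because its standard deviation is then zero.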
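χ² correlation test for nominal data
• For nominal data,
• Correlation between two attributes A and B can be discovered by a χ² (chi-square) test
• $\chi^2 = \sum_{i}\sum_{j}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$, summed over all cells of the contingency table
• $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$
• $e_{ij}$ is the expected frequency of $(A_i, B_j)$: $e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$, where n is the number of data tuples
• χ² tests the hypothesis that A and B are independent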
• For a 2×2 contingency table (here, gender vs. preferred_reading),
• The degrees of freedom are (2-1)×(2-1)=1
• For 1 degree of freedom,
• The χ² value needed to reject the hypothesis
• At the 0.001 significance level is 10.828
• Since the computed χ² value exceeds this threshold, we can reject the hypothesis that gender and preferred_reading are independent
• Conclude that the two attributes are correlated
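The contingency table for this example did not survive in the slides. The sketch below uses assumed counts for a 2×2 gender by preferred_reading table and computes the statistic from observed and expected frequencies; the threshold 10.828 is the value quoted above.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Assumed counts: rows = fiction / non-fiction, columns = male / female
observed = [[250, 200],
            [50, 1000]]

chi2, dof = chi_square(observed)
reject_independence = chi2 > 10.828  # 0.001 significance level, 1 degree of freedom
print(round(chi2, 2), dof, reject_independence)
```

With these counts the expected frequencies are 90, 360, 210 and 840, and the statistic is far above the threshold.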
Graphic displays of basic statistics of data
• Graphic displays of basic statistical descriptions include
• Quantile plots
• Quantile-quantile plots
• Histograms
• Scatter plots
• These graphs are helpful for the visual inspection of data
• Useful for preprocessing
Quantile plot
• Simple and effective way to look at a univariate data distribution
• Displays all of the data for the given attribute
• Plots quantile information
• Let $x_i$, for i=1 to N, be the data sorted in ascending order
• Each observation $x_i$ is paired with a percentage $f_i$
• Indicating that approximately $f_i \times 100\%$ of the data are below the value $x_i$
• $f_i = \frac{i - 0.5}{N}$
• These numbers increase in equal steps of 1/N
• Ranging from $\frac{1}{2N}$ to $1 - \frac{1}{2N}$
[Figure: a quantile plot for the unit price data]
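The pairing of each sorted observation with its f-value can be sketched as below, assuming f_i = (i − 0.5)/N, which matches the stated range of 1/(2N) to 1 − 1/(2N). The unit prices are hypothetical; the resulting (f, x) pairs are what a plotting library would draw.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unit prices
unit_prices = [74, 40, 47, 75, 43]
points = quantile_plot_points(unit_prices)
for f, x in points:
    print(f"{f:.2f}  {x}")
```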
Quantile-quantile plot (q-q plot)
• Graphs quantiles of one univariate distribution against another
• Powerful visualization tool
• Allows user to view whether there is a shift in going from one distribution to another
• Suppose we have two sets of observations for the attribute/variable ‘Unit price’
• Taken from two different branch locations
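• Let
• $x_1, \ldots, x_N$ be the data from the first branch
• $y_1, \ldots, y_M$ be the data from the second, where each data set is sorted in ascending order
• If M=N, then we simply plot $y_i$ against $x_i$,
• Where $y_i$ and $x_i$ are both $(i - 0.5)/N$ quantiles of their respective data sets
[Figure: a q-q plot for unit price data from two branches of the online store]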
Histograms
• ‘Histos’ means pole or mast
• ‘Gram’ means chart
• A histogram is therefore a chart of poles
• The range of values for X is partitioned into a set of disjoint consecutive subranges
• According to the number of poles desired
• Subranges are referred to as buckets or bins
• The range of a bucket is known as its width
• Eg: a price attribute with a value range of $1-$200 can be partitioned into subranges
• 1-20, 21-40, 41-60, …, 181-200
• For each subrange, a bar is drawn with a height
• That represents the total count of items observed within the subrange
• The shape of the histogram reveals data skew
• Different from a bar chart
• Which is used to represent a set of categorical data & the size of each category
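The bucket counts behind such a histogram can be sketched as follows, assuming integer prices and equal-width buckets 1-20, 21-40, …, 181-200. The price list is hypothetical.

```python
def bucket_counts(values, low=1, high=200, width=20):
    """Count integer values per equal-width bucket: [low, low+width-1], and so on."""
    n_buckets = (high - low + 1) // width
    counts = [0] * n_buckets
    for v in values:
        if low <= v <= high:          # values outside the range are ignored
            counts[(v - low) // width] += 1
    return counts

# Hypothetical prices in dollars
prices = [5, 12, 20, 21, 39, 40, 41, 150, 181, 200]
counts = bucket_counts(prices)

for k, c in enumerate(counts):
    start = 1 + 20 * k
    print(f"{start}-{start + 19}: {'#' * c}")
```

Each count is the bar height for its bucket; the text bars printed at the end are only a crude stand-in for a real chart.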
Scatter plots and data correlation
• Used to determine whether there is a
• Relationship
• Pattern
• Trend
• Between two numeric attributes
• Each pair of values is treated as a pair of coordinates and plotted as a point
• Useful method for providing a first look at bivariate data to see
• Cluster of points
• Outliers
• Explore the possibility of correlation relationships
• If two attributes are correlated,
• Knowledge of one attribute helps to predict the other attribute
• Correlations can be positive, negative, or null (no correlation)
• Can be extended to n attributes, resulting in a scatter-plot matrix
Similarity and Distance Measures
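• Both are referred to as measures of proximity
• A similarity measure for two objects i and j typically returns
• 0 if the objects are completely unalike
• 1 if the objects are identical
• A dissimilarity measure for two objects i and j typically returns
• 0 if the objects are the same
• 1 if the objects are completely dissimilar (when scaled to [0,1]); in general, a larger value means the objects differ more
• Two types of data structures are commonly used in data mining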
• Data matrix (used to store data objects)
• Dissimilarity matrix(used to store dissimilarity values for pair of objects)
Data matrix vs. Dissimilarity matrix
• An object can be described by multiple attributes
• Data matrix (object-by-attribute)
• Stores the n data objects in the form of a relational table or an n-by-p matrix (n objects × p attributes)
• Each row corresponds to an object
• Each column corresponds to an attribute
• Dissimilarity Matrix (object-by-object structure)
• Stores a collection of proximities
• Available for all pairs of n objects
• It is often represented by an n-by-n table
• Where d(i,j) is the dissimilarity between objects i and j
• d(i,j) is a non-negative number
• Close to 0 when objects i and j are highly similar or ‘near’ each other
• Becomes larger the more they differ
• d(i,i)=0
• d(i,j)=d(j,i)
• Measures of similarity can often be expressed as a function of measures of dissimilarity
• Eg: $sim(i,j) = 1 - d(i,j)$
• Where sim(i,j) is the similarity between two objects i and j
• A data matrix is made up of two kinds of entities
• Rows (objects)
• Columns (attributes)
• Thus, it is called a two-mode matrix
• A dissimilarity matrix contains only one kind of entity (dissimilarity values)
• Thus, it is called a one-mode matrix
Proximity measures for nominal attributes
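• Dissimilarity between two objects i and j can be computed as
• $d(i,j) = \frac{p - m}{p}$
• Where,
• m is the number of matches (attributes for which i and j are in the same state)
• p is the total number of attributes
• Similarly, similarity can be computed as
• $sim(i,j) = 1 - d(i,j) = \frac{m}{p}$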
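A minimal sketch of the matching-based measure, assuming d(i,j) = (p − m)/p with m the number of matching attributes and p the total number of attributes. The two objects and the attribute order are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects: (hair_color, marital_status, occupation)
person_1 = ("black", "single", "teacher")
person_2 = ("black", "married", "teacher")

d = nominal_dissimilarity(person_1, person_2)
sim = 1 - d
print(round(d, 3), round(sim, 3))
```

Only equality is tested, which is the one comparison that is meaningful for nominal values.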
Proximity measures for binary attributes
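• If all binary attributes have the same weight, a 2×2 contingency table is used
Count   Object i   Object j
q       1          1
r       1          0
s       0          1
t       0          0
• Where,
• q, r, s and t are the numbers of attributes with the indicated combination of values for objects i and j
• p=q+r+s+t is the total number of attributes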
• In symmetric binary attributes
• Each state is equally valuable
• Dissimilarity that is based on symmetric binary attributes is called
• Symmetric binary dissimilarity
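• If objects i and j are described by symmetric binary attributes, then the dissimilarity between i & j is
• $d(i,j) = \frac{r + s}{q + r + s + t}$
• For asymmetric binary attributes, the two states are not equally important,
• Positive (1)
• Negative (0)
• Agreement of two 1s (a positive match) is considered more significant than agreement of two 0s (a negative match)
• Such attributes are often considered ‘monary’ (having one state)
• Dissimilarity based on these attributes is called asymmetric binary dissimilarity; the negative matches t are ignored
• $d(i,j) = \frac{r + s}{q + r + s}$
• Asymmetric binary similarity between the objects i and j can be computed as
• $sim(i,j) = \frac{q}{q + r + s} = 1 - d(i,j)$
• sim(i,j) is called the Jaccard coefficient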
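The binary measures can be sketched from the q, r, s, t counts as below, assuming symmetric dissimilarity (r+s)/(q+r+s+t), asymmetric dissimilarity (r+s)/(q+r+s), and Jaccard similarity q/(q+r+s). The two patient records are hypothetical.

```python
def binary_counts(obj_i, obj_j):
    """Return (q, r, s, t): counts of the 1/1, 1/0, 0/1 and 0/0 attribute pairs."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    t = sum(1 for a, b in pairs if a == 0 and b == 0)
    return q, r, s, t

def symmetric_dissimilarity(obj_i, obj_j):
    q, r, s, t = binary_counts(obj_i, obj_j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(obj_i, obj_j):
    # Negative matches (t) are ignored; undefined if q + r + s == 0.
    q, r, s, _ = binary_counts(obj_i, obj_j)
    return (r + s) / (q + r + s)

def jaccard(obj_i, obj_j):
    q, r, s, _ = binary_counts(obj_i, obj_j)
    return q / (q + r + s)

# Hypothetical patients described by six asymmetric binary attributes
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

counts = binary_counts(patient_a, patient_b)
d_sym = symmetric_dissimilarity(patient_a, patient_b)
d_asym = asymmetric_dissimilarity(patient_a, patient_b)
sim_jaccard = jaccard(patient_a, patient_b)
print(counts, d_sym, d_asym, sim_jaccard)
```

Ignoring the three shared negatives makes the two patients look less alike under the asymmetric measure than under the symmetric one.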
Dissimilarity of numeric data: Minkowski distance
• Numeric distance measures
• Euclidean
• Manhattan
• Minkowski
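• Data are often normalized before applying distance calculations, e.g., to the range
• [-1,1]
• [0,1]
• Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two objects described by p numeric attributes
• Euclidean Distance
• $d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$
• Manhattan Distance
• $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• Minkowski Distance
• $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$, where h is a real number with $h \ge 1$
• When
• h=1, Manhattan distance
• h=2, Euclidean distance
• h→∞, supremum (Chebyshev) distance
• Supremum Distance
• Also called $L_{max}$, the $L_\infty$ norm, or the Chebyshev distance
• $d(i,j) = \max_{f} |x_{if} - x_{jf}|$
• Weighted Euclidean Distance
• $d(i,j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2}$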
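A short sketch of these distances for two numeric objects, assuming the standard Minkowski form with exponent h, the supremum distance as the largest per-attribute difference, and a weighted Euclidean variant. The points and weights are illustrative.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1); h=1 is Manhattan, h=2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def weighted_euclidean(x, y, weights):
    """Euclidean distance with a weight per attribute."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Assumed points for illustration
x = (1, 2)
y = (3, 5)

manhattan = minkowski(x, y, 1)
euclidean = minkowski(x, y, 2)
chebyshev = supremum(x, y)
weighted = weighted_euclidean(x, y, (1, 1))  # unit weights reduce to Euclidean
print(manhattan, round(euclidean, 4), chebyshev, round(weighted, 4))
```

As h grows, the Minkowski distance approaches the supremum distance, which is why the limiting case is listed under h=∞.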
Proximity measures for ordinal attributes
• Have a meaningful order or ranking
• The magnitude between successive values is unknown
• Eg:
• Cold temperature (-30 to -10)
• Moderate temperature (-10 to 10)
• Warm temperature (10 to 30)
• Let $M_f$ represent the number of possible states of an ordinal attribute f
• With the states ranked 1, …, $M_f$