Topic3 Data Types

Uploaded by

oh nambu
Data Types

• Data sets are made up of data objects

• Data objects are also referred to as

• Samples

• Examples

• Instances

• Data points

• Objects

• tuples

• Data objects are described by attributes


• Rows in a database correspond to data objects

• Columns correspond to attributes


• What is an attribute?

• A data field

• Represents a characteristic or feature of a data object

• Synonyms

• Dimension (data warehousing)

• Feature (machine learning)

• Variable (statistics)
• Data mining and database professionals use the term attribute

• Eg: attributes describing a customer (Customer_ID, name, address)

• The set of attributes that describes a given object is called an

• Attribute vector or feature vector

• A distribution involving only one attribute (e.g., height) is univariate

• A distribution involving two attributes (e.g., height, weight) is bivariate


• Types of attributes

• Nominal

• Binary

• Ordinal

• Numeric
Nominal

• Means ‘relating to names’

• The values are symbols or names of things

• Each value represents

• Category

• Code

• State

• Also called ‘categorical’

• In computer science, the values are known as ‘enumerations’


Example 2.1. Nominal attributes

• Attribute name: hair_color


• Attribute values: black, brown, blond, red, gray, white

• Attribute name: marital_status


• Attribute values: single, married, divorced, widowed

• Attribute name: occupation


• Attribute values: teacher, dentist, programmer, farmer, etc.

• It is also possible to represent symbols or names with numbers

• Attribute name: hair_color


• Attribute values: black(0), brown(1), blond(2), red(3), gray(4), white(5)

• Attribute name: customer_id
• Attribute values: 1, 2, 3, etc.
• The numbers are not to be used quantitatively

• Mathematical operations on values of nominal attributes are not meaningful

• It makes no sense to subtract one customer_id from another

• A nominal attribute may have integers as values

• It is still not considered a numeric attribute

• Because nominal values have no meaningful order and are not quantitative

• It makes no sense to find the mean or median of a nominal attribute

• One value of interest is the most commonly occurring value, the mode

• The mode is a measure of central tendency
Binary Attributes

• A nominal attribute with only two categories or states: 0 or 1

• 0 typically means that the attribute is absent

• 1 means that it is present

• Binary attributes are referred to as ‘Boolean’ if the two states correspond to true and false
Example 2.2: Binary Attributes
• Attribute name: smoker
• Attribute value: 1 (the patient smokes)
• Attribute value: 0 (the patient does not smoke)

• Attribute name: medical_test

• Attribute value: 1 (the result of the test is positive)
• Attribute value: 0 (the result of the test is negative)

• A binary attribute is symmetric if

• Both states are equally important
• Both states carry the same weight

• Eg:
• Attribute name: gender
• Attribute values: 1 (male), 0 (female)
• A binary attribute is asymmetric if
• The two states are not equally important
• One state is usually rarer or more significant than the other

• By convention, the rarer or more significant state is coded as 1 and the other as 0

• Eg:
• Attribute name: HIV
• Attribute values: 1 (HIV positive), 0 (HIV negative)
Ordinal Attributes

• Have a meaningful order or ranking among the values

• The magnitude between successive values is not known

• Example 2.3: ordinal attributes

• Attribute Name: dress_size


• Attribute values: S, M, L, XL,XXL

• Attribute Name: grade


• Attribute values: S, A+, A, B+, B, etc.

• Attribute name: professional_rank
• Attribute values: assistant, associate, professor
• Attribute name: sat_survey
• Attribute values: very dissatisfied, dissatisfied, neutral, satisfied, very satisfied

• The values have a meaningful sequence

• We cannot tell from the values “how much bigger” one is than another

• Ordinal attributes can also be obtained by discretizing numeric quantities

• By splitting the value range into a finite number of ordered categories

• Central tendency of an ordinal attribute can be represented by

• Mode

• Median

• Nominal, binary and ordinal attributes are qualitative


• Numeric Attributes

• Quantitative

• Represented in integer or real values

• Attributes can be interval-scaled/ratio-scaled

• Interval-scaled attributes

• Measured on a scale of equal-size units

• Values have an order

• Can be positive, 0, or negative

• Allow us to compare and quantify the difference between values


• Example 2.4. Interval-scaled attributes

• A value of zero does not mean ‘nothing’

• Can calculate the difference between values

• Can compute mean, median and mode

• 0°C does not mean “no temperature”; it is just a reference point

• The difference between 10°C and 20°C (10°C) is meaningful

• But you cannot say 20°C is “twice as hot” as 10°C

• The difference between 3:00 and 4:00 is 1 hour

• But saying “4:00 is twice as much as 2:00” does not make sense


• Ratio-scaled attributes

• The zero value means ‘nothing’ or ‘absence of the quantity’

• Can calculate both differences & ratios

• Can compute mean, median, mode and multiplicative measures such as ratios

• Example

• 0 K means ‘no thermal energy’

• 0 kg means no mass

• 100 kg is twice as heavy as 50 kg

• $0 means no money; $100 is 100 times as much as $1


• Discrete Vs. Continuous attributes

• Discrete attribute

• Has a finite or countably infinite number of values

• May or may not be represented as integers

• Eg: hair_color has a finite number of values, thus it is discrete


Statistics of Data

• Basic statistical descriptions

• Identify properties of the data

• Highlight which data values should be treated as noise or outliers

• Measures of central tendency

• Mean

• Median

• Mode

• Midrange
• Dispersion of data

• Range

• Quartiles

• Interquartile range

• Five number summary & boxplots

• Variance

• Standard deviation
• Description of relations among multiple variables

• For numerical data

• Co-variance

• Correlation coefficient

• For nominal data

• χ² (chi-square) correlation test
• Statistical or graphical data presentation software packages include

• Bar charts

• Pie charts

• Line graphs

• Displays of data summaries and distributions include

• Quantile plots

• Quantile-quantile plots

• Histograms

• Scatter plots
Measuring the central tendency

Mean

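• The most common measure of the ‘center’ of a set of data

• \( \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \)

• Where,

• \( x_1, x_2, \ldots, x_N \) is a set of N values or observations

• Corresponds to the built-in aggregate function average (avg() in SQL)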


Weighted arithmetic mean or the weighted average

• \( \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \), where \( w_i \) is the weight attached to value \( x_i \)

• Example

\[ \bar{x} = \frac{(50×0.10)+(60×0.05)+(70×0.15)+(80×0.20)+(90×0.25)+(85×0.10)+(75×0.05)+(95×0.05)+(65×0.03)+(55×0.02)}{0.10+0.05+0.15+0.20+0.25+0.10+0.05+0.05+0.03+0.02} = \frac{77.05}{1.00} = 77.05 \]
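
The weighted average in the example can be checked with a few lines of Python. This is a minimal sketch: the function name `weighted_mean` is an assumption made for illustration, and the values and weights are the ones from the example.

```python
# Values and weights copied from the worked example above.
values = [50, 60, 70, 80, 90, 85, 75, 95, 65, 55]
weights = [0.10, 0.05, 0.15, 0.20, 0.25, 0.10, 0.05, 0.05, 0.03, 0.02]


def weighted_mean(xs, ws):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(ws)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(xs, ws)) / total_weight


result = weighted_mean(values, weights)
print(round(result, 2))
```

Because the weights already sum to 1.00, dividing by their total does not change the result here; the division matters when the weights are raw counts or importance scores.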
• Problem with the mean
• Sensitive to outliers: extreme values can corrupt the mean
• Eg: salary drawn by ten employees
Trimmed mean
• The mean obtained after chopping off values at the high and low extremes
• Avoid trimming too large a portion (such as 20%) at both ends

• This may result in the loss of valuable information
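
A small sketch of how one extreme value pulls the mean and how trimming reduces that effect. The salary figures below are hypothetical (the slide's own data are not available), and `trimmed_mean` is an illustrative helper, not a standard library function.

```python
def trimmed_mean(xs, proportion):
    """Mean after dropping `proportion` of the values from EACH end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(xs)
    k = int(len(ordered) * proportion)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical salaries (in thousands) of ten employees; 400 is an outlier.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

plain_mean = sum(salaries) / len(salaries)   # pulled upward by the outlier
trimmed = trimmed_mean(salaries, 0.10)       # drops 30 and 400

print(plain_mean, trimmed)
```

With these assumed numbers the plain mean is far above what nine of the ten employees earn, while the 10% trimmed mean stays close to the bulk of the data.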


Median

• Middle value in a set of ordered data values

• Separates the higher half from lower half

• For skewed (asymmetric) data,

• The median is a better measure of the center of the data


• Median is applied to
• Numerical data
• Ordinal data

Suppose
• A given dataset of N values for an attribute X is sorted in ascending order
• If N is odd
• The median is the middle value
• If N is even
• The median is the average of the two middlemost values

• Expensive to compute for a large number of observations

• If data are grouped in intervals according to their frequency,

• Median approximation can be done via interpolation
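
The odd and even cases of the exact median can be sketched as follows. The input lists are hypothetical values chosen for illustration; Python's standard library also offers `statistics.median`, which gives the same results.

```python
def median(xs):
    """Exact median: sort, then take the middle value (or the mean of two)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # N odd: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # N even: average the two


odd_case = median([7, 1, 5])
even_case = median([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(odd_case, even_case)
```

The sort is what makes the median expensive for large data sets, which is why an interpolated approximation from grouped frequencies is sometimes preferred.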


Mode

• Another measure of central tendency

• The value that occurs most frequently in the set

• Can be determined for qualitative and quantitative attributes

• Data sets with one, two or three modes are called

• Unimodal

• Bimodal

• Trimodal

• In general, a data set with two or more modes is multimodal


• Example.2.8. Mode

• The modes are 52 and 70, hence, bimodal
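
A short sketch of finding every mode of a data set. The numbers are illustrative values chosen so that 52 and 70 tie for the highest frequency; the slide's original data are not reproduced in the text, so treat them as an assumption.

```python
from collections import Counter


def modes(xs):
    """Return all values that share the highest frequency, sorted."""
    counts = Counter(xs)
    top = max(counts.values())  # raises ValueError on empty input
    return sorted(value for value, count in counts.items() if count == top)


# Illustrative data (assumed): 52 and 70 each occur twice.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
found = modes(data)
print(found, "bimodal" if len(found) == 2 else "")

# The same function works for nominal values, where the mean is undefined.
hair_mode = modes(["black", "brown", "black", "red"])
```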

• For unimodal numeric data that are moderately skewed, we have the following empirical relation
• \( \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \)
Midrange

• Also used to assess the central tendency of a numeric data set

• Is the average of the largest and smallest values in the set

• Easy to compute using SQL aggregate functions min(), max()
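
As a minimal sketch, the midrange needs only the minimum and maximum, mirroring the min() and max() aggregates mentioned above. The data values are assumed for illustration.

```python
def midrange(xs):
    """Average of the smallest and largest values in the set."""
    return (min(xs) + max(xs)) / 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
mid = midrange(data)
print(mid)
```

Like the mean, the midrange is sensitive to extreme values, since it depends only on the two end points.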


[Figure panels: symmetric data; positively skewed (right-skewed) asymmetric data; negatively skewed (left-skewed) asymmetric data]
Measuring the dispersion of data

• Measures the dispersion or spread of numeric data

• Measures include
• Range
• Quantiles
• Percentiles
• Interquartile range
• Variance
• Standard deviation

• The five-number summary can be displayed as

• Boxplot

• Useful in identifying outliers


Range, quartiles and interquartile range

• Range

• The difference between the largest and smallest values in the dataset
• Quantiles

• Split the data distribution into equal-size consecutive sets

• Quantiles are points taken at regular intervals of a data distribution

• The kth q-quantile is the value x such that

• At most k/q of the data values are less than x

• At most (q-k)/q of the data values are more than x

• Where 0 < k < q

• There are q-1 q-quantiles


• The 2-quantile is the data point

• That divides the lower and upper halves of the data distribution; it corresponds to the median

• 4-quantiles (commonly referred to as quartiles)

• Three data points that split the data distribution into four equal parts

• 100-quantiles (commonly referred to as percentiles)

• Divide the data distribution into 100 equal-size consecutive sets

• Median, quartiles and percentiles are the most widely used forms of quantiles
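
The relationship between q and the number of cut points can be sketched with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data are illustrative, and the exact cut-point values depend on the interpolation method the library uses, so only the counts and the median are highlighted here.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values

two_quantile = statistics.quantiles(data, n=2)    # 1 cut point: the median
quartiles = statistics.quantiles(data, n=4)       # 3 cut points: Q1, Q2, Q3
percentiles = statistics.quantiles(data, n=100)   # 99 cut points

# There are always q - 1 q-quantiles.
print(len(two_quantile), len(quartiles), len(percentiles))
print(two_quantile[0], statistics.median(data))
```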
Interquartile Range (IQR)

• Distance between first and third quartiles

• Measure of spread

• Gives the range covered by the middle half of the data

• Can be defined as
• \( IQR = Q_3 - Q_1 \)

• Common rule of thumb for identifying suspected outliers

• Any data point smaller than Q1-1.5*IQR is considered a suspected outlier

• Any data point larger than Q3+1.5*IQR is considered a suspected outlier
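
The 1.5×IQR rule of thumb can be sketched as below. The helper name `suspected_outliers` and the data are assumptions for illustration; quartiles come from `statistics.quantiles`, whose interpolation may differ slightly from a textbook's hand calculation.

```python
import statistics


def suspected_outliers(xs):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
flagged = suspected_outliers(data)
print(flagged)
```

The rule only marks values as suspected outliers; whether they are errors or genuine extreme observations still needs a closer look.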


Five-number summary, boxplots and outliers

• Q1, median and Q3 together contain no information about end points (e.g. tails)

• Full summary of shape is obtained by providing lowest & highest data values as well

• Known as five-number summary

• The five-number summary of a distribution consists of, in order,

• Minimum

• Q1

• Median (Q2)

• Q3

• Maximum
• Boxplots

• Popular way of visualizing a distribution

• Incorporates five-number summary

• Ends of the box are at the quartiles so that the box length is IQR

• Median is marked by a line within a box

• Two lines (called whiskers) outside the box extend to

• The smallest and largest observations

• Whiskers are extended to the extreme low and high observations

• Only if

• These values are less than 1.5×IQR beyond the quartiles

• Otherwise,

• Whiskers terminate at the most extreme observations occurring within 1.5×IQR of the quartiles
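
A sketch of the numbers a boxplot is drawn from: the five-number summary and the whisker end points under the 1.5×IQR rule. Function names and data are illustrative assumptions, and the quartile values follow the interpolation used by `statistics.quantiles`.

```python
import statistics


def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)
    return min(xs), q1, q2, q3, max(xs)


def whisker_ends(xs):
    """Most extreme observations lying within 1.5*IQR of the quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    reach = 1.5 * (q3 - q1)
    inside = [x for x in xs if q1 - reach <= x <= q3 + reach]
    return min(inside), max(inside)


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
summary = five_number_summary(data)
whiskers = whisker_ends(data)
print(summary, whiskers)
```

With these values the upper whisker stops at 70 rather than at the maximum, because 110 lies more than 1.5×IQR above Q3 and would be plotted individually.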


Variance and Standard Deviation

• Variance and standard deviation are measures of data dispersion

• Indicate how spread out a data distribution is

• A low standard deviation means

• that the data observations tend to be very close to mean

• A high standard deviation indicates

• That the data are spread out over a large range of values
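• The variance of N observations \( x_1, x_2, \ldots, x_N \) for a numeric attribute X is

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 \]

• Where,

• \( \bar{x} \) is the mean value of the observations

• The standard deviation, \( \sigma \), is the square root of the variance, \( \sigma^2 \)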
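
A minimal sketch of the population variance, computed both from its definition (mean squared deviation from the mean) and from the equivalent shortcut (mean of squares minus squared mean), followed by the standard deviation. The data values are assumed for illustration.

```python
import math


def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_shortcut(xs):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean ** 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
var = variance(data)
std = math.sqrt(var)
print(round(var, 2), round(std, 2))
```

Both forms divide by N (population variance). Sample variance, which divides by N - 1, is a different estimator and is not what this sketch computes.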


• Basic properties of the standard deviation, \( \sigma \)

• Measures the spread about the mean

• Should be considered only when the mean is chosen as the measure of center

• \( \sigma = 0 \) only when there is no spread, that is, when all observations have the same value

• \( \sigma > 0 \) otherwise
• A small standard deviation means most data points are close to the mean

• A large standard deviation means data points are spread out

• An observation is unlikely to be more than several standard deviations away from mean

• Chebyshev’s inequality shows that

• At least \( \left(1 - \frac{1}{k^2}\right) \times 100\% \) of the data points are within k standard deviations of the mean

• Where k is a number greater than 1
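
Chebyshev's bound, at least 1 - 1/k² of the observations within k standard deviations of the mean, can be checked empirically. This is a sketch on assumed data; the helper names are illustrative.

```python
import math


def fraction_within(xs, k):
    """Fraction of observations within k standard deviations of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(1 for x in xs if abs(x - mean) <= k * std) / n


def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2, meaningful for k > 1."""
    return 1 - 1 / k ** 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
for k in (1.5, 2, 3):
    print(k, fraction_within(data, k), chebyshev_bound(k))
```

The bound holds for any distribution, so the observed fraction is usually well above it; the inequality guarantees a floor, not an estimate.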


Covariance and Correlation Analysis

• In probability theory and statistics

• Correlation & covariance are two similar measures for assessing

• How much two attributes change together

Covariance of numeric data

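• Consider two numeric attributes A and B

• And a set of n real-valued observations \( \{(a_1, b_1), \ldots, (a_n, b_n)\} \)

• The mean values of A and B, \( E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n} \) and \( E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n} \),

• Are also known as the expected values of A and B

• Covariance between A and B is defined as

\[ Cov(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n} \]

• Can also be shown as \( Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B} \)

• If, whenever a value \( a_i \) of A is larger than \( \bar{A} \),

• The value \( b_i \) of attribute B is likely to be larger than \( \bar{B} \),

• Then the attributes A and B tend to change together

• Therefore, the covariance between A and B is positive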

• If one of the attributes tends to be above its expected value

• When the other attribute is below its expected value,

• Then the covariance of A and B is negative


• If A and B are independent (they do not have correlation)

• Their values do not influence each other

• Which can be mathematically denoted as \( E(A \cdot B) = E(A) \cdot E(B) \)

• This means \( Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B} = E(A) \cdot E(B) - \bar{A}\bar{B} = 0 \)

• Thus, when A and B are independent, their covariance is always zero

• The converse does not hold: A and B may have 0 covariance and still be dependent
• Variance is a special case of covariance

• Where the two attributes are identical

• If the two attributes A and B are the same, \( Cov(A,A) = E\big((A - \bar{A})(A - \bar{A})\big) \)

• This simplifies to \( Cov(A,A) = E\big((A - \bar{A})^2\big) = \sigma_A^2 \)
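
The points above can be sketched in code: covariance from its definition, the equivalent form E(A·B) - E(A)·E(B), variance as the covariance of an attribute with itself, and a dependent pair whose covariance is nevertheless zero. All data values are hypothetical and the function names are illustrative.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(a, b):
    """Average product of paired deviations from the two means."""
    if len(a) != len(b):
        raise ValueError("attributes must have the same number of observations")
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def covariance_shortcut(a, b):
    """Equivalent form: E(A*B) - E(A)*E(B)."""
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)


# Hypothetical paired observations of two attributes that rise together.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
cov_ab = covariance(a, b)   # positive: A and B tend to change together
var_a = covariance(a, a)    # variance as a special case of covariance

# Zero covariance does not imply independence: here y is exactly x squared.
x = [-1, 0, 1]
y = [v * v for v in x]
cov_xy = covariance(x, y)

print(cov_ab, var_a, cov_xy)
```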
Correlation coefficient for numeric data

• Evaluates the correlation between two numeric attributes, A and B

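• Also known as Pearson’s product moment coefficient

\[ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B} \]

• Where n is the number of observations, \( a_i \) and \( b_i \) are the values of A and B in observation i, and \( \bar{A} \) and \( \bar{B} \) are their mean values

• \( \sigma_A \) and \( \sigma_B \) are the standard deviations of A and B

• The standard deviation, \( \sigma \), is the square root of the variance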
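
A sketch of Pearson's coefficient as covariance divided by the product of the two standard deviations (all computed with divisor n). The data are hypothetical, and the function name `pearson_r` is an assumption for illustration.

```python
import math


def pearson_r(a, b):
    """Correlation coefficient: Cov(A, B) / (sigma_A * sigma_B)."""
    if len(a) != len(b):
        raise ValueError("attributes must have the same number of observations")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    # A constant attribute has sigma = 0 and makes this division fail.
    return cov / (sigma_a * sigma_b)


# Hypothetical paired observations.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
r = pearson_r(a, b)
print(round(r, 4))
```

The result always lies between -1 and +1: values above 0 indicate positive correlation, values below 0 negative correlation, and 0 no linear correlation.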
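χ² correlation test for nominal data

• For nominal data,

• Correlation between two attributes A and B can be discovered by a χ² (chi-square) test

• Suppose A has c distinct values, B has r distinct values, and there are n data tuples

\[ \chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]

• \( o_{ij} \) is the observed frequency of the joint event \( (A_i, B_j) \)

• \( e_{ij} \) is the expected frequency of \( (A_i, B_j) \), \( e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n} \)

• The χ² test checks the hypothesis that A and B are independent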


• For this 2×2 table,

• The degrees of freedom are (2-1)×(2-1) = 1

• For 1 degree of freedom,

• The χ² value needed to reject the hypothesis

• At the 0.001 significance level is 10.828

• Because the computed χ² value is above this threshold, we can reject the hypothesis that gender and preferred_reading are independent

• And conclude that the two attributes are correlated
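
A sketch of the χ² computation for a contingency table of observed counts. The counts below are hypothetical (the slide's table did not survive in the text); only the 10.828 threshold for 1 degree of freedom at the 0.001 level comes from the text above.

```python
def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Hypothetical counts. Rows: fiction, non_fiction. Columns: male, female.
observed = [
    [250, 200],
    [50, 1000],
]
chi2, dof = chi_square(observed)
reject_independence = chi2 > 10.828  # threshold for dof = 1 at the 0.001 level
print(round(chi2, 2), dof, reject_independence)
```

The threshold is hard-coded for the 2×2 case only; other table sizes or significance levels need the matching critical value from a χ² table.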


Graphic displays of basic statistics of data

• Graphic displays of basic statistical descriptions include

• Quantile plots

• Quantile-quantile plots

• Histograms

• Scatter plots

• These graphs are helpful for the visual inspection of data

• Useful for preprocessing


Quantile plot

• Simple and effective way to look at a univariate data distribution

• Displays all of the data for the given attribute

• Plots quantile information

• Data are sorted in ascending order

• Let \( x_i \), for i = 1 to N, be the data sorted in ascending order

• Each observation \( x_i \) is paired with a percentage \( f_i \)

• Indicates that approximately \( f_i \times 100\% \) of the data are below the value \( x_i \)

• \( f_i = \frac{i - 0.5}{N} \)

• These numbers increase in equal steps of 1/N

• Ranging from 1/(2N) to 1 - 1/(2N)
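
The points of a quantile plot can be sketched by pairing each sorted observation with f = (i - 0.5)/N, which starts at 1/(2N) and rises in steps of 1/N. The unit prices below are hypothetical, and plotting itself is left out.

```python
def quantile_plot_points(xs):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(xs)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


unit_prices = [74, 40, 115, 43, 78, 47, 120, 75, 117]  # hypothetical values
points = quantile_plot_points(unit_prices)
for f, x in points:
    print(round(f, 3), x)
```

Plotting f on the horizontal axis against x on the vertical axis gives the quantile plot.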

[Figure: A quantile plot for the unit price data]


Quantile-quantile plot (q-q plot)

• Graphs quantiles of one univariate distribution against another

• Powerful visualization tool

• Allows user to view whether there is a shift in going from one distribution to another
• Suppose we have two sets of observations for the attribute/variable ‘Unit price’

• Taken from two different branch locations

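• Let

• \( x_1, \ldots, x_N \) be the data from the first branch

• \( y_1, \ldots, y_M \) be the data from the second, where each data set is sorted in ascending order

• If M = N, then we simply plot \( y_i \) against \( x_i \),

• Where \( y_i \) and \( x_i \) are both \( (i - 0.5)/N \) quantiles of their respective data sets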
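
For the equal-size case (M = N) described above, the q-q plot points are simply the two sorted data sets paired position by position. This sketch covers only that case; the branch data are hypothetical.

```python
def qq_points_equal_size(first, second):
    """Pair the i-th smallest value of each data set (requires M == N)."""
    if len(first) != len(second):
        raise ValueError("this sketch only covers the M == N case")
    return list(zip(sorted(first), sorted(second)))


# Hypothetical unit prices observed at two branches.
branch_1 = [40, 74, 43, 115, 78]
branch_2 = [45, 80, 52, 118, 85]
pairs = qq_points_equal_size(branch_1, branch_2)
print(pairs)

# Points above the line y = x mean the second branch has the higher quantile.
shifted_up = all(y > x for x, y in pairs)
```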


[Figure: A q-q plot for unit price data from two branches of the online store]
Histograms

• ‘Histos’ means pole or mast

• ‘gram’ means chart

• A histogram is a chart of poles

• The range of values for X is partitioned into a set of disjoint consecutive subranges

• According to the number of poles desired

• Subranges are referred to as buckets or bins

• The range of a bucket is known as its width


• Eg: a price attribute with a value range of $1-$200 can be partitioned into subranges

• 1-20, 21-40, 41-60 …181-200

• A bar is drawn for each subrange

• Its height represents the total count of items observed within the subrange

• Shows data skew

• Different from a bar chart

• Which is used to represent a set of categorical data and the size of each category
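
The bucketing step behind a histogram can be sketched with equal-width buckets 1-20, 21-40, and so on up to 181-200, as in the price example. The prices are hypothetical, and the sketch assumes whole-number prices within that range.

```python
from collections import Counter


def equal_width_histogram(prices, low=1, high=200, width=20):
    """Count integer prices per bucket; keys are (start, end) inclusive."""
    counts = Counter()
    for p in prices:
        if not low <= p <= high:
            raise ValueError(f"price {p} is outside {low}-{high}")
        start = low + ((p - low) // width) * width
        counts[(start, start + width - 1)] += 1
    return dict(sorted(counts.items()))


prices = [5, 20, 21, 39, 40, 41, 200]  # hypothetical integer prices
hist = equal_width_histogram(prices)
for (start, end), count in hist.items():
    print(f"{start}-{end}: {count}")
```

Each count would be drawn as the height of the bar over its bucket; empty buckets are simply absent from the result here.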


Scatter plots and data correlation

• Determine whether there appears to be a

• Relationship

• Pattern

• Trend

• Between two numeric attributes

• Each pair of values is treated as a pair of coordinates and plotted as a point


• Useful method for providing a first look at bivariate data to see

• Clusters of points

• Outliers

• And to explore the possibility of correlation relationships

• If two attributes are correlated,

• Knowledge of one attribute helps to predict the other attribute

• If the plotted points show no pattern, there is no correlation

• Can be extended to n attributes, resulting in a scatter-plot matrix


Similarity and Distance Measures

• Referred to as measures of proximity

• Similarity measure for two objects i and j, returns

• 0 if objects are unalike

• 1 if objects are identical

• Dissimilarity measure for two objects i and j returns

• 0 if the objects are the same

• 1 if the objects are completely dissimilar

• Two types of data structures are commonly used in data mining

• Data matrix (used to store data objects)

• Dissimilarity matrix (used to store dissimilarity values for pairs of objects)


Data matrix vs. Dissimilarity matrix

• An object can be described by multiple attributes

• Data matrix (object-by-attribute)

• Stores the n data objects in the form of a relational table or n-by-p matrix (n objects × p attributes)

• Each row corresponds to an object

• Each column corresponds to an attribute


• Dissimilarity Matrix (object-by-object structure)

• Stores a collection of proximities

• Available for all pairs of n objects

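• It is often represented by an n-by-n table

\[ \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \]

• Where d(i,j) is the dissimilarity between objects i and j

• d(i,j) is a non-negative number

• Close to 0 when objects i and j are highly similar or ‘near’ each other

• Becomes larger the more they differ

• d(i,i) = 0

• d(i,j) = d(j,i)

• Measures of similarity can be expressed as a function of measures of dissimilarity

• \( sim(i,j) = 1 - d(i,j) \)

• Where sim(i,j) is the similarity between two objects i and j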


• Data matrix is made up of two entities

• Rows

• Columns

• Thus, it is called a two-mode matrix

• A dissimilarity matrix contains only one kind of entity, dissimilarity values

• Thus, it is called a one-mode matrix
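
A sketch of turning a data matrix (objects as rows, attributes as columns) into a dissimilarity matrix (objects by objects). The Manhattan distance is used here only as a stand-in dissimilarity, and the four two-attribute objects are hypothetical.

```python
def dissimilarity_matrix(data, d):
    """Build the n-by-n matrix of d(i, j) from an n-by-p data matrix."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]   # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i):
            # Symmetry d(i, j) = d(j, i): compute once, store twice.
            matrix[i][j] = matrix[j][i] = d(data[i], data[j])
    return matrix


def manhattan(x, y):
    """Stand-in dissimilarity: sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical data matrix: 4 objects (rows) by 2 attributes (columns).
data = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = dissimilarity_matrix(data, manhattan)
for row in matrix:
    print(row)
```

Because of the symmetry, only the lower triangle needs to be computed or stored, which is why the matrix is often written in triangular form.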


Proximity measures for nominal attributes

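• Dissimilarity between two objects i and j can be computed as

• \( d(i,j) = \frac{p - m}{p} \)

• Where,

• m is the number of matches (attributes for which i and j are in the same state)

• p is the total number of attributes

• Similarly, similarity can be computed as

• \( sim(i,j) = 1 - d(i,j) = \frac{m}{p} \)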
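
The mismatch ratio for nominal attributes, d(i,j) = (p - m)/p, can be sketched as follows. The two objects and their attribute values are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matching attributes out of p in total."""
    if len(x) != len(y):
        raise ValueError("objects must have the same attributes")
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


def nominal_similarity(x, y):
    """sim(i, j) = 1 - d(i, j) = m / p."""
    return 1 - nominal_dissimilarity(x, y)


# Hypothetical objects described by (hair_color, marital_status, occupation).
obj_i = ("black", "single", "teacher")
obj_j = ("black", "married", "teacher")
d_ij = nominal_dissimilarity(obj_i, obj_j)
sim_ij = nominal_similarity(obj_i, obj_j)
print(d_ij, sim_ij)
```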


Proximity measures for binary attributes

• If all binary attributes have the same weight, a 2×2 contingency table is used

Count   Object i   Object j
q       1          1
r       1          0
s       0          1
t       0          0

• Where,

• q, r, s and t are the numbers of attributes with each combination of values

• p = q+r+s+t is the total number of attributes
• In symmetric binary attributes

• Each state is equally valuable

• Dissimilarity that is based on symmetric binary attributes is called

• Symmetric binary dissimilarity

• If objects i and j are described by symmetric binary attributes, then the dissimilarity between i and j is
• \( d(i,j) = \frac{r + s}{q + r + s + t} \)
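• For asymmetric binary attributes, the two states are not equally important

• Positive (1)

• Negative (0)

• Agreement of two 1s (a positive match) is considered more significant than agreement of two 0s (a negative match)

• Such attributes are often considered ‘monary’ (having one state)

• Dissimilarity based on these attributes is called asymmetric binary dissimilarity

• The number of negative matches, t, is considered unimportant and is ignored

• \( d(i,j) = \frac{r + s}{q + r + s} \)

• Asymmetric binary similarity between the objects i and j can be computed as

• \( sim(i,j) = \frac{q}{q + r + s} = 1 - d(i,j) \)

• sim(i,j) is called the Jaccard coefficient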
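
A sketch of the binary proximity measures built from the contingency counts q (1,1), r (1,0), s (0,1) and t (0,0). The two 0/1 vectors are hypothetical, and the function names are illustrative.

```python
def contingency(x, y):
    """Count attribute pairs: q=(1,1), r=(1,0), s=(0,1), t=(0,0)."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_binary_dissimilarity(x, y):
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(x, y):
    q, r, s, _ = contingency(x, y)   # negative matches t are ignored
    return (r + s) / (q + r + s)     # undefined if both objects are all 0s


def jaccard(x, y):
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)


# Hypothetical objects described by six binary attributes.
obj_i = (1, 0, 1, 0, 0, 0)
obj_j = (1, 1, 0, 0, 0, 0)
print(contingency(obj_i, obj_j))
print(symmetric_binary_dissimilarity(obj_i, obj_j))
print(asymmetric_binary_dissimilarity(obj_i, obj_j), jaccard(obj_i, obj_j))
```

The three shared 0s lower the symmetric dissimilarity but have no effect on the asymmetric one, which is the point of ignoring negative matches.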


Dissimilarity of numeric data: Minkowski distance

• Numeric distance measures

• Euclidean

• Manhattan

• Minkowski

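• Data are normalized before applying distance calculations, for example to the range

• [-1,1]
• [0,1]
• Let \( i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \) and \( j = (x_{j1}, x_{j2}, \ldots, x_{jp}) \) be two objects described by p numeric attributes
• Euclidean Distance

\[ d(i,j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2} \]

• Manhattan Distance

\[ d(i,j) = \sum_{f=1}^{p} |x_{if} - x_{jf}| \]
• Minkowski Distance

\[ d(i,j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}, \quad h \geq 1 \]

• When

• h=1, Manhattan distance

• h=2, Euclidean distance

• h=∞, Chebyshev distance

• Supremum Distance

\[ d(i,j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| \]

• Also called \( L_{max} \), the \( L_\infty \) norm or the Chebyshev distance
• Weighted Euclidean Distance

\[ d(i,j) = \sqrt{\sum_{f=1}^{p} w_f (x_{if} - x_{jf})^2} \], where \( w_f \) is the weight of attribute f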
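
The family of distances above can be sketched with one Minkowski function plus two small helpers. The two points and the weights are hypothetical, and normalization is assumed to have been done beforehand.

```python
import math


def minkowski(x, y, h):
    """(sum |x_f - y_f|^h)^(1/h); h=1 is Manhattan, h=2 is Euclidean."""
    if h < 1:
        raise ValueError("h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a weight per attribute."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Hypothetical objects with two numeric attributes.
x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))      # Manhattan
print(minkowski(x1, x2, 2))      # Euclidean
print(supremum(x1, x2))          # Chebyshev / supremum
print(minkowski(x1, x2, 50))     # already very close to the supremum
print(weighted_euclidean(x1, x2, (0.5, 2.0)))
```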
Proximity measures for ordinal attributes

• Have a meaningful order or ranking

• The magnitude between successive values is unknown

• Eg:

• Cold temperature (-30 to -10)

• Moderate temperature (-10 to 10)

• Warm temperature (10 to 30)

• Let Mf represent the number of possible states

• With states ranking 1,….,Mf

