Module 1
Class - T.E.
Course Code – CSDLO5011
By
Prof. A.V.Phanse
Exploratory Data Analysis
In 1962, John W. Tukey proposed a new scientific discipline called data analysis
that included statistical inference as just one component.
Tukey presented simple plots (e.g., boxplots, scatterplots) that, along with
summary statistics (mean, median, quantiles, etc.), help paint a picture of a
data set.
John Tukey
Elements of Structured Data
Data comes from many sources: sensor measurements, events, text, images, and
videos.
Much of this data is unstructured: images are a collection of pixels, with each
pixel containing RGB (red, green, blue) color information. Texts are sequences of
words and nonword characters, often organized by sections, subsections, and so
on. Clickstreams are sequences of actions by a user interacting with an app or a
web page.
A major challenge of data science is to harness this raw data into actionable
information.
To apply the statistical concepts, unstructured raw data must be processed and
manipulated into a structured form.
One of the common forms of structured data is a table with rows and columns.
There are two basic types of structured data: numeric and categorical.
Numeric data comes in two forms: continuous, such as wind speed or time
duration, and discrete, such as the count of the occurrence of an event.
Categorical data takes only a fixed set of values, such as a type of TV screen
(plasma, LCD, LED, etc.) or a state name (Maharashtra, Goa, etc.).
Binary data is an important special case of categorical data that takes on only
one of two values, such as 0/1, yes/no, or true/false.
Another useful type of categorical data is ordinal data in which the categories
are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
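These data types map directly onto column types in pandas. The sketch below (column names and values are illustrative) shows continuous, discrete, categorical, and ordinal columns; marking the rating column as an ordered categorical lets pandas respect the 1–5 ordering:

```python
import pandas as pd

# Continuous, discrete, categorical, and ordinal columns in one table
df = pd.DataFrame({
    "wind_speed": [12.5, 7.3, 9.8],                        # continuous numeric
    "event_count": [3, 0, 5],                              # discrete numeric
    "screen_type": pd.Categorical(["LCD", "LED", "LCD"]),  # categorical
    "rating": pd.Categorical([3, 5, 4],
                             categories=[1, 2, 3, 4, 5],
                             ordered=True),                # ordinal
})

print(df.dtypes)
print(df["rating"].min())  # ordering makes comparisons meaningful
```

Storing a column as an ordered categorical also documents the allowed values, so invalid codes are caught early.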
Rectangular Data
Rectangular data is the general term for a two-dimensional matrix with rows
indicating records (cases) and columns indicating features (variables).
The data doesn’t always start in this form: unstructured data must be processed
and manipulated so that it can be represented as a set of features in the
rectangular data.
Data in relational databases must be extracted and put into a single table for
most data analysis and modeling tasks.
Feature –
A column within a table is commonly referred
to as a feature
Records –
A row within a table is commonly referred to
as a record.
Nonrectangular Data Structures
There are other data structures besides rectangular data.
Time series data records successive measurements of the same variable. It is the
raw material for statistical forecasting methods, and it is also a key component
of the data produced by devices—the Internet of Things.
Spatial data structures, which are used in mapping and location analytics, are
more complex and varied than rectangular data structures.
In the object representation, the focus of the data is an object (e.g., a house)
and its spatial coordinates. The field view, by contrast, focuses on small units of
space and the value of a relevant metric (e.g. pixel brightness).
Graph (or network) data structures are used to represent physical, social, and
abstract relationships. For example, a graph of a social network, such as
Facebook or LinkedIn, may represent connections between people on the
network.
Mean –
For calculating the mean of grouped data, we calculate the class mark. For this, the
midpoints of the class intervals are calculated as:
Class mark = (lower class limit + upper class limit) / 2
The mean is then x̄ = Σ fᵢxᵢ / Σ fᵢ, where fᵢ is the frequency of the ith class and
xᵢ is its class mark.
Trimmed Mean –
A variation of the mean is a trimmed mean, which you calculate by dropping a
fixed number of sorted values at each end and then taking an average of the
remaining values.
For example, in a dance competition the top score and bottom score from five
judges are dropped, and the final score is the average of the scores from the
three remaining judges.
This makes it difficult for a single judge to manipulate the score, perhaps to
favor any contestant.
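The dance-competition rule above can be sketched in NumPy (the five scores are illustrative); SciPy's `trim_mean` gives the same result when told to trim 20% (1 of 5 values) from each end:

```python
import numpy as np
from scipy import stats

# Five judges' scores (illustrative); drop the top and bottom score, average the rest
scores = np.array([6.0, 9.5, 8.0, 7.5, 9.0])

trimmed = np.sort(scores)[1:-1].mean()        # manual trimmed mean
scipy_trimmed = stats.trim_mean(scores, 0.2)  # trim 20% (1 value) from each end

print(trimmed)
```

A single extreme score (here 6.0) no longer moves the final result, which is exactly the protection against manipulation the rule is designed for.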
Weighted mean –
Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight.
For example, if we are taking the average from multiple sensors and one of the
sensors is less accurate, then we might downweight the data from that sensor.
In a class of 30 students, marks obtained by students in mathematics out of 50 is
tabulated below. Calculate the mean of the data.
Median –
If there is an even number of data values, the middle value is one that is not
actually in the data set, but rather the average of the two values that divide the
sorted data into upper and lower halves.
Compared to the mean, which uses all observations, the median depends only
on the values in the center of the sorted data.
While this might seem to be a disadvantage, since the mean is much more
sensitive to the data, there are many instances in which the median is a better
metric for location.
An outlier is any value that is very distant from the other values in a data set.
The presence of an outlier does not by itself make a data value invalid. Still,
outliers are often the result of data errors such as mixing data of different
units (kilometers versus meters) or bad readings from a sensor.
When outliers are the result of bad data, the mean will result in a poor estimate
of location, while the median will still be valid.
In any case, outliers should be identified and are usually worthy of further
investigation.
The median is not the only robust estimate of location. In fact, a trimmed mean
is widely used to avoid the influence of outliers.
For example, trimming the bottom and top 10% of the data will provide
protection against outliers in all but the smallest data sets.
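The robustness of the median can be seen with a small sketch (the values are illustrative, mimicking a unit-mixing error like the one described above):

```python
import numpy as np

# Distances in kilometers; the last value was recorded in meters by mistake
distances = np.array([1.2, 1.5, 1.4, 1.3, 1400.0])  # 1400.0 is a data error

print(np.mean(distances))    # badly distorted by the outlier
print(np.median(distances))  # still a sensible estimate of location
```

One bad reading drags the mean far from the bulk of the data, while the median stays at a typical value.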
At the heart of statistics lies variability: measuring it, reducing it, distinguishing
random from real variability, identifying the various sources of real variability,
and making decisions in the presence of it.
The most widely used estimates of variation are based on the differences, or
deviations, between the estimate of location and the observed data.
For a set of data {1, 4, 4}, the mean is 3 and the median is 4.
The deviations from the mean are the differences:
1 – 3 = –2, 4 – 3 = 1, 4 – 3 = 1.
These deviations tell us how dispersed the data is around the central value.
One way to measure variability is to estimate a typical value for these
deviations.
Averaging the deviations themselves would not tell us much. The negative
deviations offset the positive ones.
In fact, the sum of the deviations from the mean is precisely zero.
Instead, a simple approach is to take the average of the absolute values of the
deviations from the mean.
In the preceding example, the absolute value of the deviations is {2 1 1}, and
their average is (2 + 1 + 1) / 3 = 1.33.
This is known as the mean absolute deviation and is computed with the
formula:
Mean absolute deviation = Σ |xᵢ − x̄| / n
where x̄ is the sample mean and n is the number of observations.
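The worked example {1, 4, 4} can be checked directly in NumPy:

```python
import numpy as np

data = np.array([1, 4, 4])
mean = data.mean()  # 3.0

# Mean absolute deviation: average of the absolute deviations from the mean
mad = np.abs(data - mean).mean()
print(mad)  # (2 + 1 + 1) / 3 = 1.333...
```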
The best-known estimates of variability are the variance and the standard
deviation, which are based on squared deviations.
Variance
The variance is an average of the squared deviations, and the standard deviation is
the square root of the variance
The standard deviation is much easier to interpret than the variance since it is
on the same scale as the original data.
Neither the variance, the standard deviation, nor the mean absolute deviation
is robust to outliers and extreme values
The variance and standard deviation are especially sensitive to outliers since
they are based on the squared deviations.
A robust estimate of variability is the median absolute deviation from the median
or MAD:
where m is the median. Like the median, the MAD is not influenced by extreme values.
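All three estimates of variability can be computed in a few lines of NumPy (the data set below is illustrative). Note that `ddof=0` gives the population versions, which divide by n; use `ddof=1` for the sample versions, which divide by n − 1:

```python
import numpy as np

data = np.array([3, 6, 7, 11, 13])

variance = data.var(ddof=0)  # average of the squared deviations from the mean
std_dev = data.std(ddof=0)   # square root of the variance

# Median absolute deviation from the median (MAD): robust to outliers
m = np.median(data)
mad = np.median(np.abs(data - m))

print(variance, std_dev, mad)
```

Replacing 13 with an extreme value like 130 inflates the variance and standard deviation dramatically but leaves the MAD unchanged, which is the robustness the text describes.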
Estimates Based on Percentiles
The most basic measure is the range: the difference between the largest and
smallest numbers.
The minimum and maximum values themselves are useful to know and are
helpful in identifying outliers.
But, the range is extremely sensitive to outliers and not very useful as a general
measure of dispersion in the data.
To avoid the sensitivity to outliers, we can look at the range of the data after
dropping values from each end. Formally, these types of estimates are based on
differences between percentiles.
Quartiles:
Data can be divided into four regions that cover the total range of observed values.
Cut points for these regions are known as quartiles.
Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.
An example with 15 numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
In the above example, Q1 is the ((15+1)/4)th = 4th observation of the data. The
4th observation is 11, so Q1 of this data is 11.
Inter-quartile Range:
Difference between Q3 and Q1.
Inter-quartile range of the previous example is 61 − 11 = 50.
The middle half of the ordered data lie between 11 and 61.
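The quartile example above can be reproduced with `np.quantile`. The `method="weibull"` option (available in NumPy 1.22 and later) implements the (n+1)p positioning rule used in the text:

```python
import numpy as np

data = np.array([3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94])

# method="weibull" places the pth quantile at position (n + 1) * p,
# so Q1 is the ((15 + 1)/4) = 4th ordered observation
q1, q3 = np.quantile(data, [0.25, 0.75], method="weibull")
iqr = q3 - q1

print(q1, q3, iqr)  # 11.0 61.0 50.0
```

Note that NumPy's default interpolation method can give slightly different quartile values; the method must match the convention being taught.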
Deciles:
If data is ordered and divided into 10 parts, then cut points are called Deciles
Percentiles:
If data is ordered and divided into 100 parts, then cut points are called
Percentiles.
The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th
percentile of the data is Q3.
In notation, the pth percentile of the data is the (p(n+1)/100)th observation of
the data, where p is the desired percentile and n is the number of observations.
Coefficient of Variation:
The coefficient of variation (CV) is the ratio of the standard deviation to the
mean, usually expressed as a percentage:
CV = (standard deviation / mean) × 100
It allows comparison of variability across data sets measured on different scales.
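As a quick sketch, the coefficient of variation (standard deviation divided by the mean, times 100) can be computed for the temperature data from practice problem 5 below:

```python
import numpy as np

temps = np.array([18, 22, 19, 25, 12])  # five daily temperatures

mean = temps.mean()       # 19.2
std = temps.std(ddof=0)   # population standard deviation

cv = std / mean * 100     # coefficient of variation, as a percentage
print(round(cv, 2))
```

Because the CV is unit-free, it can compare, say, variability of temperatures in °C against variability of rainfall in mm.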
3. The mean of the following distribution is 26. Find the value of p and also the
value of the observation.
4. If a die is rolled, then find the variance and standard deviation of the possibilities.
5. Find the standard deviation of the average temperatures recorded over a five
day period last winter: 18, 22, 19, 25, 12
6. A survey of 36 students of a class was done to find out the mode of transport
used by them while commuting to the school. The collected data is shown in the
table given below. Represent the data in the form of a bar graph.
7. Construct a frequency distribution table for the following weights (in gm) of 30
oranges using equal class intervals, one of them being 40-45 (45 not included).
The weights are: 31, 41, 46, 33, 44, 51, 56, 63, 71, 71, 62, 63, 54, 53, 51, 43, 36, 38,
54, 56, 66, 71, 74, 75, 46, 47, 59, 60, 61, 63.
(a) What is the class mark of the class interval 50-55?
(b) What is the range of the above weights?
(c) How many class intervals are there?
(d) Which class interval has the lowest frequency?
Exploring the Data Distribution
Each of the estimates we’ve covered sums up the data in a single number to
describe the location or variability of the data.
It is also useful to explore how the data is overall distributed.
Boxplots, introduced by Tukey (1977), are based on percentiles and give a quick
way to visualize the distribution of data.
From this boxplot we can immediately see that the
median state population is about 5 million, half the
states fall between about 2 million and about 7
million, and there are some high population outliers.
The top and bottom of the box are the 75th and
25th percentiles, respectively.
The median is shown by the horizontal line in the
box.
The dashed lines, referred to as whiskers, extend
from the top and bottom of the box to indicate the
range for the bulk of the data.
Boxplot of state populations
Frequency Tables and Histograms:
A frequency table shows how many times each value appears in a dataset.
It’s a useful way to organize and summarize data. However, it can be harder to
see patterns in the data just by looking at a frequency table.
A histogram makes these patterns easier to see: each bar represents a range of
values, and the height of the bar represents the frequency of values in that range.
This can make it easier to see patterns in the data, such as if the data is evenly
distributed or skewed to one side.
Suppose we collect the exam scores of 20 students in some class:
Scores: 50, 58, 62, 65, 70, 71, 72, 74, 74, 78, 81, 82, 82, 85, 87, 88, 89, 92, 94, 96
The x-axis of the histogram displays bins of data values and the y-axis tells us how
many observations in a dataset fall in each bin.
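The binning behind a histogram can be computed with `np.histogram`; for the 20 exam scores above, bins of width 10 (the bin edges are an illustrative choice) give the bar heights directly:

```python
import numpy as np

scores = [50, 58, 62, 65, 70, 71, 72, 74, 74, 78,
          81, 82, 82, 85, 87, 88, 89, 92, 94, 96]

# Count how many scores fall into each bin; [50, 60) means 50 inclusive, 60 exclusive
counts, edges = np.histogram(scores, bins=[50, 60, 70, 80, 90, 100])
print(list(counts))  # [2, 2, 6, 7, 3]
```

`matplotlib.pyplot.hist(scores, bins=edges)` would draw the same counts as bars.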
Density Plots and Estimates
A density plot shows the distribution of data values as a continuous line.
A density plot can be thought of as a smoothed histogram, although it is typically
computed directly from the data through a kernel density estimate.
A density plot corresponds to plotting the histogram as a proportion rather than
counts.
Note that the total area under the density curve is 1 (or 100%), and instead of
counts in bins you calculate areas under the curve between any two points on the
x-axis, which correspond to the proportion of the distribution lying between
those two points.
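A kernel density estimate for the exam scores can be sketched with SciPy's `gaussian_kde`; integrating the resulting curve numerically confirms that the total area is 1:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import trapezoid

scores = [50, 58, 62, 65, 70, 71, 72, 74, 74, 78,
          81, 82, 82, 85, 87, 88, 89, 92, 94, 96]

kde = gaussian_kde(scores)      # kernel density estimate from the raw data
xs = np.linspace(0, 150, 2000)  # grid wide enough to capture the tails
density = kde(xs)

area = trapezoid(density, xs)   # numerical area under the density curve
print(area)  # approximately 1.0
```

The area between any two x-values, computed the same way over a restricted grid, gives the proportion of the distribution lying between them.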
Advantages of Density Plots
- Density plots show the actual distribution of data values without being
  distorted by irregularities and outliers.
- In a density plot, there is no limitation on the number of intervals; the more
  the intervals, the better the visualization of the data.
- When working with a large population, a smooth curve is easier to work with
  than a histogram.
Numericals for Practice
1. The box plot below was constructed from a collection of times taken to run a
100 m sprint. Using the box plot, determine the range and interquartile range.
For categorical data, simple proportions or percentages tell the story of the
data.
Getting a summary of a binary variable or a categorical variable with a few
categories is a fairly easy matter: we just figure out the proportion of 1s, or the
proportions of the important categories.
Bar charts are a common visual tool for displaying a single categorical variable.
Categories are listed on the x-axis, and frequencies or proportions on the y-axis.
A bar chart resembles a histogram.
In a bar chart the x-axis represents different categories of a factor variable,
while in a histogram the x-axis represents values of a single variable on a
numeric scale.
In a histogram, the bars are typically shown touching each other, with gaps
indicating values that did not occur in the data.
In a bar chart, the bars are shown separate from one another.
Histograms and bar charts are similar, except that the categories on the x-axis
in the bar chart are not ordered.
Converting numeric data to categorical data is an important and widely used
step in data analysis since it reduces the complexity (and size) of the data.
This aids in the discovery of relationships between features, particularly at
the initial stages of an analysis.
Expected Value
A marketer for a new cloud technology, for example, offers two levels of
service, one priced at Rs.300/month and another at Rs.50/month. The
marketer offers free webinars to generate leads, and the firm figures that 5% of
the attendees will sign up for the Rs.300 service, 15% will sign up for the Rs.50
service, and 80% will not sign up for anything. This data can be summed up, for
financial purposes, in a single “expected value,” which is a form of weighted
mean, in which the weights are probabilities.
In the cloud service example, the expected value of a webinar attendee is thus
Rs.22.50 per month, calculated as follows:
EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5
The expected value is really a form of weighted mean: it adds the ideas of
future expectations and probability weights, often based on subjective
judgment.
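The expected-value calculation is a probability-weighted sum, which takes only a few lines:

```python
# Expected value of a webinar attendee per month (cloud service example)
outcomes = [300, 50, 0]        # Rs. per month for each sign-up outcome
probs = [0.05, 0.15, 0.80]     # probability of each outcome (must sum to 1)

ev = sum(p * x for p, x in zip(probs, outcomes))
print(round(ev, 2))  # 22.5
```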
Correlation coefficient:
r = [n Σxy − (Σx)(Σy)] / √([n Σx² − (Σx)²][n Σy² − (Σy)²])
Where,
r = Pearson correlation coefficient
x = values in the first set of data
y = values in the second set of data
n = total number of values.
Like the mean and standard deviation, the correlation coefficient is sensitive to
outliers in the data.
The correlation coefficient measures the extent to which two paired variables
(e.g., height and weight for individuals) are associated with one another.
Numericals for Practice
1. Calculate the correlation coefficient for the following data.
X = 4, 8 ,12, 16 and Y = 5, 10, 15, 20.
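The practice data above can be checked with NumPy's `corrcoef`, which returns the Pearson correlation matrix:

```python
import numpy as np

x = np.array([4, 8, 12, 16])
y = np.array([5, 10, 15, 20])

# Off-diagonal entry of the 2x2 correlation matrix is r
r = np.corrcoef(x, y)[0, 1]
print(r)  # 1.0, since y = 1.25x is an exact positive linear relationship
```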
Scatterplots
A scatter plot uses dots to represent values for two different numeric variables
and is used to indicate the relationship between the variables.
The position of each dot on the horizontal and vertical axis indicates values for
an individual data point.
Identification of correlational relationships is common with scatter plots.
Relationships between variables can be described in many ways: positive or
negative, strong or weak, linear or nonlinear.
Exploring Two or More Variables
Familiar estimators like mean and variance look at variables one at a time
(univariate analysis).
Correlation analysis is an important method that compares two variables
(bivariate analysis).
In this section we look at additional estimates and plots, and at more than two
variables (multivariate analysis).
Like univariate analysis, bivariate analysis involves both computing summary
statistics and producing visual displays.
The appropriate type of bivariate or multivariate analysis depends on the nature
of the data: numeric versus categorical.
Hexagonal Binning and Contours
(Plotting Numeric Versus Numeric Data)
Scatterplots are fine when there is a relatively small number of data values.
For data sets with hundreds of thousands or millions of records, a scatterplot will
be too dense, so we need a different way to visualize the relationship.
Instead of plotting individual data points, which can lead to overplotting in
dense areas, the data space is divided into a grid of hexagons.
The number of points falling into each hexagon is counted, and the hexagons are
then colored according to this count, allowing for a clearer visual representation
of data density.
Figure shows a hexagonal binning plot of the
relationship between the finished square
feet and the tax-assessed value for homes in
a city.
Rather than plotting points, which would
appear as a monolithic dark cloud, we
grouped the records into hexagonal bins and
plotted the hexagons with a color indicating
the number of records in that bin.
Hexagonal Grid: The plot uses a grid of hexagons, which helps to minimize overlap
and provide a more uniform visual structure compared to rectangular grids.
Density Representation: The color or shading of each hexagon represents the
number of data points within it. Darker or more intense colors typically indicate
higher densities of data points.
Overplotting Solution: By aggregating data points within hexagons, hexagonal
binning plots help to reduce the problem of overplotting in areas of high data
density.
Contour Plot
A contour plot is a graphical representation used to show the three-dimensional
relationship between two variables (usually on the x and y axes) and a third
variable (represented by contour lines or color gradients) on a two-dimensional
plane.
Contour plots display the relationship between two independent variables
and a dependent variable.
The graph shows values of the Z variable for combinations of the X and Y
variables. The X and Y values are displayed along the X and Y-axes, while contour
lines and bands represent the Z value.
The contour lines connect combinations of the X and Y variables that produce
equal values of Z.
Contour plots are particularly helpful
when you need to identify combinations
of X and Y that produce beneficial or
required values of Z.
The contour lines and bands make it easy
to find combinations that yield the
values you need.
Two Categorical Variables
Consider a contingency table in which the two categorical variables are gender
and ice cream flavor preference.
This is a two-way table (2 X 3) where each cell represents the number of times
males and females prefer a particular ice cream flavor.
If there is a relationship between ice cream preference and gender, we’d expect
the conditional distribution of flavors in the two gender rows to differ.
From the contingency table, females are more likely to prefer chocolate (37 vs.
21), while males prefer vanilla (32 vs. 12).
Both genders have an equal preference for strawberry.
Overall, the two-way table suggests that males and females have different ice
cream preferences.
The Total column indicates the researchers surveyed 66 females and 71 males.
Because we have roughly equal numbers, we can compare the raw counts
directly. However, when you have unequal groups, use percentages to compare
them.
Row Percentage: Take a cell value and divide by the cell’s row total.
Column Percentage: Take a cell value and divide by the cell’s column total.
For example, the row percentage of females who prefer chocolate is simply the
number of observations in the Female/Chocolate cell divided by the row total for
women: 37 / 66 = 56%.
The column percentage for the same cell is the frequency of the
Female/Chocolate cell divided by the column total for chocolate: 37 / 58 =
63.8%.
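The row and column percentages can be computed from the table with pandas (the chocolate and vanilla counts come from the text; the strawberry counts of 17 and 18 are inferred from the stated row totals of 66 and 71):

```python
import pandas as pd

# Contingency table: counts of flavor preference by gender
table = pd.DataFrame(
    {"Chocolate": [37, 21], "Vanilla": [12, 32], "Strawberry": [17, 18]},
    index=["Female", "Male"],
)

row_pct = table.div(table.sum(axis=1), axis=0) * 100  # each cell / its row total
col_pct = table.div(table.sum(axis=0), axis=1) * 100  # each cell / its column total

print(round(row_pct.loc["Female", "Chocolate"], 1))  # 56.1
print(round(col_pct.loc["Female", "Chocolate"], 1))  # 63.8
```

With real survey data, `pd.crosstab(df["gender"], df["flavor"], normalize="index")` builds the same row proportions directly from the raw records.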
A bar chart can be used to display a contingency table.
The following clustered bar chart shows the row percentages for the previous
two-way table.
This bar chart reiterates our conclusions from the contingency table.
Women in this sample prefer chocolate, men favor vanilla, and both genders
have an equal preference for strawberry.
Categorical and Numeric Data