
University of Mumbai

Program – Bachelor of Engineering in Computer Science and Engineering
(Artificial Intelligence and Machine Learning)

Class – T.E.
Course Code – CSDLO5011
Course Name – Statistics for Artificial Intelligence Data Science

By
Prof. A.V.Phanse
Exploratory Data Analysis

 Classical statistics focused exclusively on inference, a complex set of procedures for drawing conclusions about large populations based on small samples.

 In 1962, John W. Tukey proposed a new scientific discipline called data analysis
that included statistical inference as just one component.

 Tukey presented simple plots (e.g., boxplots, scatterplots) that, along with
summary statistics (mean, median, quantiles etc.), help paint a picture of a
data set.

John Tukey
Elements of Structured Data

 Data comes from many sources: sensor measurements, events, text, images, and
videos etc.

 Much of this data is unstructured: images are a collection of pixels, with each
pixel containing RGB (red, green, blue) color information. Texts are sequences of
words and nonword characters, often organized by sections, subsections, and so
on. Clickstreams are sequences of actions by a user interacting with an app or a
web page.

 A major challenge of data science is to harness this raw data into actionable
information.

 To apply the statistical concepts, unstructured raw data must be processed and
manipulated into a structured form.

 One of the common forms of structured data is a table with rows and columns
 There are two basic types of structured data: numeric and categorical.

 Numeric data comes in two forms: continuous, such as wind speed or time
duration, and discrete, such as the count of the occurrence of an event.

 Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Maharashtra, Goa, etc.).

 Binary data is an important special case of categorical data that takes on only
one of two values, such as 0/1, yes/no, or true/false.

 Another useful type of categorical data is ordinal data in which the categories
are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
Rectangular Data
 Rectangular data is the general term for a two-dimensional matrix with rows
indicating records (cases) and columns indicating features (variables).

 The data doesn’t always start in this form: unstructured data must be processed
and manipulated so that it can be represented as a set of features in the
rectangular data.

 Data in relational databases must be extracted and put into a single table for
most data analysis and modeling tasks.

 A data frame is the specific format for rectangular data in R and Python.

Feature –
A column within a table is commonly referred to as a feature.

Record –
A row within a table is commonly referred to as a record.
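As a minimal sketch of how rectangular data looks in code, a pandas data frame holds records as rows and features as columns (the column names and values here are illustrative, not from the slides):

```python
import pandas as pd

# Each row is a record; each column is a feature.
df = pd.DataFrame({
    "state": ["Maharashtra", "Goa", "Kerala"],   # categorical feature
    "population_millions": [112.4, 1.5, 33.4],   # numeric (continuous) feature
})
print(df.shape)  # (3, 2): 3 records, 2 features
```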
Nonrectangular Data Structures
 There are other data structures besides rectangular data.
 Time series data records successive measurements of the same variable. It is the
raw material for statistical forecasting methods, and it is also a key component
of the data produced by devices—the Internet of Things.

 Spatial data structures, which are used in mapping and location analytics, are
more complex and varied than rectangular data structures.

 In the object representation, the focus of the data is an object (e.g., a house)
and its spatial coordinates. The field view, by contrast, focuses on small units of
space and the value of a relevant metric (e.g. pixel brightness).

 Graph (or network) data structures are used to represent physical, social, and
abstract relationships. For example, a graph of a social network, such as
Facebook or LinkedIn, may represent connections between people on the
network.

 Distribution hubs connected by roads are an example of a physical network.


Each of these data types has its specialized methodology in data science.
Estimates of Location

 Variables with measured data might have thousands of distinct values.


 A basic step in exploring your data is getting a “typical value” for each variable.
It is an estimate of where most of the data is located (i.e. its central tendency).

Mean –

 The most basic estimate of location is the mean, or average value.


 The mean is the sum of all values divided by the number of values.
 Consider the following set of numbers: {3 5 1 2}.
 The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75.
 You will encounter the symbol x̄, pronounced “x-bar”, being used to represent the mean of a sample from a population:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

 N (or n) refers to the total number of records or observations. In statistics it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population.
For calculating the mean when the frequency of the observations is given, such that x1, x2, x3, ..., xn are the recorded observations and f1, f2, f3, ..., fn are the respective frequencies of the observations, then:

$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$

For calculating the mean of grouped data, we calculate the class mark. For this, the midpoints of the class intervals are calculated as:

$\text{class mark} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$
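A short sketch of both calculations in Python; the frequency-table values and frequencies below are made up for illustration:

```python
import numpy as np

x = np.array([3, 5, 1, 2])
print(x.mean())  # (3 + 5 + 1 + 2) / 4 = 2.75

# Mean from a frequency table: sum(f_i * x_i) / sum(f_i)
values = np.array([10, 20, 30])  # hypothetical observations x_i
freqs = np.array([2, 5, 3])      # hypothetical frequencies f_i
print((freqs * values).sum() / freqs.sum())  # 210 / 10 = 21.0
```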

Trimmed Mean –
 A variation of the mean is a trimmed mean, which you calculate by dropping a
fixed number of sorted values at each end and then taking an average of the
remaining values.

 Representing the sorted values by x(1), x(2), ..., x(n), where x(1) is the smallest value and x(n) the largest, the formula to compute the trimmed mean with the p smallest and p largest values omitted is:

$\bar{x} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}$
 A trimmed mean eliminates the influence of extreme values.

 For example, in a dance competition the top score and bottom score from five
judges are dropped, and the final score is the average of the scores from the
three remaining judges.
 This makes it difficult for a single judge to manipulate the score, perhaps to favor a particular contestant.
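A quick sketch of the judging example using SciPy; the five scores are hypothetical:

```python
from scipy import stats

scores = [4.5, 7.0, 7.5, 8.0, 9.8]  # hypothetical scores from five judges
# proportiontocut=0.2 drops the lowest and highest 20% (one score each here)
print(stats.trim_mean(scores, 0.2))  # mean of 7.0, 7.5, 8.0 = 7.5
```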

Weighted mean –

 A weighted mean is calculated by multiplying each data value xi by a user-specified weight wi and dividing their sum by the sum of the weights.

 The formula for a weighted mean is:

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

 Some values are intrinsically more variable than others, and highly variable
observations are given a lower weight.

 For example, if we are taking the average from multiple sensors and one of the
sensors is less accurate, then we might downweight the data from that sensor.
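The sensor example as a sketch with NumPy; the readings and weights are made up:

```python
import numpy as np

readings = np.array([10.2, 9.8, 30.0])  # hypothetical sensor readings
weights = np.array([1.0, 1.0, 0.1])     # downweight the less accurate sensor
print(np.average(readings, weights=weights))  # 23.0 / 2.1 ≈ 10.95
```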
In a class of 30 students, the marks obtained by students in mathematics out of 50 are tabulated in a frequency table (table not reproduced). Applying the frequency-mean formula above to that table, the mean of the given data is 34.


Median and Robust Estimates

 The median is the middle number on a sorted list of the data.

 If there is an even number of data values, the middle value is one that is not
actually in the data set, but rather the average of the two values that divide the
sorted data into upper and lower halves.

 Compared to the mean, which uses all observations, the median depends only
on the values in the center of the sorted data.

 While this might seem to be a disadvantage, since the mean is much more
sensitive to the data, there are many instances in which the median is a better
metric for location.

 The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases) that could skew the results.

 An outlier is any value that is very distant from the other values in a data set.
 Being an outlier does not in itself make a data value invalid. Still, outliers are often the result of data errors such as mixing data of different units (kilometers versus meters) or bad readings from a sensor.

 When outliers are the result of bad data, the mean will result in a poor estimate
of location, while the median will still be valid.

 In any case, outliers should be identified and are usually worthy of further
investigation.

 The median is not the only robust estimate of location. In fact, a trimmed mean
is widely used to avoid the influence of outliers.

 For example, trimming the bottom and top 10% of the data will provide
protection against outliers in all but the smallest data sets.

 The trimmed mean can be thought of as a compromise between the median and the mean: it is robust to extreme values in the data, but uses more data to calculate the estimate for location.
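A minimal sketch contrasting the two estimates on a small made-up data set with one outlier:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 100])  # 100 is an outlier
print(np.mean(data))    # 22.0 -- pulled far upward by the outlier
print(np.median(data))  # 3.0  -- unaffected by the outlier
```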
Estimates of Variability

 Location is just one dimension in summarizing a feature.

 A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

 At the heart of statistics lies variability: measuring it, reducing it, distinguishing
random from real variability, identifying the various sources of real variability,
and making decisions in the presence of it.

Standard Deviation and Related Estimates

 The most widely used estimates of variation are based on the differences, or
deviations, between the estimate of location and the observed data.
 For a set of data {1, 4, 4}, the mean is 3 and the median is 4.
 The deviations from the mean are the differences:
1 – 3 = –2, 4 – 3 = 1, 4 – 3 = 1.
 These deviations tell us how dispersed the data is around the central value.
 One way to measure variability is to estimate a typical value for these
deviations.
 Averaging the deviations themselves would not tell us much. The negative
deviations offset the positive ones.
 In fact, the sum of the deviations from the mean is precisely zero.
 Instead, a simple approach is to take the average of the absolute values of the
deviations from the mean.
 In the preceding example, the absolute value of the deviations is {2 1 1}, and
their average is (2 + 1 + 1) / 3 = 1.33.
 This is known as the mean absolute deviation and is computed with the formula:

$\text{mean absolute deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

where $\bar{x}$ is the sample mean.

The best-known estimates of variability are the variance and the standard deviation, which are based on squared deviations.

Variance
The variance is an average of the squared deviations (conventionally divided by n − 1 for a sample), and the standard deviation is the square root of the variance:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}$

 The standard deviation is much easier to interpret than the variance since it is
on the same scale as the original data.
 Neither the variance, the standard deviation, nor the mean absolute deviation is robust to outliers and extreme values.
 The variance and standard deviation are especially sensitive to outliers since
they are based on the squared deviations.

A robust estimate of variability is the median absolute deviation from the median, or MAD:

$\text{MAD} = \text{median}\left(\,|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|\,\right)$

where m is the median. Like the median, the MAD is not influenced by extreme values.
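A sketch computing all three estimates for the {1, 4, 4} example with NumPy:

```python
import numpy as np

x = np.array([1, 4, 4])
print(np.abs(x - x.mean()).mean())   # 1.33..., mean absolute deviation
print(x.var(ddof=1), x.std(ddof=1))  # 3.0 and 1.73..., sample variance and std
m = np.median(x)
print(np.median(np.abs(x - m)))      # 0.0, median absolute deviation (MAD)
```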
Estimates Based on Percentiles

 A different approach to estimating dispersion is based on looking at the spread of the sorted data. Statistics based on sorted (ranked) data are referred to as order statistics.

 The most basic measure is the range: the difference between the largest and
smallest numbers.

 The minimum and maximum values themselves are useful to know and are
helpful in identifying outliers.

 But the range is extremely sensitive to outliers and not very useful as a general measure of dispersion in the data.
 To avoid the sensitivity to outliers, we can look at the range of the data after
dropping values from each end. Formally, these types of estimates are based on
differences between percentiles.

Quartiles:
Data can be divided into four regions that cover the total range of observed values.
Cut points for these regions are known as quartiles.

In notation, the q-th quartile of a data set is the (q(n+1)/4)-th observation of the sorted data, where q is the desired quartile and n is the number of observations.

The first quartile (Q1) is the first 25% of the data.


The second quartile (Q2) is between the 25th and 50th percentage points in the
data.
The upper bound of Q2 is the median.
The third quartile (Q3) is the 25% of the data lying between the median and
the 75% cut point in the data.

Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.
An example with 15 numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

The first quartile is Q1 = 11.
The second quartile is Q2 = 40 (this is also the median).
The third quartile is Q3 = 61.

In the above example, Q1 = (1 × (15 + 1)/4) = 4th observation of the sorted data. The 4th observation is 11, so Q1 of this data is 11.

Inter-quartile Range:
The difference between Q3 and Q1.
The inter-quartile range of the previous example is 61 − 11 = 50.
The middle half of the ordered data lies between 11 and 61.
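A sketch verifying these quartiles with NumPy. Note that method="weibull" matches the (n+1)/4 positioning rule used above (available in NumPy 1.22+); NumPy's default interpolation would give slightly different values:

```python
import numpy as np

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = np.percentile(data, [25, 50, 75], method="weibull")
print(q1, q2, q3)  # 11.0 40.0 61.0
print(q3 - q1)     # 50.0, the inter-quartile range
```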
Deciles:
If data is ordered and divided into 10 parts, then cut points are called Deciles

Percentiles:
If data is ordered and divided into 100 parts, then the cut points are called percentiles.
The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th percentile of the data is Q3.
In notation, the p-th percentile of a data set is the (p(n+1)/100)-th observation of the sorted data, where p is the desired percentile and n is the number of observations.

Coefficient of Variation:

The standard deviation of the data divided by its mean. It is usually expressed in percent:

$\text{Coefficient of Variation} = \frac{s}{\bar{x}} \times 100$
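A one-line sketch on made-up data:

```python
import numpy as np

x = np.array([10, 12, 8, 14, 6])       # hypothetical data
print(x.std(ddof=1) / x.mean() * 100)  # coefficient of variation ≈ 31.6%
```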
Q.1 The scores for some candidates in a test are 40, 45, 49, 53, 61, 65, 71, 79, 85, 91.
What will be the percentile for the score 71?

Six of the ten scores lie below 71, so the percentile rank of 71 is (6/10) × 100 = 60.

Thus, the percentile of 71 will be 60.


Q.2: The scores for some candidates in a test are 40, 45, 49, 53, 61, 65, 71, 79, 85,
91.
What will be the score with a percentile value of 90?
Numericals for Practice

1. The mean of 6, 8, x + 2, 10, 2x − 1, and 2 is 9.
Find the value of x and also the values of the unknown observations in the data.

2. The runs scored in a cricket match by 11 players are as follows:
7, 16, 121, 51, 101, 81, 1, 16, 9, 11, 16
Find the mean, mode, and median of this data.

3. The mean of the following distribution is 26 (distribution table not reproduced). Find the value of p and also the value of the observation.
Also, find the mode of the given data.

4. If a die is rolled, then find the variance and standard deviation of the possibilities.

5. Find the standard deviation of the average temperatures recorded over a five-day period last winter: 18, 22, 19, 25, 12.
6. A survey of 36 students of a class was done to find out the mode of transport
used by them while commuting to the school. The collected data is shown in the
table given below. Represent the data in the form of a bar graph.

7. Construct a frequency distribution table for the following weights (in gm) of 30
oranges using the equal class intervals, one of them is 40-45 (45 not included).
The weights are: 31, 41, 46, 33, 44, 51, 56, 63, 71, 71, 62, 63, 54, 53, 51, 43, 36, 38,
54, 56, 66, 71, 74, 75, 46, 47, 59, 60, 61, 63.
(a) What is the class mark of the class intervals 50-55?
(b) What is the range of the above weights?
(c) How many class intervals are there?
(d) Which class interval has the lowest frequency?
Exploring the Data Distribution

 Each of the estimates we’ve covered sums up the data in a single number to
describe the location or variability of the data.
 It is also useful to explore how the data is distributed overall.

Percentiles and Boxplots

 Boxplots, introduced by Tukey (1977), are based on percentiles and give a quick
way to visualize the distribution of data.
 From this boxplot we can immediately see that the
median state population is about 5 million, half the
states fall between about 2 million and about 7
million, and there are some high population outliers.
 The top and bottom of the box are the 75th and
25th percentiles, respectively.
 The median is shown by the horizontal line in the
box.
 The dashed lines, referred to as whiskers, extend from the top and bottom of the box to indicate the range for the bulk of the data.

Figure: Boxplot of state populations
Box Plot:

 A box plot is a graph of the five-number summary.


 The central box spans the quartiles.
 A line within the box marks the median.
 Lines extending above and below the box mark the smallest and the largest
observations (i.e., the range).
 Outlying samples may be additionally plotted outside the range.
Frequency Tables and Histograms

 A frequency table of a variable divides up the variable range into equally spaced segments and tells us how many values fall within each segment.

 A frequency table shows how many times each value appears in a dataset.

 It’s a useful way to organize and summarize data. However, it can be harder to
see patterns in the data just by looking at a frequency table.

 A histogram shows the frequency of each value in a dataset using bars.

 Each bar represents a range of values, and the height of the bar represents the
frequency of values in that range.

 This can make it easier to see patterns in the data, such as if the data is evenly
distributed or skewed to one side.
Suppose we collect the exam scores of 20 students in some class:

Scores: 50, 58, 62, 65, 70, 71, 72, 74, 74, 78, 81, 82, 82, 85, 87, 88, 89, 92, 94, 96

Figure: Frequency table (left) and histogram (right) of the scores

The x-axis of the histogram displays bins of data values and the y-axis tells us how
many observations in a dataset fall in each bin.
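A sketch plotting a histogram of these scores with Matplotlib; the bin edges of width 10 are an assumption:

```python
import matplotlib.pyplot as plt

scores = [50, 58, 62, 65, 70, 71, 72, 74, 74, 78,
          81, 82, 82, 85, 87, 88, 89, 92, 94, 96]
plt.hist(scores, bins=range(50, 101, 10), edgecolor="black")
plt.xlabel("Score bin")
plt.ylabel("Number of students")
plt.show()
```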
Density Plots and Estimates
 A density plot shows the distribution of data values as a continuous line.
 A density plot can be thought of as a smoothed histogram, although it is typically
computed directly from the data through a kernel density estimate.
 A density plot corresponds to plotting the histogram as a proportion rather than
counts.
 Note that the total area under the density curve = 1 or 100 %, and instead of
counts in bins you calculate areas under the curve between any two points on the
x-axis, which correspond to the proportion of the distribution lying between
those two points.
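A sketch of a kernel density estimate for the same exam scores, using SciPy's gaussian_kde:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

scores = [50, 58, 62, 65, 70, 71, 72, 74, 74, 78,
          81, 82, 82, 85, 87, 88, 89, 92, 94, 96]
kde = gaussian_kde(scores)   # smoothed density estimate from the data
xs = np.linspace(40, 110, 200)
plt.plot(xs, kde(xs))        # total area under this curve is 1
plt.xlabel("Score")
plt.ylabel("Density")
plt.show()
```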
Advantages of Density Plots

 Density plots give the actual distribution of data values without considering irregularities and outliers.
 In a density plot, there is no limitation on the number of intervals; the more the intervals, the better the visualization of the data.
 When working with a large population, a smooth curve is easier to work with than a histogram.
Numericals for Practice
1. The box plot below was constructed from a collection of times taken to run a
100 m sprint. Using the box plot, determine the range and interquartile range.

2. The histogram for a frequency distribution is given below.


Answer the following.
(i) What is the frequency of the class interval 15 – 20?
(ii) Which class interval has the greatest frequency?
(iii) What is the cumulative frequency of the class interval
25 – 30?
(iv) Construct a short frequency table of the distribution.
(v) Construct a cumulative frequency table of the
distribution.
Exploring Binary and Categorical Data

 For categorical data, simple proportions or percentages tell the story of the
data.
 Getting a summary of a binary variable or a categorical variable with a few
categories is a fairly easy matter: we just figure out the proportion of 1s, or the
proportions of the important categories.

 Bar charts are a common visual tool for displaying a single categorical variable. Categories are listed on the x-axis, and frequencies or proportions on the y-axis.
 A bar chart resembles a histogram.
 In a bar chart the x-axis represents different categories of a factor variable,
while in a histogram the x-axis represents values of a single variable on a
numeric scale.
 In a histogram, the bars are typically shown touching each other, with gaps
indicating values that did not occur in the data.
 In a bar chart, the bars are shown separate from one another.

 Histograms and bar charts are similar, except that the categories on the x-axis
in the bar chart are not ordered.
 Converting numeric data to categorical data is an important and widely used
step in data analysis since it reduces the complexity (and size) of the data.
 This aids in the discovery of relationships between features, particularly at
the initial stages of an analysis.
Expected Value

 A special type of categorical data is data in which the categories represent or


can be mapped to discrete values on the same scale.

 A marketer for a new cloud technology, for example, offers two levels of
service, one priced at Rs.300/month and another at Rs.50/month. The
marketer offers free webinars to generate leads, and the firm figures that 5% of
the attendees will sign up for the Rs.300 service, 15% will sign up for the Rs.50
service, and 80% will not sign up for anything. This data can be summed up, for
financial purposes, in a single “expected value,” which is a form of weighted
mean, in which the weights are probabilities.

 The expected value is calculated as follows:


1. Multiply each outcome by its probability of occurrence.
2. Sum these values.

 In the cloud service example, the expected value of a webinar attendee is thus Rs.22.50 per month, calculated as follows:
EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5
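The same calculation as a sketch in Python:

```python
outcomes = [300, 50, 0]     # monthly value of each outcome (Rs.)
probs = [0.05, 0.15, 0.80]  # probability of each outcome
ev = sum(p * v for p, v in zip(probs, outcomes))
print(ev)  # 22.5
```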
 The expected value is really a form of weighted mean: it adds the ideas of
future expectations and probability weights, often based on subjective
judgment.

 Expected value is a fundamental concept in business valuation and capital


budgeting—for example, the expected value of five years of profits from a new
acquisition, or the expected cost savings from new patient management
software at a clinic.
Correlation

 Exploratory data analysis involves examining correlation among variables.


 Variables X and Y (each with measured data) are said to be positively correlated
if high values of X go with high values of Y, and low values of X go with low
values of Y.
 If high values of X go with low values of Y, and vice versa, the variables are
negatively correlated.
 More useful is a standardized variant: the correlation coefficient, which gives an
estimate of the correlation between two variables that always lies on the same
scale.

 Correlation coefficient (Pearson):

$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$

Where,
r = Pearson correlation coefficient
x = values in the first set of data
y = values in the second set of data
n = total number of values.

 The correlation coefficient always lies between +1 (perfect positive correlation)


and –1 (perfect negative correlation); 0 indicates no correlation.

 Like the mean and standard deviation, the correlation coefficient is sensitive to
outliers in the data.

 The correlation coefficient measures the extent to which two paired variables
(e.g., height and weight for individuals) are associated with one another.
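A sketch computing r on a small made-up paired sample with NumPy:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])   # hypothetical paired data
y = np.array([2, 1, 4, 3, 5])
print(np.corrcoef(x, y)[0, 1])  # 0.8: a fairly strong positive correlation
```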
Numericals for Practice
1. Calculate the correlation coefficient for the following data.
X = 4, 8 ,12, 16 and Y = 5, 10, 15, 20.
Scatterplots
 A scatter plot uses dots to represent values for two different numeric variables and is used to indicate the relationship between them.
 The position of each dot on the horizontal and vertical axis indicates the values for an individual data point.
 Scatter plots are commonly used to identify correlational relationships.
 Relationships between variables can be described in many ways: positive or
negative, strong or weak, linear or nonlinear.
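A minimal scatterplot sketch with simulated positively related variables; all values below are synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)                    # synthetic heights (cm)
weight = 0.9 * height - 90 + rng.normal(0, 5, 100)   # positively related weights (kg)
plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()
```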
Exploring Two or More Variables

 Familiar estimators like mean and variance look at variables one at a time
(univariate analysis).
 Correlation analysis is an important method that compares two variables
(bivariate analysis).
 In this section we look at additional estimates and plots, and at more than two
variables (multivariate analysis).
 Like univariate analysis, bivariate analysis involves both computing summary
statistics and producing visual displays.
 The appropriate type of bivariate or multivariate analysis depends on the nature
of the data: numeric versus categorical.
Hexagonal Binning and Contours
(Plotting Numeric Versus Numeric Data)

Hexagonal Binning Plot

 Scatterplots are fine when there is a relatively small number of data values.

 For data sets with hundreds of thousands or millions of records, a scatterplot will
be too dense, so we need a different way to visualize the relationship.

 A hexagonal binning plot is a type of data visualization used to display the


distribution and density of large datasets, especially those with two continuous
variables.

 Instead of plotting individual data points, which can lead to overplotting in dense areas, the data space is divided into a grid of hexagons.

 The number of points falling into each hexagon is counted, and the hexagons are
then colored according to this count, allowing for a clearer visual representation
of data density.
 The figure shows a hexagonal binning plot of the relationship between the finished square feet and the tax-assessed value for homes in a city.
 Rather than plotting points, which would appear as a monolithic dark cloud, we grouped the records into hexagonal bins and plotted the hexagons with a color indicating the number of records in that bin.

Key Features of Hexagonal Binning Plots:

Hexagonal grid: The plot uses a grid of hexagons, which helps to minimize overlap and provide a more uniform visual structure compared to rectangular grids.
Density representation: The color or shading of each hexagon represents the number of data points within it. Darker or more intense colors typically indicate higher densities of data points.
Overplotting solution: By aggregating data points within hexagons, hexagonal binning plots reduce the problem of overplotting in areas of high data density.
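A sketch of a hexagonal binning plot with Matplotlib's hexbin on synthetic home data; all numbers are made up:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sqft = rng.normal(1800, 400, 100_000)                # synthetic finished square feet
value = sqft * 120 + rng.normal(0, 30_000, 100_000)  # synthetic tax-assessed values
plt.hexbin(sqft, value, gridsize=40, cmap="Blues")   # color encodes records per bin
plt.colorbar(label="Records per bin")
plt.xlabel("Finished square feet")
plt.ylabel("Tax-assessed value")
plt.show()
```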
Contour Plot
 A contour plot is a graphical representation used to show the three-dimensional
relationship between two variables (usually on the x and y axes) and a third
variable (represented by contour lines or color gradients) on a two-dimensional
plane.
 Contour plots display the relationship between two independent variables and a dependent variable.
 The graph shows values of the Z variable for combinations of the X and Y
variables. The X and Y values are displayed along the X and Y-axes, while contour
lines and bands represent the Z value.
 The contour lines connect combinations of the X and Y variables that produce
equal values of Z.
 Contour plots are particularly helpful when you need to identify combinations of X and Y that produce beneficial or required values of Z.
 The contour lines and bands make it easy to find combinations that yield the values you need.
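A sketch of a contour plot over a synthetic Z surface:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))  # synthetic dependent variable over the X-Y plane
cs = plt.contour(X, Y, Z, levels=8)
plt.clabel(cs)              # label each contour line with its Z value
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```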
Two Categorical Variables

 A useful way to summarize two categorical variables is a contingency table—a


table of counts by category.
 A contingency table displays frequencies for combinations of two categorical
variables.
 Contingency tables classify outcomes for one variable in rows and the other in
columns. The values at the row and column intersections are frequencies for
each unique combination of the two variables.

Flavor preference by gender (counts; the strawberry entries follow from the row totals cited below):

            Chocolate   Vanilla   Strawberry   Total
Female          37         12         17         66
Male            21         32         18         71
Total           58         44         35        137

 In the contingency table above, the two categorical variables are gender and ice cream flavor preference.
 This is a two-way table (2 × 3) where each cell represents the number of times males and females prefer a particular ice cream flavor.
 If there is a relationship between ice cream preference and gender, we’d expect
the conditional distribution of flavors in the two gender rows to differ.
 From the contingency table, females are more likely to prefer chocolate (37 vs.
21), while males prefer vanilla (32 vs. 12).
 Both genders have an equal preference for strawberry.
 Overall, the two-way table suggests that males and females have different ice
cream preferences.
 The Total column indicates the researchers surveyed 66 females and 71 males.
Because we have roughly equal numbers, we can compare the raw counts
directly. However, when you have unequal groups, use percentages to compare
them.
Row Percentage: Take a cell value and divide by the cell’s row total.
Column Percentage: Take a cell value and divide by the cell’s column total.

 For example, the row percentage of females who prefer chocolate is simply the
number of observations in the Female/Chocolate cell divided by the row total for
women: 37 / 66 = 56%.
 The column percentage for the same cell is the frequency of the
Female/Chocolate cell divided by the column total for chocolate: 37 / 58 =
63.8%.
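A sketch computing the row and column percentages with pandas; the counts are taken from the table above, with the strawberry column derived from the row totals:

```python
import pandas as pd

table = pd.DataFrame(
    {"Chocolate": [37, 21], "Vanilla": [12, 32], "Strawberry": [17, 18]},
    index=["Female", "Male"],
)
row_pct = table.div(table.sum(axis=1), axis=0) * 100  # each cell / its row total
col_pct = table.div(table.sum(axis=0), axis=1) * 100  # each cell / its column total
print(row_pct.round(1))  # Female/Chocolate = 37/66 ≈ 56.1%
print(col_pct.round(1))  # Female/Chocolate = 37/58 ≈ 63.8%
```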
 A bar chart can be used to display a contingency table.
 The following clustered bar chart shows the row percentages for the previous
two-way table.

 This bar chart reiterates our conclusions from the contingency table.
 Women in this sample prefer chocolate, men favor vanilla, and both genders
have an equal preference for strawberry.
Categorical and Numeric Data

 Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.
 For example, we might want to compare how the percentage of flight delays
varies across airlines. Figure shows the percentage of flights in a month that
were delayed where the delay was within the carrier’s control:

 Alaska stands out as having the fewest delays, while American has the most delays: the lower quartile for American is higher than the upper quartile for Alaska.
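A sketch of boxplots grouped by a categorical variable, using made-up per-day delay percentages for three carriers:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "carrier": ["Alaska"] * 4 + ["American"] * 4 + ["Delta"] * 4,
    "pct_delay": [2, 3, 4, 5, 8, 10, 12, 15, 4, 6, 7, 9],  # hypothetical values
})
df.boxplot(column="pct_delay", by="carrier")  # one boxplot per carrier
plt.ylabel("% of delayed flights")
plt.show()
```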
Violin Plot

 A violin plot is an enhancement to the boxplot.


 It plots the density estimate with the density on the y-axis.
 The density is mirrored and flipped over, and the resulting shape is filled in,
creating an image resembling a violin.
 The advantage of a violin plot is that it can show slight differences in the
distribution that aren’t recognizable in a boxplot. On the other hand, the
boxplot more clearly shows the outliers in the data.
 The advantage of using a violin plot over a boxplot is that, in addition to the median and interquartile range, it also provides a visual presentation of the distribution and density of the data.
 This is useful when you want to see the shape of the data distribution.
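A sketch of a violin plot with Matplotlib on two synthetic groups:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
groups = [rng.normal(5, 1, 200), rng.normal(8, 2, 200)]  # synthetic groups
plt.violinplot(groups, showmedians=True)  # mirrored density plus median marker
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Value")
plt.show()
```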
Thank You…
