Unit - I Introduction to Data Analytics
1.1 Data Analytics:
Data analytics converts raw data into actionable insights. It includes a range
of tools, technologies, and processes used to find trends and solve problems
by using data. Data analytics can shape business processes, improve
decision-making, and foster business growth.
Descriptive analysis, for instance, helps describe or summarize quantitative
data by presenting statistics. Descriptive statistical analysis could show the
sales distribution across a group of employees and the average sales figure
per employee, answering the question, "What happened?"
Data analytics is an important field that involves the process of collecting,
processing, and interpreting data to uncover insights and help in making
decisions. Data analytics is the practice of examining raw data to identify
trends, draw conclusions, and extract meaningful information. This
involves various techniques and tools to process and transform data into
valuable insights that can be used for decision-making.
NEED OF DATA ANALYTICS:
The use of data analytics in product development gives a reliable understanding
of future requirements. The company can understand the current market
situation of the product and use analytic techniques to develop new products
as per market requirements. The ability to make data-driven decisions can
give organizations a competitive edge in their markets. Data analysts are
essential for leveraging the power of data: they turn data into meaningful
insights that can drive better decision-making.
Importance of Data Analytics
Data analytics is important because it helps businesses optimize their
performance. Implementing it into the business model means companies can
reduce costs by identifying more efficient ways of doing business and by
storing large amounts of data.
Improved Decision-Making – If we have supporting data in favor of a
decision, we can implement it with a higher probability of success. For
example, if a certain decision or plan has led to better outcomes, there will
be no doubt about implementing it again.
Better Customer Service – Churn modeling is the best example of this, in
which we try to predict or identify what leads to customer churn and change
those things accordingly, so that customer attrition stays as low as possible,
which is a critical factor for any organization.
Efficient Operations – Data analytics can help us understand what the
situation demands and what should be done to get better results, so that we
can streamline our processes, which in turn leads to efficient operations.
Effective Marketing – Market segmentation techniques are implemented to
target this important factor, helping us find the marketing techniques that
increase sales and lead to effective marketing strategies.
DATA ANALYTICS
Analytics is the discovery and communication of meaningful patterns in
data. Especially valuable in areas rich with recorded information, analytics
relies on the simultaneous application of statistics, computer programming,
and operations research to quantify performance. Analytics often favors data
visualization to communicate insight. Firms commonly apply analytics to
business data to describe, predict, and improve business performance.
Areas within analytics include predictive analytics, enterprise decision
management, etc. Since analytics can require extensive computation (because
of big data), its algorithms and software harness the most current methods in
computer science. Data analytics aims to get actionable insights resulting in
smarter decisions and better business outcomes. It is critical to design and
build a data warehouse or Business Intelligence (BI) architecture that provides
a flexible, multi-faceted analytical ecosystem, optimized for efficient
ingestion and analysis of large and diverse data sets.
What is Data Analytics? In this new digital world, data is being generated in
an enormous amount, which opens new paradigms. As we have high
computing power as well as a large amount of data, we can use this data for
data-driven decision-making. The main benefit of data-driven decisions is
that they are made by observing past trends that have produced beneficial
results. In short, data analytics is the process of manipulating data to extract
useful trends and hidden patterns that help us derive valuable insights to
make business predictions.
1.2 TYPES / CATEGORIES / MODELS OF DATA ANALYTICS
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses
data to determine the probable outcome of an event or the likelihood of a
situation occurring. Predictive analytics draws on a variety of statistical
techniques from modeling, machine learning, data mining, and game theory
that analyze current and historical facts to make predictions about future
events.
Techniques used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic cornerstones of predictive analytics:
Predictive modeling
Decision analysis and optimization
Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into
how to approach future events. It looks at past performance and understands
it by mining historical data to understand the cause of success or failure in
the past. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often
used to classify customers or prospects into groups. Unlike a predictive
model that focuses on predicting the behavior of a single customer,
descriptive analytics identifies many different relationships between
customers and products.
Common examples of Descriptive analytics are company reports that provide
historic reviews like:
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical
science, business rules, and machine learning to make a prediction and then
suggests decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also
suggesting actions that benefit from the predictions and showing the decision
maker the implications of each decision option.
Prescriptive analytics not only anticipates what will happen and when it will
happen, but also why it will happen. Further, it can suggest decision options
on how to take advantage of a future opportunity or mitigate a future risk,
and illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning
by using analytics to leverage operational and usage data combined with data
on external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a
question or solve a problem. We try to find dependencies and patterns in the
historical data of the particular problem.
For example, companies go for this analysis because it gives great insight
into a problem, and they also keep detailed information at their disposal;
otherwise, data would have to be collected separately for every problem,
which would be very time-consuming. Common techniques used for
Diagnostic Analytics are:
Data discovery
Data mining
Correlations
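As a minimal illustration of the correlation technique, the sketch below (Python with the pandas library) checks how strongly each factor in a small historical dataset moves with sales; the column names and all numbers are hypothetical and used only for demonstration.
import pandas as pd

# Hypothetical historical data for a diagnostic question such as
# "which factors moved together with the drop in sales?" (values are illustrative)
history = pd.DataFrame({
    "sales":     [120, 115, 130, 90, 85, 100],
    "ad_spend":  [20, 19, 22, 12, 11, 15],
    "price":     [9.9, 9.9, 9.5, 11.5, 11.9, 10.5],
    "stockouts": [1, 2, 0, 6, 7, 4],
})

# The correlation matrix shows which variables rise and fall together with sales
print(history.corr()["sales"].sort_values())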
1.3 Life cycle of Data Analytics, Quality and Quantity of data,
Measurement
Data Analytics Lifecycle :
The data analytics lifecycle is designed for Big Data problems and data
science projects. The cycle is iterative, to represent a real project. To address
the distinct requirements for performing analysis on Big Data, a step-by-step
methodology is needed to organize the activities and tasks involved with
acquiring, processing, analyzing, and repurposing data.
● Phase 1: Discovery –
● The data science team learns and investigates the problem.
● Develop context and understanding.
● Come to know about data sources needed and available for the
project.
● The team formulates the initial hypothesis that can be later
tested with data.
● Phase 2: Data Preparation -
● Steps to explore, preprocess, and condition data before modeling
and analysis.
● It requires the presence of an analytic sandbox; the team extracts,
loads, and transforms (ELT) data to get it into the sandbox.
● Data preparation tasks are likely to be performed multiple times
and not in a predefined order.
● Several tools commonly used for this phase are Hadoop, Alpine
Miner, OpenRefine, etc.
● Phase 3: Model Planning -
● The team explores the data to learn about relationships between
variables and subsequently selects key variables and the most
suitable models.
● In this phase, the data science team develops data sets for
training, testing, and production purposes.
● The team builds and executes models based on the work done in
the model planning phase.
● Several tools commonly used for this phase are MATLAB and
STATISTICA.
● Phase 4: Model Building -
● The team develops datasets for testing, training, and production
purposes.
● The team also considers whether its existing tools will suffice for
running the models or whether it needs a more robust environment
for executing the models.
● Free or open-source tools - R and PL/R, Octave, WEKA.
● Commercial tools - MATLAB and STATISTICA.
● Phase 5: Communicate Results -
● After executing the model, the team needs to compare the outcomes
of modeling to the criteria established for success and failure.
● The team considers how best to articulate findings and outcomes to
various team members and stakeholders, taking into account
warnings and assumptions.
● The team should identify key findings, quantify the business value,
and develop a narrative to summarize and convey findings to
stakeholders.
● Phase 6: Operationalize -
● The team communicates the benefits of the project more broadly and
sets up a pilot project to deploy the work in a controlled way before
broadening it to the full enterprise of users.
● This approach enables the team to learn about the performance and
related constraints of the model in a production environment on a
small scale and make adjustments before full deployment.
● The team delivers final reports, briefings, and code.
● Free or open source tools - Octave, WEKA, SQL, MADlib.
Quality and Quantity of data, Measurement
Data quality and quantity are crucial aspects of measurement, influencing the
reliability and validity of research and decision-making. Data quality refers to
the accuracy, completeness, consistency, and timeliness of the data, while
data quantity refers to the amount of data collected.
● Data Quality:
Accuracy: How closely the data reflects the true value or event.
Completeness: The extent to which all necessary information is present.
Consistency: The degree to which data values are uniform and free from
contradictions.
Timeliness: The freshness and relevance of the data.
Relevance: How well the data aligns with the intended purpose of analysis.
Validity: Whether the data is appropriate for the research question or
decision-making process.
Uniqueness: The absence of duplicate or redundant entries.
Integrity: The assurance that the data is accurate and reliable.
● Data Quantity:
Sample Size: The number of observations or participants in a study.
Data Coverage: The extent to which the data represents the population or
phenomenon being studied.
Data Depth: The level of detail and granularity of the data.
● Importance of Data Quality and Quantity:
Reliability: High-quality data ensures the reliability of research findings and
decision-making.
Validity: The validity of conclusions depends on the quality and quantity of
the data used.
Generalizability: Adequate data quantity and quality allow for the
generalizability of findings to larger populations or contexts.
Informed Decision-Making: Quality data provides a solid foundation for
making informed decisions.
Cost-Effectiveness: Poor data quality can lead to wasted resources and
inaccurate conclusions.
● Measuring Data Quality and Quantity:
Data Quality Metrics:
Specific metrics are used to assess the quality of data, such as accuracy
rates, completeness percentages, and consistency scores.
Data Validation Rules:
Rules are established to ensure that data meets predefined quality
standards.
Data Profiling:
Tools and techniques are used to analyze data and identify potential quality
issues.
Data Audits:
Regular audits are conducted to ensure data quality and integrity.
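As a rough sketch of how a few of these metrics can be computed in practice, the following Python example (using pandas) profiles a small, hypothetical customer table for completeness, uniqueness, and a simple validity rule; the data and the "US" country standard are assumptions made purely for illustration.
import pandas as pd

# Hypothetical customer records used only to illustrate simple quality metrics
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "country":     ["US", "USA", "USA", None, "US"],
    "age":         [34, 29, 29, 41, None],
})

completeness = 1 - df.isna().mean()                 # share of non-null values per column
duplicate_rows = df.duplicated().sum()              # uniqueness: count of fully duplicated rows
valid_country = df["country"].isin(["US"]).mean()   # validity/consistency against one agreed code

print(completeness)
print("Exact duplicate rows:", duplicate_rows)
print("Share of rows using the standard 'US' code:", valid_country)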
The role of data integrity vs. data quality is often confusing. Data quality
focuses on accuracy, completeness, and other attributes to make sure
that data is reliable. Data integrity, on the other hand, makes this
reliable data useful. It adds relationships and context to enrich data and
improve its effectiveness.
The difference between data integrity and data quality is in the level of
value they offer. Data quality works as the foundation for trusted
business decisions, while data integrity takes it one notch higher to
deliver better business decisions.
7 Metrics & Their Measurement Techniques
Here, we will walk you through the seven dimensions you need to consider
when assessing data quality in your organization. They are:
1. Accuracy
2. Completeness
3. Consistency
4. Validity
5. Timeliness
6. Uniqueness
7. Integrity
Accuracy
Data accuracy measures the correctness and completeness of specific data.
Essentially, you’re evaluating how much the information accurately reflects the
event or object described.
The accuracy of your data is essential to avoid problems with transaction
processing or generating faulty results in analytics.
Your decision-making is also dependent on high accuracy, as incorrect data
will likely result in poor decisions.
Imagine forecasting your manufacturing budget based on data that
inaccurately represents the number of units sold the previous year. A mistake
like this could have serious financial consequences in the short and long term.
Two practical ways of evaluating data accuracy are data profiling and
validation:
● Data profiling is the process of analyzing your data to identify
inconsistencies, errors, and other anomalies.
● Data validation assesses your data to ensure it is accurate, complete,
and internally consistent by checking it against predefined criteria.
This process can be done manually, but it's usually automated.
Completeness
The completeness of your data measures how many missing or null values
there are. This can be at the individual data level, across tables, or the whole
database.
Without this, you can’t be confident that you have all the data you need. And
without this confidence, any output, analysis, or decision is inherently flawed.
You must ensure your data is complete and includes all critical information.
Or, at the very least, you need a clear idea of what data you're missing so
you can account for it.
For example, when formulating a marketing plan, it’s important to know the
demographic of your intended customer. If any important information like age,
employment status, or location is missing, it would be impossible to create an
accurate picture of your customer base. You could end up spending time and
money promoting a product to an audience that would never even consider
purchasing it.
Similar to data accuracy, data profiling is a vital process in assessing your
completeness.
The other significant way to assess your completeness is with outlier analysis:
● Outlier analysis is the process of identifying data points significantly
different from the majority of data points in the same dataset. This
process is useful because you can narrow your analysis solely to the
data that stands out.
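A minimal sketch of outlier analysis, assuming numeric data and the common interquartile-range (IQR) rule in which points more than 1.5 × IQR beyond the quartiles are flagged; the order amounts below are hypothetical.
import numpy as np

# Hypothetical order amounts; 950 clearly stands out from the rest
orders = np.array([52, 48, 55, 60, 47, 51, 950, 49, 53, 58])

q1, q3 = np.percentile(orders, [25, 75])
iqr = q3 - q1

# Flag values far outside the interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = orders[(orders < lower) | (orders > upper)]
print("Outliers:", outliers)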
Consistency
When we talk about consistency, we’re talking about the uniformity of your
data across various sources and platforms.
In software development, the consistency of the data will make a huge
difference in the ability to utilize and process this data.
And at the decision-making level, consistency is important as you can directly
compare data accurately without being confused by inconsistent or
contradictory data.
For example, when referencing the United States, it's possible to input the
name of the country in multiple ways, such as United States, United States of
America, North America, USA, or US. If various stakeholders use different
definitions, say 25% use US, 25% USA, and 50% United States, additional
processing steps are required before the data can be used for analytics.
Another example would be if the monthly profit number is inconsistent with the
monthly revenue and cost numbers.
There are a couple of approaches for measuring the consistency of your data
—lineage analysis and cross-validation.
● Data lineage involves tracking the origin and movement of the data
across different locations and systems. The idea is to know exactly
where your data has come from and identify any changes along the
way.
● Cross-validation requires analysts to compare data from different
sources to assess consistency.
Validity
The validity of your data refers to the adherence to defined data formats and
constraints.
For example, if you are storing dates, you need to define how these dates will
be formatted. If you decide to use mm/dd/yyyy, then 04/18/1987 will be valid,
but 18/04/1987 won't.
Validity is significant because if you store a lot of invalidly formatted data,
you won't be able to analyze or compare it easily. This is because invalid data
would most likely be omitted from the results completely.
You can assess the validity of your data using data validation rules and
schema checks.
● Data validation rules: These are a predefined set of rules to which the
data can be compared. This is useful at the point of data creation.
● Schema checks: This process checks data against predefined data
structures and criteria. This is more useful when reviewing and
assessing pre-existing data.
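As a small illustration of a data validation rule of this kind, the sketch below (standard-library Python) checks whether date strings match the assumed mm/dd/yyyy format; the helper name and the sample values are hypothetical.
from datetime import datetime

def is_valid_date(value, fmt="%m/%d/%Y"):
    """Return True if the string matches the assumed mm/dd/yyyy format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Hypothetical inputs: the second fails because 18 is not a valid month
print(is_valid_date("04/18/1987"))  # True
print(is_valid_date("18/04/1987"))  # False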
Timeliness
Assessing timeliness involves measuring how up-to-date and relevant the
data is. For example, if you have data about your company’s finances, it’s
important to know if it’s from last week, or from 2001.
This is significant in your decision-making process because things change all
the time, and you need to be confident you’re not basing important business
decisions on stale information.
The best ways to assess the timeliness of your data are data freshness
analysis and data source monitoring.
● Data freshness analysis: This involves analyzing how recently the
data was uploaded or updated.
● Data source monitoring: This process concerns monitoring your data
sources, so you can flag dated data and seek to update it.
Uniqueness
Data uniqueness is the measure of how distinct and non-repetitive data values
are. Or, to put it in simpler terms, identifying and removing duplicate
information and records.
This is important because duplicate data can conflate or mislead your
decision-making process. For example, if you have a sale recorded twice,
then it will look like you’ve made more money than you really have, skewing
your profit/loss calculations and projections.
You can avoid this using duplicate detection and data deduplication.
● Duplicate detection: This is about assessing and identifying duplicate
records on existing or new data.
● Data deduplication: Often used alongside duplicate detection, this
involves manually or automatically removing any duplicate data from a
dataset.
Integrity
The integrity of your data is the measure of your data being accurate,
consistent, and reliable over time. For example, you might get your data to a
certain quality now, but over the next few years, this could deteriorate as more
data is added, modified, and deleted.
This is important to measure because if your integrity degrades, then your
decision-making will inevitably suffer too.
Part of this assessment is the continued use of data validation rules as we
covered for validity, but you’ll also want to carry out referential integrity
checks.
● Referential integrity checks: This is about ensuring that changes
made to data in one place are carried on and reflected in other places.
1.4 Data Types, Measures of central tendency, Measures of dispersion
Data Types refers to the different categories of information that can be
collected, such as numerical (discrete or continuous), categorical (ordinal,
nominal) or text. Examples include age, gender, temperature, and favorite
color.
Measures of central tendency describe the "typical" or "center" value of a
dataset. The most common measures of central tendency are:
● Mean: The sum of all values divided by the number of values
● Median: The value that separates the higher half of the data from the lower half
when the data is ordered
● Mode: The value that occurs most frequently in the data
Measures of dispersion indicate how spread out or distributed the data is
around the central tendency. Some common measures of dispersion include:
● Range: The difference between the highest and lowest values in the data set
● Variance: The average of the squared deviations from the mean
● Standard deviation: The square root of the variance
● Interquartile range (IQR): The difference between the third quartile and the first
quartile
Key points:
● Measures of central tendency tell us about the "typical" value in a dataset, while
measures of dispersion tell us how spread out the data is around that typical
value.
● Different measures of central tendency and dispersion may be more appropriate
depending on the characteristics of the data and the research question. For
example, the mean is sensitive to outliers, while the median is not.
● Choosing the right measure of central tendency and dispersion is crucial for
accurately interpreting and describing data.
What is Central Tendency?
Central tendency is a statistical concept that refers to the tendency of data to
cluster around a central value or a typical value.
It identifies a single value as representative of an entire data distribution. In
other words, the central tendency is a way of describing the centre or
midpoint of a dataset. The three most common measures of central tendency
are:
● Mean
● Median
● Mode
Let's discuss these as follows:
Mean
The mean, also known as the average, is calculated by summing up all the
values in a dataset and dividing the sum by the total number of values. It is
sensitive to extreme values, making it susceptible to outliers.
Median
The median is the middle value in a dataset when the values are arranged in
ascending or descending order. If there is an even number of values, the
median is the average of the two middle values. Unlike the mean, the median
is less affected by outliers.
Mode
The mode is the value that occurs most frequently in a dataset. It is
particularly useful for categorical data but can also be applied to numerical
data. A dataset may have one mode (unimodal), two modes (bimodal), or
more than two modes (multimodal).
Note: The choice of which central tendency measure to use depends on the
properties of the data. For example, the mean is best for symmetric
distributions, while the median is better for skewed distributions with
outliers. The mode is useful for categorical data.
Let's consider a dataset of daily temperatures recorded over a week: 22°C,
23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
● Mean: (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43°C
● Median: Arranging the temperatures in ascending order: 20°C, 21°C,
22°C, 22°C, 23°C, 24°C, 25°C. The median is 22°C.
● Mode: The mode is 22°C as it occurs most frequently in the dataset.
What is Dispersion?
Dispersion, also known as variability or spread, measures the extent to which
individual data points deviate from the central value. It provides information
about the spread or distribution of data points in a dataset.
Common measures of dispersion include
● Range
● Variance
● Standard Deviation
● Interquartile Range (IQR)
Range
In statistics, the range refers to the difference between the highest and
lowest values in a dataset. It provides a simple measure of variability,
indicating the spread of data points. The range is calculated by subtracting
the lowest value from the highest value.
For example, in a dataset {4, 6, 9, 3, 7}, the range is 9 - 3 = 6.
Variance
Variance is a statistical measure that quantifies the amount of variation or
dispersion of a set of numbers from their mean value. Specifically, variance is
defined as the expected value of the squared deviation from the mean. It is
calculated by:
1. Finding the mean (average) of the data set.
2. Subtracting the mean from each data point to get the deviations
from the mean.
3. Squaring each of the deviations.
4. Calculating the average of the squared deviations. This is the
variance.
Standard Deviation
Standard deviation is a measure of the amount of variation or dispersion of a
set of values from the mean value. It is calculated as the square root of the
variance, which is the average squared deviation from the mean.
Interquartile Range (IQR)
Interquartile Range (IQR) is a measure of statistical dispersion that
represents the middle 50% of a data set. It is calculated as the difference
between the 75th percentile (Q3) and the 25th percentile (Q1) of the data i.e.,
IQR = Q3 − Q1.
Examples for Dispersion
Let's consider the same dataset of daily temperatures recorded over a week:
22°C, 23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
Range: Maximum temperature - Minimum temperature = 25°C - 20°C = 5°C
Variance: Variance = (Sum of squared differences from the mean) / (Number
of data points)
Mean ≈ 22.43 °C
Sum of squared differences from the mean = (22 − 22.43)² + (23 − 22.43)² +
(21 − 22.43)² + (25 − 22.43)² + (22 − 22.43)² + (24 − 22.43)² + (20 − 22.43)²
= (−0.43)² + (0.57)² + (−1.43)² + (2.57)² + (−0.43)² + (1.57)² + (−2.43)²
= 0.185 + 0.325 + 2.045 + 6.605 + 0.185 + 2.465 + 5.905
≈ 17.71
Thus, Variance = 17.71 / 7 ≈ 2.53 °C²
Standard Deviation: Take the square root of the variance to get the standard
deviation.
Thus, Standard Deviation ≈ √2.53 ≈ 1.59 °C
Interquartile Range (IQR): First Quartile (Q1) = 21°C Third Quartile (Q3) =
24°C
Thus, IQR = Q3 − Q1 = 24°C - 21°C = 3°C
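These hand calculations can be reproduced with Python's standard statistics module, as in the sketch below; note that statistics.quantiles uses the "exclusive" quartile convention by default, which matches Q1 = 21 and Q3 = 24 here, while other tools may interpolate quartiles differently.
import statistics

temps = [22, 23, 21, 25, 22, 24, 20]   # daily temperatures in °C

print("Mean:", statistics.mean(temps))            # ≈ 22.43
print("Median:", statistics.median(temps))        # 22
print("Mode:", statistics.mode(temps))            # 22
print("Range:", max(temps) - min(temps))          # 5
print("Variance:", statistics.pvariance(temps))   # population variance ≈ 2.53
print("Std dev:", statistics.pstdev(temps))       # ≈ 1.59

q1, _, q3 = statistics.quantiles(temps, n=4)      # exclusive method: Q1 = 21, Q3 = 24
print("IQR:", q3 - q1)                            # 3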
Differences between Central Tendency and Dispersion
The key differences between central tendency and dispersion are summarized
below:
● Definition – Central tendency indicates the typical or central value around
which the data points in a dataset tend to cluster; dispersion indicates the
spread or variability of the data values around that central value.
● Purpose – Central tendency provides a single representative value
summarizing the dataset; dispersion describes how spread out the values are
from each other and from the central value.
● Examples – Central tendency: mean, median, mode. Dispersion: range,
variance, standard deviation, interquartile range.
● Calculation – Central tendency is calculated using the values of the dataset;
dispersion is calculated using deviations from the central value (usually the
mean).
● Interpretation – Central tendency helps in understanding the centre of the
data distribution; dispersion helps in understanding the spread of the data
distribution.
● Measure of location – Central tendency indicates where the data is centered;
dispersion indicates how widely the data is spread out.
1.5 Sampling Funnel, Central Limit Theorem, Confidence Interval,
Sampling Variation
The Central Limit Theorem in Statistics states that as the sample size
increases and its variance is finite, then the distribution of the sample mean
approaches normal distribution irrespective of the shape of the population
distribution.
The central limit theorem posits that the distribution of sample means will
invariably conform to a normal distribution provided the sample size is
sufficiently large. This holds regardless of the underlying distribution of the
population, be it normal, Poisson, binomial, or any alternative distribution.
In this section on the Central Limit Theorem, we will learn about the central
limit theorem definition, examples, formulas, proof of the Central Limit
Theorem, and its applications.
Central Limit Theorem
The Central Limit Theorem explains that the sampling distribution of the
sample mean resembles the normal distribution irrespective of whether the
variables themselves are normally distributed or not. The Central Limit
Theorem is often abbreviated as CLT.
Central Limit Theorem Definition:
The Central Limit Theorem states that:
When large samples (usually greater than thirty) are taken into
consideration, the distribution of the sample arithmetic mean approaches
the normal distribution irrespective of whether the random variables were
originally normally distributed or not.
Central Limit Theorem Formula
Let us assume we have a random variable X with mean μ and standard
deviation σ. As per the Central Limit Theorem, the sample mean X̄
approximately follows a normal distribution, X̄ ~ N(μ, σ/√n). The Z-score of
the sample mean is given as
Z = (X̄ − μ) / (σ/√n)
where X̄ is the sample mean, μ the population mean, σ the population
standard deviation, and n the sample size.
Central Limit Theorem Proof
Let X1, X2, X3, . . . , Xn be independent and identically distributed random
variables with mean zero (μ = 0) and variance one (σ² = 1).
The Z-score is given as Z = (X̄ − μ) / (σ/√n) = √n · X̄.
According to the Central Limit Theorem, Z approaches a normal distribution
as the value of n increases.
Let M(t) be the moment generating function of each Xi. Then
M(0) = 1
M'(0) = E(Xi) = μ = 0
M''(0) = E(Xi²) = 1
The moment generating function of Xi/√n is E[e^(tXi/√n)] = M(t/√n). Since
X1, X2, X3, . . . , Xn are independent, the moment generating function of
(X1 + X2 + X3 + . . . + Xn)/√n is [M(t/√n)]^n.
Let f(t) = log M(t). Then
f(0) = log M(0) = 0
f'(0) = M'(0)/M(0) = μ = 0
f''(0) = (M(0)·M''(0) − M'(0)²) / M(0)² = 1
Using L'Hôpital's rule, n·f(t/√n) → t²/2 as n → ∞, so
[M(t/√n)]^n = [e^(f(t/√n))]^n = e^(n·f(t/√n)) → e^(t²/2),
which is the moment generating function of the standard normal
distribution. This proves the Central Limit Theorem.
Central Limit Theorem Examples
Let's say we have a large sample of observations and each sample is
randomly produced and independent of other observations. Calculate the
average of the observations, thus having a collection of averages of
observations. Now as per the Central Limit Theorem, if the sample size is
adequately large, then the probability distribution of these sample averages
will approximate to a normal distribution.
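The following Python sketch (using numpy) illustrates this idea by drawing many samples from a deliberately skewed exponential population and checking that the sample means cluster around the population mean with spread close to σ/√n; the population and sample sizes are arbitrary choices made for the demonstration.
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 1 and standard deviation 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw many samples of size n and record each sample mean
n, num_samples = 50, 5_000
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(num_samples)])

print("Mean of sample means:", sample_means.mean())   # close to 1.0
print("Std of sample means:", sample_means.std())     # close to 1/sqrt(50) ≈ 0.141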
Assumptions of the Central Limit Theorem
The Central Limit Theorem is valid for the following conditions:
● The drawing of the sample from the population should be random.
● The drawing of the sample should be independent of each other.
● The sample size should not exceed ten percent of the total
population when sampling is done without replacement.
● Sample Size should be adequately large.
● CLT only holds for a population with finite variance.
Steps to Solve Problems on Central Limit Theorem
Problems on the Central Limit Theorem that involve >, < or 'between' can be
solved by the following steps:
● Step 1: First identify the >, < associated with the sample size,
population size, mean, and variance in the problem. There can also
be a 'between' associated with a range of two numbers.
● Step 2: Draw a graph with the mean as the centre.
● Step 3: Find the Z-score using the formula.
● Step 4: Refer to the Z-table to find the value of Z obtained in the
previous step.
● Step 5: If the problem involves '>', subtract the Z-score value from
0.5; if the problem involves '<', add 0.5 to the Z-score value; and if
the problem involves 'between', perform only steps 3 and 4.
● Step 6: The Z-score value is found for the sample mean X̄.
● Step 7: Convert the decimal value obtained in all three cases to a
percentage.
Mean of the Sample Mean
According to the Central Limit Theorem:
● If you have a population with a mean μ, the mean of the sample means
(also called the expected value of the sample mean) will be equal to the
population mean:
E(X̄) = μ
Standard Deviation of the Sample Mean
The standard deviation of the sample mean (often called the standard error)
describes how much the sample mean is expected to vary from the true population
mean. It is calculated using the population standard deviation σ and the
sample size n:
σ_X̄ = σ/√n
Central Limit Theorem Applications
The Central Limit Theorem is generally used to predict the characteristics of a
population from a set of samples. It can be applied in various fields. Some of
the applications of the Central Limit Theorem are mentioned below:
● The Central Limit Theorem is used by economists and data scientists
to draw conclusions about a population and build statistical models.
● The Central Limit Theorem is used by biologists to make accurate
predictions about the characteristics of a population from a set of
samples.
● Manufacturing industries use the Central Limit Theorem to predict the
overall proportion of defective items produced by selecting random
products from a sample.
● The Central Limit Theorem is used in surveys to predict the
characteristics of the population or the average response of the
population by analyzing a sample of obtained responses.
● The CLT can be used in machine learning to draw conclusions about
the performance of a model.
Confidence Interval (CI) is a range of values that estimates where the true
population value is likely to fall. Instead of just saying "The average height
of students is 165 cm", a confidence interval allows us to say "We are 95%
confident that the true average height is between 160 cm and 170 cm".
Before diving into confidence intervals you should be familiar with:
● t-test
● z-test
Interpreting Confidence Intervals
Let's say we take a sample of 50 students and calculate a 95% confidence
interval for their average height, which turns out to be 160–170 cm. This
means that if we repeatedly took similar samples, 95% of those intervals
would contain the true average height of all students in the population.
Confidence level tells us how sure we are that the true value is within a
calculated range. If we have to repeat the sampling process many times we
expect that a certain percentage of those intervals will include the true value.
● 90% Confidence: 90% of intervals would include the true
population value.
● 95% Confidence: 95% of intervals would include the true value
which is commonly used in data science.
● 99% Confidence: 99% of intervals would include the true value but
the intervals would be wider.
Why are Confidence Intervals Important in Data Science?
● They help to measure the uncertainty in predictions and estimates.
● They let data scientists report a reliable range instead of just a
single number.
● They are widely used in A/B testing, machine learning, and survey
analysis (covered later) to check whether results are meaningful.
Steps for Constructing a Confidence Interval
To calculate a confidence interval follow these simple 4 steps:
Step 1: Identify the sample problem.
Define the population parameter you want to estimate, e.g., the mean height
of students. Choose the right statistic, such as the sample mean.
Step 2: Select a confidence level.
In this step we select the confidence level; common choices are 90%, 95%,
or 99%. It represents how sure we are about our estimate.
Step 3: Find the margin of error.
To find the margin of error, you use the formula:
Margin of Error = Critical Value × Standard Error
The critical value is found using Z-tables (or T-tables for small samples).
First you choose the significance level (α), which is typically 0.05 for a 95%
confidence level. Then decide whether you are performing a one-tailed or
two-tailed test, with two-tailed being the more common choice. After this
you look up the corresponding value in the Z-table or T-table based on your
significance level and test type.
The standard error measures the variability of the sample and is calculated by
dividing the sample's standard deviation by the square root of the sample
size. Combining the critical value and standard error gives you the margin
of error, which tells you the range within which you expect the true value to
fall.
Step 4: Specify the confidence interval.
To find a confidence interval, we use this formula:
Confidence Interval = Point Estimate ± Margin of Error
The point estimate is usually the average or mean from your sample; it is the
best guess of the true value based on the sample data. The margin of error
tells you how much the sample data might vary from the true value, as
calculated in the previous step.
So when you add or subtract the margin of error from your point estimate,
you get a range. This range tells you where the true value is likely to fall.
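Putting the four steps together, here is a minimal Python helper (using scipy.stats for the critical value); the function name and the example numbers for student heights are illustrative assumptions, not part of any standard API.
import math
from scipy import stats

def confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    # Step 2: significance level from the chosen confidence level (two-tailed)
    alpha = 1 - confidence
    # Step 3: critical value (normal approximation) and standard error
    critical_value = stats.norm.ppf(1 - alpha / 2)
    standard_error = sample_std / math.sqrt(n)
    margin_of_error = critical_value * standard_error
    # Step 4: point estimate ± margin of error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample: mean height 165 cm, standard deviation 10 cm, n = 50
print(confidence_interval(165, 10, 50))   # roughly (162.2, 167.8)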
Types of Confidence Intervals
Some of the common types of Confidence Intervals are:
1. Confidence Interval for the Mean of Normally Distributed Data
When we want to find the mean of a population based on a sample we use
this method.
● If the sample size is small (less than 30) we use the T-distribution
because small samples tend to have more variability.
● If the sample size is large (more than 30) then we use the Z-
distribution because large samples tend to give more accurate
estimates.
2. Confidence Interval for Proportions
This type is used when estimating population proportions, such as the
percentage of people who like a product. Here we use the sample proportion,
the standard error, and the critical z-value to calculate the interval. It gives us
an idea of where the real value could fall based on the sample data.
3. Confidence Interval for Non-Normally Distributed Data
Sometimes the data you have isn't normally distributed, meaning it doesn't
follow a bell-shaped curve. In such cases, traditional confidence intervals are
not the best method. Instead we can use bootstrap methods. This involves
resampling the data many times to create different samples and then
calculating the confidence interval from those resamples.
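A minimal sketch of a percentile bootstrap confidence interval, assuming a skewed hypothetical sample generated with numpy; the sample values, resample count, and 95% level are illustrative choices only.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed sample (e.g., purchase amounts); values are illustrative only
data = rng.exponential(scale=50.0, size=40)

# Bootstrap: resample with replacement many times and record the mean of each resample
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(10_000)])

# 95% percentile bootstrap confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")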
Calculating Confidence Interval
After understanding t-tests and z-tests, we now move on to how to calculate
confidence intervals. To calculate a confidence interval you need two key
statistics:
● Mean (μ) – The arithmetic mean is the average of the numbers, defined
as the sum of the n values divided by n:
μ = (x1 + x2 + x3 + … + xn) / n
● Standard deviation (σ) – A measure of how spread out the numbers are,
defined as the square root of the average squared difference between
each value and the mean:
σ = √( Σ(xi − μ)² / n )
Once you have these, you can calculate the confidence interval using either
the t-distribution or the z-distribution, depending on the sample size and on
whether the population standard deviation is known.
A) Using t-distribution
When your sample size is small (typically n < 30) and you don't know the
population standard deviation, we use the t-distribution. This is commonly
seen in fields like A/B testing or when working with small datasets.
Consider the following example. A random sample of 10 UFC fighters was
taken and their weights were measured. The sample mean weight was 240 kg
and the sample standard deviation was 25 kg. Construct a 95% confidence
interval estimate for the true mean weight of all UFC fighters.
Step-by-Step Process:
● Degrees of Freedom (df):
For the t-distribution we first calculate the degrees of freedom:
df = n − 1 = 10 − 1 = 9
● Significance Level (α):
The confidence level (CL) is 95%, so the significance level is:
α = (1 − CL) / 2 = (1 − 0.95) / 2 = 0.025
● Find the t-value from the t-distribution table: From the t-table, for df = 9
and α = 0.025, the t-value is 2.262, which can be found using the table below.
(df)/(α) 0.1 0.05 0.025 ..
∞ 1.282 1.645 1.960 ..
1 3.078 6.314 12.706 ..
2 1.886 2.920 4.303 ..
: : : : ..
8 1.397 1.860 2.306 ..
9 1.383 1.833 2.262 ..
● Apply the t-value in the formula:
The formula for the confidence interval is:
μ ± t × (σ/√n)
Using the values:
240 ± 2.262 × (25/√10)
The confidence interval becomes:
(222.117, 257.883)
Therefore we are 95% confident that the true mean weight of UFC fighters is
between 222.117 kg and 257.883 kg.
This can be calculated using Python’s scipy.stats library to find the t-value
and perform the necessary calculations. The stats module provides various
statistical functions, probability distributions, and statistical tests.
import scipy.stats as stats
import math

# Sample statistics from the UFC fighter example
sample_mean = 240
sample_std_dev = 25
sample_size = 10
confidence_level = 0.95

# Degrees of freedom and two-tailed significance level
df = sample_size - 1
alpha = (1 - confidence_level) / 2

# Critical t-value for the given confidence level and degrees of freedom
t_value = stats.t.ppf(1 - alpha, df)

# Margin of error = t-value * standard error
margin_of_error = t_value * (sample_std_dev / math.sqrt(sample_size))

lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error
print(f"Confidence Interval: ({lower_limit}, {upper_limit})")
Output:
Confidence Interval: (222.1160773511857, 257.8839226488143)
B) Using Z-distribution
When the sample size is large (n > 30) or the population standard deviation
is known, the z-distribution is used. This is common in large-scale surveys
and market research.
Consider the following example. A random sample of 50 adult females was
taken and their RBC count was measured. The sample mean is 4.63 and the
standard deviation of the RBC count is 0.54. Construct a 95% confidence
interval estimate for the true mean RBC count in adult females.
Step-by-Step Process:
1. Find the mean and standard deviation given in the problem.
2. Find the z-value for the confidence level:
For a 95% confidence interval the z-value is 1.960.
3. Apply the z-value in the formula:
μ ± z × (σ/√n)
Using the values (common z-values are given in the table below):
Confidence Interval z-value
90% 1.645
95% 1.960
99% 2.576
The confidence interval becomes: 4.63 ± 1.960 × (0.54/√50) = (4.480, 4.780)
Therefore we are 95% confident that the true mean RBC count for adult
females is between 4.480 and 4.780.
Now let's implement this in Python (using numpy):
import numpy as np

# Sample statistics from the RBC count example
sample_mean = 4.63
std_dev = 0.54
sample_size = 50
confidence_level = 0.95

# Standard error of the mean
standard_error = std_dev / np.sqrt(sample_size)

# Critical z-value for a 95% confidence level (see table above)
z_value = 1.960

margin_of_error = z_value * standard_error
lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error
print(f"Confidence Interval: ({lower_limit:.3f}, {upper_limit:.3f})")
Output:
Confidence Interval: (4.480, 4.780)
Some key takeaways about confidence intervals are:
● Confidence Intervals are essential in data science to find the
uncertainty of estimates and make predictions more reliable.
● t-distribution is used for small sample sizes (n < 30) while z-
distribution is used for large sample sizes (n > 30).
● Confidence intervals help to make data-driven decisions by
providing a range instead of a single point estimate. This is
especially important in A/B testing, market research and machine
learning.
Sampling variation refers to the differences observed when multiple samples
are drawn from the same population. It's a natural consequence of taking a
subset of a larger group, as each sample is likely to have slightly different
characteristics than the population as a whole. Understanding sampling
variation is crucial in statistics for interpreting sample data and making
inferences about the population.
Elaboration:
Definition:
Sampling variation is the inherent variability that exists when different samples are
drawn from the same population. It's the fact that each sample will likely have a
slightly different makeup than the population, and even different samples will have
differences among them.
Why it matters:
In statistics, sampling variation is a key concept because it helps researchers
understand the uncertainty associated with sample data. By acknowledging and
quantifying sampling variation, researchers can make more accurate inferences
about the population based on the sample data.
Types of sampling variation:
There are different sources of sampling variation, including:
● Random variation: This is the natural fluctuation that occurs due to chance.
● Systematic variation: This occurs when a particular sample is consistently
different from the population, often due to bias in the sampling method.
Mitigating sampling variation:
Researchers can take steps to minimize sampling variation, including:
● Increasing sample size: Larger samples tend to have less variability, as
they are more representative of the population.
● Using appropriate sampling methods: Techniques like stratified sampling
can help to reduce variation within specific subgroups of the population.
● Using statistical tests: Researchers can use statistical tests to assess the
magnitude of sampling variation and determine the likelihood that observed
differences are due to chance or real population differences.
Examples:
● If you take several samples of students in a school to estimate the average
height, each sample will likely have slightly different average heights. This is
due to sampling variation.
● If you are testing a new drug, the effectiveness of the drug in different samples
of patients may vary, even if the drug is truly effective. This is due to sampling
variation.
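The sketch below (Python with numpy) makes the first example concrete by drawing several samples of 30 students from one hypothetical population of heights and printing each sample mean; every number shown is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population of 10,000 student heights in cm
population = rng.normal(loc=165, scale=8, size=10_000)

# Each sample of 30 students gives a slightly different mean: that is sampling variation
for i in range(5):
    sample = rng.choice(population, size=30, replace=False)
    print(f"Sample {i + 1} mean height: {sample.mean():.2f} cm")

print(f"Population mean height: {population.mean():.2f} cm")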