Unit - I Introduction to Data Analytics
1.1 Data Analytics:
Data analytics converts raw data into actionable insights. It includes a range
of tools, technologies, and processes used to find trends and solve problems
by using data. Data analytics can shape business processes, improve
decision-making, and foster business growth.
Descriptive analysis, for instance, helps describe or summarize quantitative
data by presenting statistics. Descriptive statistical analysis could show the
sales distribution across a group of employees and the average sales figure
per employee, answering the question, "What happened?"
Data analytics is an important field that involves the process of collecting,
processing, and interpreting data to uncover insights and help in making
decisions. Data analytics is the practice of examining raw data to identify
trends, draw conclusions, and extract meaningful information. This
involves various techniques and tools to process and transform data into
valuable insights that can be used for decision-making.
NEED OF DATA ANALYTICS:
The use of data analytics in product development gives a reliable understanding
of future requirements. The company can understand the current market
situation of the product and use analytic techniques to develop new products
as per market requirements. The ability to make data-driven decisions can
give organizations a competitive edge in their markets. Data analysts are
essential for leveraging the power of data: they turn data into meaningful
insights that can drive better decision-making.
Importance of Data Analytics
Data analytics is important because it helps businesses optimize their
performance. Implementing it into the business model means companies can
reduce costs by identifying more efficient ways of doing business and by
storing large amounts of data.
Improved Decision-Making – If we have supporting data in favor of a
decision, we can implement it with a higher probability of success. For
example, if a certain decision or plan has led to better outcomes, there will
be no doubt about implementing it again.
Better Customer Service – Churn modeling is the best example of this, in
which we try to predict or identify what leads to customer churn and change
those things accordingly, so that customer attrition stays as low as possible,
which is a critical factor for any organization.
Efficient Operations – Data analytics can help us understand what the
situation demands and what should be done to get better results, so that we
can streamline our processes, which in turn leads to efficient operations.
Effective Marketing – Market segmentation techniques are implemented to
target this important factor, helping us find the marketing techniques that
increase sales and lead to effective marketing strategies.
DATA ANALYTICS
Analytics is the discovery and communication of meaningful patterns in
data. Especially valuable in areas rich with recorded information, analytics
relies on the simultaneous application of statistics, computer programming,
and operations research to quantify performance. Analytics often favors data
visualization to communicate insight. Firms commonly apply analytics to
business data to describe, predict, and improve business performance.
Areas within analytics include predictive analytics, enterprise decision
management, etc. Since analytics can require extensive computation (because
of big data), its algorithms and software harness the most current methods in
computer science. Data analytics aims to get actionable insights resulting in
smarter decisions and better business outcomes. It is critical to design and
build a data warehouse or Business Intelligence (BI) architecture that provides
a flexible, multi-faceted analytical ecosystem, optimized for efficient
ingestion and analysis of large and diverse data sets.
What is Data Analytics? In this new digital world, data is being generated in
an enormous amount, which opens new paradigms. As we have high
computing power as well as a large amount of data, we can use this data for
data-driven decision-making. The main benefit of data-driven decisions is
that they are made by observing past trends that have produced beneficial
results. In short, data analytics is the process of manipulating data to extract
useful trends and hidden patterns that help us derive valuable insights to
make business predictions.
1.2 TYPES / CATEGORIES / MODELS OF DATA ANALYTICS
There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses
data to determine the probable outcome of an event or the likelihood of a
situation occurring. Predictive analytics draws on a variety of statistical
techniques from modeling, machine learning, data mining, and game theory
that analyze current and historical facts to make predictions about future
events.
Techniques used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic cornerstones of predictive analytics:
Predictive modeling
Decision analysis and optimization
Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into
how to approach future events. It looks at past performance and understands
it by mining historical data to understand the cause of success or failure in
the past. Almost all management reporting, such as sales, marketing,
operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often
used to classify customers or prospects into groups. Unlike a predictive
model that focuses on predicting the behavior of a single customer,
descriptive analytics identifies many different relationships between
customers and products.
Common examples of Descriptive analytics are company reports that provide
historic reviews like:
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical
science, business rules, and machine learning to make a prediction and then
suggests decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also
suggesting actions that benefit from the predictions and showing the decision
maker the implications of each decision option.
Prescriptive analytics not only anticipates what will happen and when it will
happen, but also why it will happen. Further, it can suggest decision options
on how to take advantage of a future opportunity or mitigate a future risk,
and illustrate the implications of each decision option.
For example, prescriptive analytics can benefit healthcare strategic planning
by using analytics to leverage operational and usage data combined with data
on external factors such as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a
question or solve a problem. We try to find dependencies and patterns in the
historical data of the particular problem.
For example, companies go for this analysis because it gives great insight
into a problem, and they also keep detailed information at their disposal;
otherwise, data would have to be collected separately for every problem,
which would be very time-consuming. Common techniques used for
Diagnostic Analytics are:
Data discovery
Data mining
Correlations
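As a minimal illustration of the correlation technique, the sketch below (Python with the pandas library) checks how strongly each factor in a small historical dataset moves with sales; the column names and all numbers are hypothetical and used only for demonstration.
import pandas as pd

# Hypothetical historical data for a diagnostic question such as
# "which factors moved together with the drop in sales?" (values are illustrative)
history = pd.DataFrame({
    "sales":     [120, 115, 130, 90, 85, 100],
    "ad_spend":  [20, 19, 22, 12, 11, 15],
    "price":     [9.9, 9.9, 9.5, 11.5, 11.9, 10.5],
    "stockouts": [1, 2, 0, 6, 7, 4],
})

# The correlation matrix shows which variables rise and fall together with sales
print(history.corr()["sales"].sort_values())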
1.3 Life cycle of Data Analytics, Quality and Quantity of data,
Measurement
Data Analytics Lifecycle :
The data analytics lifecycle is designed for Big Data problems and data
science projects. The cycle is iterative, to represent a real project. To address
the distinct requirements for performing analysis on Big Data, a step-by-step
methodology is needed to organize the activities and tasks involved with
acquiring, processing, analyzing, and repurposing data.
● Phase 1: Discovery –
● The data science team learns and investigates the problem.
● Develop context and understanding.
● Come to know about data sources needed and available for the
project.
● The team formulates the initial hypothesis that can be later
tested with data.
● Phase 2: Data Preparation -
● Steps to explore, preprocess, and condition data before modeling
and analysis.
● It requires the presence of an analytic sandbox; the team extracts,
loads, and transforms (ELT) data to get it into the sandbox.
● Data preparation tasks are likely to be performed multiple times
and not in a predefined order.
● Several tools commonly used for this phase are Hadoop, Alpine
Miner, OpenRefine, etc.
● Phase 3: Model Planning -
● The team explores the data to learn about relationships between
variables and subsequently selects key variables and the most
suitable models.
● In this phase, the data science team develops data sets for
training, testing, and production purposes.
● The team builds and executes models based on the work done in
the model planning phase.
● Several tools commonly used for this phase are MATLAB and
STATISTICA.
● Phase 4: Model Building -
● The team develops datasets for testing, training, and production
purposes.
● The team also considers whether its existing tools will suffice for
running the models or whether it needs a more robust environment
for executing the models.
● Free or open-source tools - R and PL/R, Octave, WEKA.
● Commercial tools - MATLAB and STATISTICA.
● Phase 5: Communicate Results -
● After executing the model, the team needs to compare the outcomes
of modeling to the criteria established for success and failure.
● The team considers how best to articulate findings and outcomes to
various team members and stakeholders, taking into account
warnings and assumptions.
● The team should identify key findings, quantify the business value,
and develop a narrative to summarize and convey findings to
stakeholders.
● Phase 6: Operationalize -
● The team communicates the benefits of the project more broadly and
sets up a pilot project to deploy the work in a controlled way before
broadening it to the full enterprise of users.
● This approach enables the team to learn about the performance and
related constraints of the model in a production environment on a
small scale and make adjustments before full deployment.
● The team delivers final reports, briefings, and code.
● Free or open source tools - Octave, WEKA, SQL, MADlib.
Quality and Quantity of data, Measurement
Data quality and quantity are crucial aspects of measurement, influencing the
reliability and validity of research and decision-making. Data quality refers to
the accuracy, completeness, consistency, and timeliness of the data, while
data quantity refers to the amount of data collected.
● Data Quality:
Accuracy: How closely the data reflects the true value or event.
Completeness: The extent to which all necessary information is present.
Consistency: The degree to which data values are uniform and free from
contradictions.
Timeliness: The freshness and relevance of the data.
Relevance: How well the data aligns with the intended purpose of analysis.
Validity: Whether the data is appropriate for the research question or
decision-making process.
Uniqueness: The absence of duplicate or redundant entries.
Integrity: The assurance that the data is accurate and reliable.
● Data Quantity:
Sample Size: The number of observations or participants in a study.
Data Coverage: The extent to which the data represents the population or
phenomenon being studied.
Data Depth: The level of detail and granularity of the data.
● Importance of Data Quality and Quantity:
Reliability: High-quality data ensures the reliability of research findings and
decision-making.
Validity: The validity of conclusions depends on the quality and quantity of
the data used.
Generalizability: Adequate data quantity and quality allow for the
generalizability of findings to larger populations or contexts.
Informed Decision-Making: Quality data provides a solid foundation for
making informed decisions.
Cost-Effectiveness: Poor data quality can lead to wasted resources and
inaccurate conclusions.
● Measuring Data Quality and Quantity:
Data Quality Metrics:
Specific metrics are used to assess the quality of data, such as accuracy
rates, completeness percentages, and consistency scores.
Data Validation Rules:
Rules are established to ensure that data meets predefined quality
standards.
Data Profiling:
Tools and techniques are used to analyze data and identify potential quality
issues.
Data Audits:
Regular audits are conducted to ensure data quality and integrity.
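As a rough sketch of how a few of these metrics can be computed in practice, the following Python example (using pandas) profiles a small, hypothetical customer table for completeness, uniqueness, and a simple validity rule; the data and the "US" country standard are assumptions made purely for illustration.
import pandas as pd

# Hypothetical customer records used only to illustrate simple quality metrics
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "country":     ["US", "USA", "USA", None, "US"],
    "age":         [34, 29, 29, 41, None],
})

completeness = 1 - df.isna().mean()                 # share of non-null values per column
duplicate_rows = df.duplicated().sum()              # uniqueness: count of fully duplicated rows
valid_country = df["country"].isin(["US"]).mean()   # validity/consistency against one agreed code

print(completeness)
print("Exact duplicate rows:", duplicate_rows)
print("Share of rows using the standard 'US' code:", valid_country)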
The role of data integrity vs. data quality is often confusing. Data quality
focuses on accuracy, completeness, and other attributes to make sure
that data is reliable. Data integrity, on the other hand, makes this
reliable data useful. It adds relationships and context to enrich data and
improve its effectiveness.
The difference between data integrity and data quality is in the level of
value they offer. Data quality works as the foundation for trusted
business decisions, while data integrity takes it one notch higher to
deliver better business decisions.
7 Metrics & Their Measurement Techniques
Here, we will walk you through the seven dimensions you need to consider
when assessing data quality in your organization. They are:
1. Accuracy
2. Completeness
3. Consistency
4. Validity
5. Timeliness
6. Uniqueness
7. Integrity
Accuracy
Data accuracy measures the correctness and completeness of specific data.
Essentially, you’re evaluating how much the information accurately reflects the
event or object described.
The accuracy of your data is essential to avoid problems with transaction
processing or generating faulty results in analytics.
Your decision-making is also dependent on high accuracy, as incorrect data
will likely result in poor decisions.
Imagine forecasting your manufacturing budget based on data that
inaccurately represents the number of units sold the previous year. A mistake
like this could have serious financial consequences in the short and long term.
Two practical ways of evaluating data accuracy are data profiling and
validation:
● Data profiling is the process of analyzing your data to identify
inconsistencies, errors, and other anomalies.
● Data validation assesses your data to ensure it is accurate, complete,
and internally consistent by checking it against predefined criteria.
This process can be done manually, but it's usually automated.
Completeness
The completeness of your data measures how many missing or null values
there are. This can be at the individual data level, across tables, or the whole
database.
Without this, you can’t be confident that you have all the data you need. And
without this confidence, any output, analysis, or decision is inherently flawed.
You must ensure your data is complete and includes all critical information.
Or, at the very least, you need a clear idea of what data you're missing so
you can account for it.
For example, when formulating a marketing plan, it’s important to know the
demographic of your intended customer. If any important information like age,
employment status, or location is missing, it would be impossible to create an
accurate picture of your customer base. You could end up spending time and
money promoting a product to an audience that would never even consider
purchasing it.
Similar to data accuracy, data profiling is a vital process in assessing your
completeness.
The other significant way to assess your completeness is with outlier analysis:
● Outlier analysis is the process of identifying data points significantly
different from the majority of data points in the same dataset. This
process is useful because you can narrow your analysis solely to the
data that stands out.
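A minimal sketch of outlier analysis, assuming numeric data and the common interquartile-range (IQR) rule in which points more than 1.5 × IQR beyond the quartiles are flagged; the order amounts below are hypothetical.
import numpy as np

# Hypothetical order amounts; 950 clearly stands out from the rest
orders = np.array([52, 48, 55, 60, 47, 51, 950, 49, 53, 58])

q1, q3 = np.percentile(orders, [25, 75])
iqr = q3 - q1

# Flag values far outside the interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = orders[(orders < lower) | (orders > upper)]
print("Outliers:", outliers)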
Consistency
When we talk about consistency, we’re talking about the uniformity of your
data across various sources and platforms.
In software development, the consistency of the data will make a huge
difference in the ability to utilize and process this data.
And at the decision-making level, consistency is important as you can directly
compare data accurately without being confused by inconsistent or
contradictory data.
For example, when referencing the United States, it's possible to input the
name of the country in multiple ways, such as United States, United States of
America, North America, USA, or US. If various stakeholders use different
definitions, say 25% use US, 25% USA, and 50% United States, additional
processing steps are required before the data can be used for analytics.
Another example would be if the monthly profit number is inconsistent with the
monthly revenue and cost numbers.
There are a couple of approaches for measuring the consistency of your data
—lineage analysis and cross-validation.
● Data lineage involves tracking the origin and movement of the data
across different locations and systems. The idea is to know exactly
where your data has come from and identify any changes along the
way.
● Cross-validation requires analysts to compare data from different
sources to assess consistency.
Validity
The validity of your data refers to the adherence to defined data formats and
constraints.
For example, if you are storing dates, you need to define how these dates will
be formatted. If you decide to use mm/dd/yyyy, then 04/18/1987 will be valid,
but 18/04/1987 won't.
Validity is significant because if you store a lot of invalidly formatted data,
you won't be able to analyze or compare it easily. This is because invalid data
would most likely be omitted from the results completely.
You can assess the validity of your data using data validation rules and
schema checks.
● Data validation rules: These are a predefined set of rules to which the
data can be compared. This is useful at the point of data creation.
● Schema checks: This process checks data against predefined data
structures and criteria. This is more useful when reviewing and
assessing pre-existing data.
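As a small illustration of a data validation rule of this kind, the sketch below (standard-library Python) checks whether date strings match the assumed mm/dd/yyyy format; the helper name and the sample values are hypothetical.
from datetime import datetime

def is_valid_date(value, fmt="%m/%d/%Y"):
    """Return True if the string matches the assumed mm/dd/yyyy format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Hypothetical inputs: the second fails because 18 is not a valid month
print(is_valid_date("04/18/1987"))  # True
print(is_valid_date("18/04/1987"))  # False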
Timeliness
Assessing timeliness involves measuring how up-to-date and relevant the
data is. For example, if you have data about your company’s finances, it’s
important to know if it’s from last week, or from 2001.
This is significant in your decision-making process because things change all
the time, and you need to be confident you’re not basing important business
decisions on stale information.
The best ways to assess the timeliness of your data are data freshness
analysis and data source monitoring.
● Data freshness analysis: This involves analyzing how recently the
data was uploaded or updated.
● Data source monitoring: This process concerns monitoring your data
sources, so you can flag dated data and seek to update it.
Uniqueness
Data uniqueness is the measure of how distinct and non-repetitive data values
are. Or, to put it in simpler terms, identifying and removing duplicate
information and records.
This is important because duplicate data can conflate or mislead your
decision-making process. For example, if you have a sale recorded twice,
then it will look like you’ve made more money than you really have, skewing
your profit/loss calculations and projections.
You can avoid this using duplicate detection and data deduplication.
● Duplicate detection: This is about assessing and identifying duplicate
records on existing or new data.
● Data deduplication: Often used alongside duplicate detection, this
involves manually or automatically removing any duplicate data from a
dataset.
Integrity
The integrity of your data is the measure of your data being accurate,
consistent, and reliable over time. For example, you might get your data to a
certain quality now, but over the next few years, this could deteriorate as more
data is added, modified, and deleted.
This is important to measure because if your integrity degrades, then your
decision-making will inevitably suffer too.
Part of this assessment is the continued use of data validation rules as we
covered for validity, but you’ll also want to carry out referential integrity
checks.
● Referential integrity checks: This is about ensuring that changes
made to data in one place are carried on and reflected in other places.
1.4 Data Types, Measures of central tendency, Measures of dispersion
Data Types refers to the different categories of information that can be
collected, such as numerical (discrete or continuous), categorical (ordinal,
nominal) or text. Examples include age, gender, temperature, and favorite
color.
Measures of central tendency describe the "typical" or "center" value of a
dataset. The most common measures of central tendency are:
● Mean: The sum of all values divided by the number of values
● Median: The value that separates the higher half of the data from the lower half
when the data is ordered
● Mode: The value that occurs most frequently in the data
Measures of dispersion indicate how spread out or distributed the data is
around the central tendency. Some common measures of dispersion include:
● Range: The difference between the highest and lowest values in the data set
● Variance: The average of the squared deviations from the mean
● Standard deviation: The square root of the variance
● Interquartile range (IQR): The difference between the third quartile and the first
quartile
Key points:
● Measures of central tendency tell us about the "typical" value in a dataset, while
measures of dispersion tell us how spread out the data is around that typical
value.
● Different measures of central tendency and dispersion may be more appropriate
depending on the characteristics of the data and the research question. For
example, the mean is sensitive to outliers, while the median is not.
● Choosing the right measure of central tendency and dispersion is crucial for
accurately interpreting and describing data.
What is Central Tendency?
Central tendency is a statistical concept that refers to the tendency of data to
cluster around a central value or a typical value.
It identifies a single value as representative of an entire data distribution. In
other words, the central tendency is a way of describing the centre or
midpoint of a dataset. The three most common measures of central tendency
are:
● Mean
● Median
● Mode
Let's discuss these as follows:
Mean
The mean, also known as the average, is calculated by summing up all the
values in a dataset and dividing the sum by the total number of values. It is
sensitive to extreme values, making it susceptible to outliers.
Median
The median is the middle value in a dataset when the values are arranged in
ascending or descending order. If there is an even number of values, the
median is the average of the two middle values. Unlike the mean, the median
is less affected by outliers.
Mode
The mode is the value that occurs most frequently in a dataset. It is
particularly useful for categorical data but can also be applied to numerical
data. A dataset may have one mode (unimodal), two modes (bimodal), or
more than two modes (multimodal).
Note: The choice of which central tendency measure to use depends on the
properties of the data. For example, the mean is best for symmetric
distributions, while the median is better for skewed distributions with
outliers. The mode is useful for categorical data.
Let's consider a dataset of daily temperatures recorded over a week: 22°C,
23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
● Mean: (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43°C
● Median: Arranging the temperatures in ascending order: 20°C, 21°C,
22°C, 22°C, 23°C, 24°C, 25°C. The median is 22°C.
● Mode: The mode is 22°C as it occurs most frequently in the dataset.
What is Dispersion?
Dispersion, also known as variability or spread, measures the extent to which
individual data points deviate from the central value. It provides information
about the spread or distribution of data points in a dataset.
Common measures of dispersion include
● Range
● Variance
● Standard Deviation
● Interquartile Range (IQR)
Range
In statistics, the range refers to the difference between the highest and
lowest values in a dataset. It provides a simple measure of variability,
indicating the spread of data points. The range is calculated by subtracting
the lowest value from the highest value.
For example, in a dataset {4, 6, 9, 3, 7}, the range is 9 - 3 = 6.
Variance
Variance is a statistical measure that quantifies the amount of variation or
dispersion of a set of numbers from their mean value. Specifically, variance is
defined as the expected value of the squared deviation from the mean. It is
calculated by:
1. Finding the mean (average) of the data set.
2. Subtracting the mean from each data point to get the deviations
from the mean.
3. Squaring each of the deviations.
4. Calculating the average of the squared deviations. This is the
variance.
Standard Deviation
Standard deviation is a measure of the amount of variation or dispersion of a
set of values from the mean value. It is calculated as the square root of the
variance, which is the average squared deviation from the mean.
Interquartile Range (IQR)
Interquartile Range (IQR) is a measure of statistical dispersion that
represents the middle 50% of a data set. It is calculated as the difference
between the 75th percentile (Q3) and the 25th percentile (Q1) of the data i.e.,
IQR = Q3 − Q1.
Examples for Dispersion
Let's consider the same dataset of daily temperatures recorded over a week:
22°C, 23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
Range: Maximum temperature - Minimum temperature = 25°C - 20°C = 5°C
Variance: Variance = (Sum of squared differences from the mean) / (Number
of data points)
Mean ≈ 22.43 °C
Sum of squared differences from the mean = (22 − 22.43)² + (23 − 22.43)² +
(21 − 22.43)² + (25 − 22.43)² + (22 − 22.43)² + (24 − 22.43)² + (20 − 22.43)²
= (−0.43)² + (0.57)² + (−1.43)² + (2.57)² + (−0.43)² + (1.57)² + (−2.43)²
= 0.185 + 0.325 + 2.045 + 6.605 + 0.185 + 2.465 + 5.905
≈ 17.71
Thus, Variance = 17.71 / 7 ≈ 2.53 °C²
Standard Deviation: Take the square root of the variance to get the standard
deviation.
Thus, Standard Deviation ≈ √2.53 ≈ 1.59 °C
Interquartile Range (IQR): First Quartile (Q1) = 21°C Third Quartile (Q3) =
24°C
Thus, IQR = Q3 − Q1 = 24°C - 21°C = 3°C
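These hand calculations can be reproduced with Python's standard statistics module, as in the sketch below; note that statistics.quantiles uses the "exclusive" quartile convention by default, which matches Q1 = 21 and Q3 = 24 here, while other tools may interpolate quartiles differently.
import statistics

temps = [22, 23, 21, 25, 22, 24, 20]   # daily temperatures in °C

print("Mean:", statistics.mean(temps))            # ≈ 22.43
print("Median:", statistics.median(temps))        # 22
print("Mode:", statistics.mode(temps))            # 22
print("Range:", max(temps) - min(temps))          # 5
print("Variance:", statistics.pvariance(temps))   # population variance ≈ 2.53
print("Std dev:", statistics.pstdev(temps))       # ≈ 1.59

q1, _, q3 = statistics.quantiles(temps, n=4)      # exclusive method: Q1 = 21, Q3 = 24
print("IQR:", q3 - q1)                            # 3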
Differences between Central Tendency and Dispersion
The key differences between central tendency and dispersion are summarized
below:
● Definition – Central tendency indicates the typical or central value around
which the data points in a dataset tend to cluster; dispersion indicates the
spread or variability of the data values around that central value.
● Purpose – Central tendency provides a single representative value
summarizing the dataset; dispersion describes how spread out the values are
from each other and from the central value.
● Examples – Central tendency: mean, median, mode. Dispersion: range,
variance, standard deviation, interquartile range.
● Calculation – Central tendency is calculated using the values of the dataset;
dispersion is calculated using deviations from the central value (usually the
mean).
● Interpretation – Central tendency helps in understanding the centre of the
data distribution; dispersion helps in understanding the spread of the data
distribution.
● Measure of location – Central tendency indicates where the data is centered;
dispersion indicates how widely the data is spread out.
1.5 Sampling Funnel, Central Limit Theorem, Confidence Interval,
Sampling Variation
The Central Limit Theorem in Statistics states that as the sample size
increases and its variance is finite, then the distribution of the sample mean
approaches normal distribution irrespective of the shape of the population
distribution.
The central limit theorem posits that the distribution of sample means will
invariably conform to a normal distribution provided the sample size is
sufficiently large. This holds regardless of the underlying distribution of the
population, be it normal, Poisson, binomial, or any alternative distribution.
In this section on the Central Limit Theorem, we will learn about the central
limit theorem definition, examples, formulas, proof of the Central Limit
Theorem, and its applications.
Central Limit Theorem
The Central Limit Theorem explains that the sampling distribution of the
sample mean resembles the normal distribution irrespective of whether the
variables themselves are normally distributed or not. The Central Limit
Theorem is often abbreviated as CLT.
Central Limit Theorem Definition:
The Central Limit Theorem states that:
When large samples (usually greater than thirty) are taken into
consideration, the distribution of the sample arithmetic mean approaches
the normal distribution irrespective of whether the random variables were
originally normally distributed or not.
Central Limit Theorem Formula
Let us assume we have a random variable X with mean μ and standard
deviation σ. As per the Central Limit Theorem, the sample mean X̄
approximately follows a normal distribution, X̄ ~ N(μ, σ/√n). The Z-score of
the sample mean is given as
Z = (X̄ − μ) / (σ/√n)
where X̄ is the sample mean, μ the population mean, σ the population
standard deviation, and n the sample size.
Central Limit Theorem Proof
Let X1, X2, X3, . . . , Xn be independent and identically distributed random
variables with mean zero (μ = 0) and variance one (σ² = 1).
The Z-score is given as Z = (X̄ − μ) / (σ/√n) = √n · X̄.
According to the Central Limit Theorem, Z approaches a normal distribution
as the value of n increases.
Let M(t) be the moment generating function of each Xi. Then
M(0) = 1
M'(0) = E(Xi) = μ = 0
M''(0) = E(Xi²) = 1
The moment generating function of Xi/√n is E[e^(tXi/√n)] = M(t/√n). Since
X1, X2, X3, . . . , Xn are independent, the moment generating function of
(X1 + X2 + X3 + . . . + Xn)/√n is [M(t/√n)]^n.
Let f(t) = log M(t). Then
f(0) = log M(0) = 0
f'(0) = M'(0)/M(0) = μ = 0
f''(0) = (M(0)·M''(0) − M'(0)²) / M(0)² = 1
Using L'Hôpital's rule, n·f(t/√n) → t²/2 as n → ∞, so
[M(t/√n)]^n = [e^(f(t/√n))]^n = e^(n·f(t/√n)) → e^(t²/2),
which is the moment generating function of the standard normal
distribution. This proves the Central Limit Theorem.
Central Limit Theorem Examples
Let's say we have a large sample of observations and each sample is
randomly produced and independent of other observations. Calculate the
average of the observations, thus having a collection of averages of
observations. Now as per the Central Limit Theorem, if the sample size is
adequately large, then the probability distribution of these sample averages
will approximate to a normal distribution.
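The following Python sketch (using numpy) illustrates this idea by drawing many samples from a deliberately skewed exponential population and checking that the sample means cluster around the population mean with spread close to σ/√n; the population and sample sizes are arbitrary choices made for the demonstration.
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 1 and standard deviation 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw many samples of size n and record each sample mean
n, num_samples = 50, 5_000
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(num_samples)])

print("Mean of sample means:", sample_means.mean())   # close to 1.0
print("Std of sample means:", sample_means.std())     # close to 1/sqrt(50) ≈ 0.141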
Assumptions of the Central Limit Theorem
The Central Limit Theorem is valid for the following conditions:
● The drawing of the sample from the population should be random.
● The drawing of the sample should be independent of each other.
● The sample size should not exceed ten percent of the total
population when sampling is done without replacement.
● Sample Size should be adequately large.
● CLT only holds for a population with finite variance.
Steps to Solve Problems on Central Limit Theorem
Problems on the Central Limit Theorem that involve >, < or 'between' can be
solved by the following steps:
● Step 1: First identify the >, < associated with the sample size,
population size, mean, and variance in the problem. There can also
be a 'between' associated with a range of two numbers.
● Step 2: Draw a graph with the mean as the centre.
● Step 3: Find the Z-score using the formula.
● Step 4: Refer to the Z-table to find the value of Z obtained in the
previous step.
● Step 5: If the problem involves '>', subtract the Z-score value from
0.5; if the problem involves '<', add 0.5 to the Z-score value; and if
the problem involves 'between', perform only steps 3 and 4.
● Step 6: The Z-score value is found for the sample mean X̄.
● Step 7: Convert the decimal value obtained in all three cases to a
percentage.
Mean of the Sample Mean
According to the Central Limit Theorem:
● If you have a population with a mean μ, the mean of the sample means
(also called the expected value of the sample mean) will be equal to the
population mean:
E(X̄) = μ
Standard Deviation of the Sample Mean
The standard deviation of the sample mean (often called the standard error)
describes how much the sample mean is expected to vary from the true population
mean. It is calculated using the population standard deviation σ and the
sample size n:
σ_X̄ = σ/√n
Central Limit Theorem Applications
The Central Limit Theorem is generally used to predict the characteristics of a
population from a set of samples. It can be applied in various fields. Some of
the applications of the Central Limit Theorem are mentioned below:
● The Central Limit Theorem is used by economists and data scientists
to draw conclusions about a population and build statistical models.
● The Central Limit Theorem is used by biologists to make accurate
predictions about the characteristics of a population from a set of
samples.
● Manufacturing industries use the Central Limit Theorem to predict the
overall proportion of defective items produced by selecting random
products from a sample.
● The Central Limit Theorem is used in surveys to predict the
characteristics of the population or the average response of the
population by analyzing a sample of obtained responses.
● The CLT can be used in machine learning to draw conclusions about
the performance of a model.
Confidence Interval (CI) is a range of values that estimates where the true
population value is likely to fall. Instead of just saying "The average height
of students is 165 cm", a confidence interval allows us to say "We are 95%
confident that the true average height is between 160 cm and 170 cm".
Before diving into confidence intervals you should be familiar with:
● t-test
● z-test
Interpreting Confidence Intervals
Let's say we take a sample of 50 students and calculate a 95% confidence
interval for their average height, which turns out to be 160–170 cm. This
means that if we repeatedly took similar samples, 95% of those intervals
would contain the true average height of all students in the population.
Confidence level tells us how sure we are that the true value is within a
calculated range. If we have to repeat the sampling process many times we
expect that a certain percentage of those intervals will include the true value.
● 90% Confidence: 90% of intervals would include the true
population value.
● 95% Confidence: 95% of intervals would include the true value
which is commonly used in data science.
● 99% Confidence: 99% of intervals would include the true value but
the intervals would be wider.
Why are Confidence Intervals Important in Data Science?
● They help to measure the uncertainty in predictions and estimates.
● They let data scientists report a reliable range instead of just a
single number.
● They are widely used in A/B testing, machine learning, and survey
analysis (covered later) to check whether results are meaningful.
Steps for Constructing a Confidence Interval
To calculate a confidence interval follow these simple 4 steps:
Step 1: Identify the sample problem.
Define the population parameter you want to estimate, e.g., the mean height
of students. Choose the right statistic, such as the sample mean.
Step 2: Select a confidence level.
In this step we select the confidence level; common choices are 90%, 95%,
or 99%. It represents how sure we are about our estimate.
Step 3: Find the margin of error.
To find the margin of error, you use the formula:
Margin of Error = Critical Value × Standard Error
The critical value is found using Z-tables (or T-tables for small samples).
First you choose the significance level (α), which is typically 0.05 for a 95%
confidence level. Then decide whether you are performing a one-tailed or
two-tailed test, with two-tailed being the more common choice. After this
you look up the corresponding value in the Z-table or T-table based on your
significance level and test type.
The standard error measures the variability of the sample and is calculated by
dividing the sample's standard deviation by the square root of the sample
size. Combining the critical value and standard error gives you the margin
of error, which tells you the range within which you expect the true value to
fall.
Step 4: Specify the confidence interval.
To find a confidence interval, we use this formula:
Confidence Interval = Point Estimate ± Margin of Error
The point estimate is usually the average or mean from your sample; it is the
best guess of the true value based on the sample data. The margin of error
tells you how much the sample data might vary from the true value, as
calculated in the previous step.
So when you add or subtract the margin of error from your point estimate,
you get a range. This range tells you where the true value is likely to fall.
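Putting the four steps together, here is a minimal Python helper (using scipy.stats for the critical value); the function name and the example numbers for student heights are illustrative assumptions, not part of any standard API.
import math
from scipy import stats

def confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    # Step 2: significance level from the chosen confidence level (two-tailed)
    alpha = 1 - confidence
    # Step 3: critical value (normal approximation) and standard error
    critical_value = stats.norm.ppf(1 - alpha / 2)
    standard_error = sample_std / math.sqrt(n)
    margin_of_error = critical_value * standard_error
    # Step 4: point estimate ± margin of error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample: mean height 165 cm, standard deviation 10 cm, n = 50
print(confidence_interval(165, 10, 50))   # roughly (162.2, 167.8)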
Types of Confidence Intervals
Some of the common types of Confidence Intervals are:
1. Confidence Interval for the Mean of Normally Distributed Data
When we want to find the mean of a population based on a sample we use
this method.
● If the sample size is small (less than 30) we use the T-distribution
because small samples tend to have more variability.
● If the sample size is large (more than 30) then we use the Z-
distribution because large samples tend to give more accurate
estimates.
2. Confidence Interval for Proportions
This type is used when estimating population proportions, such as the
percentage of people who like a product. Here we use the sample proportion,
the standard error, and the critical z-value to calculate the interval. It gives us
an idea of where the real value could fall based on the sample data.
3. Confidence Interval for Non-Normally Distributed Data
Sometimes the data you have isn't normally distributed, meaning it doesn't
follow a bell-shaped curve. In such cases, traditional confidence intervals are
not the best method. Instead we can use bootstrap methods. This involves
resampling the data many times to create different samples and then
calculating the confidence interval from those resamples.
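A minimal sketch of a percentile bootstrap confidence interval, assuming a skewed hypothetical sample generated with numpy; the sample values, resample count, and 95% level are illustrative choices only.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed sample (e.g., purchase amounts); values are illustrative only
data = rng.exponential(scale=50.0, size=40)

# Bootstrap: resample with replacement many times and record the mean of each resample
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(10_000)])

# 95% percentile bootstrap confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")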
Calculating Confidence Interval
After understanding t-tests and z-tests, we now move on to how to calculate
confidence intervals. To calculate a confidence interval you need two key
statistics:
● Mean (μ) – The arithmetic mean is the average of the numbers, defined
as the sum of the n values divided by n:
μ = (x1 + x2 + x3 + … + xn) / n
● Standard deviation (σ) – A measure of how spread out the numbers are,
defined as the square root of the average squared difference between
each value and the mean:
σ = √( Σ(xi − μ)² / n )
Once you have these, you can calculate the confidence interval using either
the t-distribution or the z-distribution, depending on the sample size and on
whether the population standard deviation is known.
A) Using t-distribution
When your sample size is small (typically n < 30) and you don't know the
population standard deviation, we use the t-distribution. This is commonly
seen in fields like A/B testing or when working with small datasets.
Consider the following example. A random sample of 10 UFC fighters was
taken and their weights were measured. The sample mean weight was 240 kg
and the sample standard deviation was 25 kg. Construct a 95% confidence
interval estimate for the true mean weight of all UFC fighters.
Step-by-Step Process:
● Degrees of Freedom (df):
For the t-distribution we first calculate the degrees of freedom:
df = n − 1 = 10 − 1 = 9
● Significance Level (α):
The confidence level (CL) is 95%, so the significance level is:
α = (1 − CL) / 2 = (1 − 0.95) / 2 = 0.025
● Find the t-value from the t-distribution table: From the t-table, for df = 9
and α = 0.025, the t-value is 2.262, which can be found using the table below.
(df)/(α) 0.1 0.05 0.025 ..
∞ 1.282 1.645 1.960 ..
1 3.078 6.314 12.706 ..
2 1.886 2.920 4.303 ..
: : : : ..
8 1.397 1.860 2.306 ..
9 1.383 1.833 2.262 ..
● Apply the t-value in the formula:
The formula for the confidence interval is:
μ ± t × (σ/√n)
Using the values:
240 ± 2.262 × (25/√10)
The confidence interval becomes:
(222.117, 257.883)
Therefore we are 95% confident that the true mean weight of UFC fighters is
between 222.117 kg and 257.883 kg.
This can be calculated using Python’s scipy.stats library to find the t-value
and perform the necessary calculations. The stats module provides various
statistical functions, probability distributions, and statistical tests.
import scipy.stats as stats
import math

# Sample statistics from the UFC fighter example
sample_mean = 240
sample_std_dev = 25
sample_size = 10
confidence_level = 0.95

# Degrees of freedom and two-tailed significance level
df = sample_size - 1
alpha = (1 - confidence_level) / 2

# Critical t-value for the given confidence level and degrees of freedom
t_value = stats.t.ppf(1 - alpha, df)

# Margin of error = t-value * standard error
margin_of_error = t_value * (sample_std_dev / math.sqrt(sample_size))

lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error
print(f"Confidence Interval: ({lower_limit}, {upper_limit})")
Output:
Confidence Interval: (222.1160773511857, 257.8839226488143)
B) Using Z-distribution
When the sample size is large (n > 30) or the population standard deviation
is known, the z-distribution is used. This is common in large-scale surveys
and market research.
Consider the following example. A random sample of 50 adult females was
taken and their RBC count was measured. The sample mean is 4.63 and the
standard deviation of the RBC count is 0.54. Construct a 95% confidence
interval estimate for the true mean RBC count in adult females.
Step-by-Step Process:
1. Find the mean and standard deviation given in the problem.
2. Find the z-value for the confidence level:
For a 95% confidence interval the z-value is 1.960.
3. Apply the z-value in the formula:
μ ± z × (σ/√n)
Using the values (common z-values are given in the table below):
Confidence Interval z-value
90% 1.645
95% 1.960
99% 2.576
The confidence interval becomes: 4.63 ± 1.960 × (0.54/√50) = (4.480, 4.780)
Therefore we are 95% confident that the true mean RBC count for adult
females is between 4.480 and 4.780.
Now let's implement this in Python (using numpy):
import numpy as np

# Sample statistics from the RBC count example
sample_mean = 4.63
std_dev = 0.54
sample_size = 50
confidence_level = 0.95

# Standard error of the mean
standard_error = std_dev / np.sqrt(sample_size)

# Critical z-value for a 95% confidence level (see table above)
z_value = 1.960

margin_of_error = z_value * standard_error
lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error
print(f"Confidence Interval: ({lower_limit:.3f}, {upper_limit:.3f})")
Output:
Confidence Interval: (4.480, 4.780)
Some key takeaways about confidence intervals are:
● Confidence Intervals are essential in data science to find the
uncertainty of estimates and make predictions more reliable.
● t-distribution is used for small sample sizes (n < 30) while z-
distribution is used for large sample sizes (n > 30).
● Confidence intervals help to make data-driven decisions by
providing a range instead of a single point estimate. This is
especially important in A/B testing, market research and machine
learning.
Sampling variation refers to the differences observed when multiple samples
are drawn from the same population. It's a natural consequence of taking a
subset of a larger group, as each sample is likely to have slightly different
characteristics than the population as a whole. Understanding sampling
variation is crucial in statistics for interpreting sample data and making
inferences about the population.
Elaboration:
Definition:
Sampling variation is the inherent variability that exists when different samples are
drawn from the same population. It's the fact that each sample will likely have a
slightly different makeup than the population, and even different samples will have
differences among them.
Why it matters:
In statistics, sampling variation is a key concept because it helps researchers
understand the uncertainty associated with sample data. By acknowledging and
quantifying sampling variation, researchers can make more accurate inferences
about the population based on the sample data.
Types of sampling variation:
There are different sources of sampling variation, including:
● Random variation: This is the natural fluctuation that occurs due to chance.
● Systematic variation: This occurs when a particular sample is consistently
different from the population, often due to bias in the sampling method.
Mitigating sampling variation:
Researchers can take steps to minimize sampling variation, including:
● Increasing sample size: Larger samples tend to have less variability, as
they are more representative of the population.
● Using appropriate sampling methods: Techniques like stratified sampling
can help to reduce variation within specific subgroups of the population.
● Using statistical tests: Researchers can use statistical tests to assess the
magnitude of sampling variation and determine the likelihood that observed
differences are due to chance or real population differences.
Examples:
● If you take several samples of students in a school to estimate the average
height, each sample will likely have slightly different average heights. This is
due to sampling variation.
● If you are testing a new drug, the effectiveness of the drug in different samples
of patients may vary, even if the drug is truly effective. This is due to sampling
variation.
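The sketch below (Python with numpy) makes the first example concrete by drawing several samples of 30 students from one hypothetical population of heights and printing each sample mean; every number shown is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population of 10,000 student heights in cm
population = rng.normal(loc=165, scale=8, size=10_000)

# Each sample of 30 students gives a slightly different mean: that is sampling variation
for i in range(5):
    sample = rng.choice(population, size=30, replace=False)
    print(f"Sample {i + 1} mean height: {sample.mean():.2f} cm")

print(f"Population mean height: {population.mean():.2f} cm")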