Unit 2 More Notes
The significance of data quality in today’s data-driven environment is impossible to overstate. As organizations, businesses, and individuals depend more and more on data for their work, its quality becomes ever more crucial.
This article gives a detailed idea of what data quality is, why it matters, and the essential procedures for making sure the data we use is correct, reliable, and appropriate for the purpose for which it was collected.
Data helps people and organizations make more informed decisions, significantly increasing
the likelihood of success. By all accounts, that seems to indicate that large amounts of data
are a good thing. However, that’s not always the case. Sometimes data is incomplete,
incorrect, redundant, or not applicable to the user’s needs.
But fortunately, we have the concept of data quality to help make the job easier. So let’s explore what data quality is, including its characteristics and best practices, and how we can use it to make data better.
In simple terms, data quality tells us how reliable a particular set of data is and whether it will be good enough for a user to employ in decision-making. This quality is typically measured in degrees rather than as a simple pass or fail.
Data quality measures the condition of data, relying on factors such as how useful it is to the
specific purpose, completeness, accuracy, timeliness (e.g., is it up to date?), consistency,
validity, and uniqueness.
Data quality analysts are responsible for conducting data quality assessments, which involve assessing and interpreting each data quality metric. The analyst then creates an aggregate score reflecting the data’s overall quality and gives the organization a percentage rating that shows how accurate the data is.
To put the definition in more direct terms, data quality indicates how good the data is and
how useful it is for the task at hand. But the term also refers to planning, implementing, and
controlling the activities that apply the needed quality management practices and techniques
required to ensure the data is actionable and valuable to the data consumers.
Now that you better understand what data quality is, let us look at its dimensions.
There are six primary, or core, dimensions to data quality. These are the metrics analysts use
to determine the data’s viability and its usefulness to the people who need it.
Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects and
events. Analysts should use verifiable sources to confirm accuracy, which is determined by how closely the values agree with verified, correct information sources.
Completeness
Completeness measures whether the data successfully delivers all of the values that are mandatory for its intended use.
Consistency
Data consistency describes the data’s uniformity as it moves across applications and networks
and when it comes from multiple sources. Consistency also means that the same datasets
stored in different locations should be the same and not conflict. Note that consistent data can
still be wrong.
Timeliness
Timely data is information that is readily available whenever it’s needed. This dimension also
covers keeping the data current; data should undergo real-time updates to ensure that it is
always available and accessible.
Uniqueness
Uniqueness means that no duplicated or redundant information overlaps across the datasets: no record in the dataset exists multiple times. Analysts use data cleansing and deduplication to address a low uniqueness score.
Validity
Data must be collected according to the organization’s defined business rules and parameters.
The information should also conform to the correct, accepted formats, and all dataset values
should fall within the proper range.
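For a concrete sense of how these dimensions can be quantified, here is a minimal Python sketch that scores completeness, uniqueness, and validity for a small table. The pandas library is assumed, and the column names and the 0-120 age rule are illustrative choices for the example, not a standard.

import pandas as pd

# Toy customer table; row 3 duplicates row 2 and the last age is implausible.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
    "age": [34, 29, 29, 250],
})

# Completeness: share of non-missing values across the whole table.
completeness = df.notna().mean().mean()

# Uniqueness: share of rows that are not duplicates of an earlier row.
uniqueness = 1 - df.duplicated().mean()

# Validity: share of ages satisfying an assumed business rule (0-120 years).
validity = df["age"].between(0, 120).mean()

print(f"Completeness {completeness:.0%}, uniqueness {uniqueness:.0%}, validity {validity:.0%}")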
People looking for ideas on how to improve data quality turn to data quality management for
answers. Data quality management aims to leverage a balanced set of solutions to prevent
future data quality issues and clean (and ideally eventually remove) data that fails to meet
data quality KPIs (Key Performance Indicators). These actions help businesses meet their
current and future objectives.
There is more to data quality than just data cleaning. With that in mind, here are the eight
mandatory disciplines used to prevent data quality problems and improve data quality by
cleansing the information of all bad data:
Data Governance
Data governance spells out the data policies and standards that determine the required data
quality KPIs and which data elements should be focused on. These standards also include
what business rules must be followed to ensure data quality.
Data Profiling
Data profiling is a methodology employed to understand all data assets that are part of data
quality management. Data profiling is crucial because many of the assets in question have
been populated by many different people over the years, adhering to different standards.
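A profiling pass can be as simple as summarising the types, missing values, and distinct values of each column. The sketch below assumes the data asset can be loaded into a pandas DataFrame; the file name is hypothetical.

import pandas as pd

df = pd.read_csv("customers.csv")        # hypothetical file name

print(df.dtypes)                         # data type of each column
print(df.isna().sum())                   # missing values per column
print(df.nunique())                      # distinct values per column
print(df.describe(include="all"))        # summary statistics for every column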
Data Matching
Data matching technology is based on match codes used to determine if two or more bits of
data describe the same real-world thing. For instance, say there’s a man named Michael
Jones. A customer dataset may have separate entries for Mike Jones, Mickey Jones, Jonesy,
Big Mike Jones, and Michael Jones, but they’re all describing one individual.
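As an illustration of the match-code idea (not a production matching algorithm), the sketch below derives a simplified key so that several of the name variants above collapse to the same value. The normalisation rules and alias list are assumptions chosen for this example.

import re

def match_code(name: str) -> str:
    name = name.lower()
    name = re.sub(r"\b(big|mr|mrs)\b", "", name)             # drop assumed nickname/title words
    name = re.sub(r"\bmike\b|\bmickey\b", "michael", name)   # map assumed aliases to a canonical form
    return re.sub(r"[^a-z]", "", name)                       # keep letters only as the match code

names = ["Mike Jones", "Mickey Jones", "Big Mike Jones", "Michael Jones"]
print({n: match_code(n) for n in names})
# All four variants produce the same code, flagging them as candidate duplicates.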
Data Quality Reporting
Information gathered from data profiling and data matching can be used to measure data quality KPIs. Reporting also involves operating a quality issue log, which documents known data issues and any follow-up data cleansing and prevention efforts.
Master Data Management (MDM)
Master Data Management frameworks are great resources for preventing data quality issues. MDM frameworks deal with product master data, location master data, and party master data.
Customer Data Integration (CDI)
Customer Data Integration involves compiling customer master data gathered via CRM applications and self-service registration sites. This information must be compiled into one source of truth.
Product Information Management (PIM)
Manufacturers and sellers of goods need to align their data quality KPIs with each other so that when customers order a product, it will be the same item at all stages of the supply chain. Thus, much of PIM involves creating a standardized way to receive and present product data.
Digital Asset Management (DAM)
Digital assets cover items like videos, text documents, images, and similar files used alongside product data. This discipline involves ensuring that all tags are relevant and that the digital assets themselves are of adequate quality.
Data analysts who strive to improve data quality need to follow best practices to meet their
objectives. Here are ten critical best practices to follow:
Make sure that top-level management is involved. Data analysts can resolve many data
quality issues through cross-departmental participation.
Include data quality activity management as part of your data governance framework. The
framework sets data policies and data standards, the required roles and offers a business
glossary.
Each data quality issue raised must begin with a root cause analysis. If you don’t address
the root cause of a data issue, the problem will inevitably appear again. Don’t just address
the symptoms of the disease; you need to cure the disease itself.
Maintain a data quality issue log. Each issue needs an entry, complete with information
regarding the assigned data owner, the involved data steward, the issue’s impact, the final
resolution, and the timing of any necessary proceedings.
Fill data owner and data steward roles from your company’s business side, and fill data custodian roles from either business or IT, wherever possible and wherever it makes the most sense.
Use examples of data quality disasters to raise awareness about the importance of data
quality. However, while anecdotes are great for illustrative purposes, you should rely on
fact-based impact and risk analysis to justify your solutions and their required funding.
Your organization’s business glossary must serve as the foundation for metadata
management.
Avoid typing in data where possible. Instead, explore cost-effective solutions for data
onboarding that employ third-party data sources that provide publicly available data. This
data includes items such as names, locations in general, company addresses and IDs, and
in some cases, individual people. When dealing with product data, use second-party data
from trading partners whenever you can.
When resolving data issues, make every effort to implement relevant processes and
technology that stops the problems from arising as close as possible to the data onboarding
point instead of depending on downstream data cleansing.
Establish data quality KPIs that work in tandem with the general KPIs for business
performance. Data quality KPIs, sometimes called Data Quality Indicators (DQIs), can
often be associated with data quality dimensions like uniqueness, completeness, and
consistency.
Data Cleaning
Data cleaning involves spotting and resolving potential data inconsistencies or
errors to improve your data quality. An error is any value (e.g., recorded weight) that
doesn’t reflect the true value (e.g., actual weight) of whatever is being measured.
Example: Quantitative research
You investigate whether a new drug reduces the effects of fatigue. You survey participants before and at the end of the drug treatment. Using closed-ended questions, you ask Likert-scale questions about participants’ experiences and symptoms on a 1-to-7 scale.
Errors are often inevitable, but cleaning your data helps you minimise them. If you
don’t remove or resolve these errors, you could end up with a false or invalid study
conclusion.
Example: Data errors
Most of the questions are framed positively, but some questions have negative frames to engage the participants.
Question: Please rate the extent to which you agree or disagree with these statements from 1 to 7.
Both questions measure the same thing: how respondents feel after waking up in the
morning. But the answers to negatively worded questions need to be reverse-coded
before analysis so that all answers are consistently in the same direction.
Reverse coding means flipping the number scale in the opposite direction so that an
extreme value (e.g., 1 or 7) means the same thing for each question.
If you forget to reverse-code these answers before analysis, you may end up with an
invalid conclusion because of data errors.
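A minimal sketch of reverse-coding in Python, assuming a 1-to-7 scale stored in a pandas DataFrame; the column names and values are illustrative.

import pandas as pd

scale_max = 7
df = pd.DataFrame({
    "q1_positive": [6, 2, 7],
    "q2_negative": [2, 6, 1],   # negatively framed item, needs reverse-coding
})

# Flip the scale: 1 becomes 7, 2 becomes 6, and so on.
df["q2_negative_reversed"] = (scale_max + 1) - df["q2_negative"]
print(df)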
With inaccurate or invalid data, you might make a Type I or II error in your
conclusion. These types of erroneous conclusions can be practically significant with
important consequences, because they lead to misplaced investments or missed
opportunities.
Example: Type I error
Based on the results, you make a Type I error. You conclude that the drug is effective when it’s not. Your organisation decides to invest in this new drug and people are prescribed the drug instead of effective therapies.
Clean data meet some requirements for high quality while dirty data are flawed in
one or more ways. Let’s compare dirty with clean data.
Dirty data              Clean data
Invalid                 Valid
Inaccurate              Accurate
Incomplete              Complete
Inconsistent            Consistent
Incorrectly formatted   Uniform
Valid data
Valid data conform to certain requirements for specific types of information (e.g.,
whole numbers, text, dates). Invalid data don’t match up with the possible values
accepted for that observation.
Example: Data validation
If you use data validation techniques, a date of birth on a form may only be recognised if it’s formatted a certain way, for example as dd-mm-yyyy. The day field will allow numbers up to 31, the month field up to 12, and the year field up to 2021. If any numbers exceed those values, the form won’t be submitted.
Without valid data, your data analysis procedures may not make sense. It’s best to
use data validation techniques to make sure your data are in the right formats before
you analyse them.
Accurate data
In measurement, accuracy refers to how close your observed value is to the true
value. While data validity is about the form of an observation, data accuracy is about
the actual content.
Example: Inaccurate data
In a survey, you ask respondents how often they do a particular activity, choosing from these response options:
Every day
Once a week
Biweekly
Once a month
Less than once a month
Never
Some of the respondents select ‘biweekly’ as their answer. But this word can mean
either twice a week or once every two weeks, and these are fairly different
frequencies.
You have no idea how each person interpreted this word, so your data are
inaccurate because of inadequate response items.
Complete data
Complete data are measured and recorded thoroughly. Incomplete data are
statements or records with missing information.
Reconstructing missing data isn’t easy to do. Sometimes, you might be able to
contact a participant and ask them to redo a survey or an interview, but you might
not get the answer that you would have otherwise.
Consistent data
Clean data are consistent across a dataset. For each member of your sample, the
data for different variables should line up to make sense logically.
Example: Inconsistent data
In your survey, you collect information about demographic variables, including age, ethnicity, education level, and socioeconomic status. One participant enters ’13’ for their age and PhD-level education as their highest attained degree.
These data are inconsistent because it’s highly unlikely for a 13-year-old to hold a
doctorate degree in your specific sample. It’s more likely that an incorrect age was
entered.
Unique data
In data collection, you may accidentally record data from the same participant twice.
Example: Duplicate entries
In an online survey, a participant fills in the questionnaire and hits enter twice to submit it. The data gets reported twice on your end.
It’s important to review your data for identical entries and remove any duplicate
entries in data cleaning. Otherwise, your data might be skewed.
Uniform data
Uniform data are reported using the same units of measure. If data aren’t all in the
same units, they need to be converted to a standard measure.
Example: Nonuniform data
In a survey, you ask participants to enter their gross salary in pounds. Some participants respond with their monthly salary, while others report their annual salary.
Unless you provide a time unit, participants may answer this question using different
time frames. You won’t know for sure whether they’re reporting their monthly or
annual salary.
Data cleaning is a difficult process because errors are hard to pinpoint once the data
are collected. You’ll often have no way of knowing if a data point reflects the actual
value of something accurately and precisely.
In practice, you may focus instead on finding and resolving data points that don’t
agree or fit with the rest of your dataset in more obvious ways. These data might be
missing values, outliers, incorrectly formatted, or irrelevant.
You can choose a few techniques for cleaning data based on what’s appropriate.
What you want to end up with is a valid, consistent, unique, and uniform dataset
that’s as complete as possible.
Not all of these steps will be relevant to every dataset. You can carefully apply data
cleaning techniques where necessary, with clear documentation of your processes
for transparency.
By documenting your workflow, you ensure that other people can review and
replicate your procedures.
Data validation
Data validation involves applying constraints to make sure you have valid and
consistent data. It’s usually applied even before you collect data, when designing
questionnaires or other measurement materials requiring manual data entry.
Different data validation constraints help you minimise the amount of data cleaning
you’ll need to do.
Data-type constraints: Values can only be accepted if they are of a certain type,
such as numbers or text.
Example: Data-type constraint
If a date is entered with both text and numbers (e.g., 20 March 2021), instead of just numbers (e.g., 20-03-2021), it will not be accepted.
Example: Range constraint
You design a questionnaire for a target population with ages ranging from 18 to 45. When reporting age, participants can only enter a value between 18 and 45 to proceed with the form.
Example: Mandatory constraint
Participants filling in a form must select a button that says ‘I consent’ to begin.
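The sketch below shows how the three constraints illustrated above (data type, range, and mandatory consent) might be enforced at entry time. The function name, field names, and error messages are illustrative assumptions.

from datetime import datetime

def validate_response(age: str, date_of_birth: str, consent: bool) -> list[str]:
    errors = []
    if not consent:                                    # mandatory constraint
        errors.append("Consent is required.")
    if not age.isdigit():                              # data-type constraint
        errors.append("Age must be a whole number.")
    elif not 18 <= int(age) <= 45:                     # range constraint
        errors.append("Age must be between 18 and 45.")
    try:
        datetime.strptime(date_of_birth, "%d-%m-%Y")   # dd-mm-yyyy format
    except ValueError:
        errors.append("Date of birth must be in dd-mm-yyyy format.")
    return errors

# One invalid submission: under-age, wrongly formatted date, no consent given.
print(validate_response("17", "20 March 2021", False))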
Data screening
Once you’ve collected your data, it’s best to create a backup of your original dataset
and store it safely. If you make any mistakes in your workflow, you can always start
afresh by duplicating the backup and working from the new copy of your dataset.
Data screening involves reviewing your dataset for inconsistent, invalid, missing, or
outlier data. You can do this manually or with statistical methods.
Turn each variable (measure) into a column and each case (participant) into a row.
Give your columns unique and logical names.
Remove any empty rows from your dataset.
Make note of these issues and consider how you’ll address them in your data
cleaning procedure.
You can get a rough idea of how your quantitative variable data are distributed by
visualising them. Boxplots and scatterplots can show how your data are distributed
and whether you have any extreme values. It’s important to check whether your
variables are normally distributed so that you can select appropriate statistical
tests for your research.
If your mean, median, and mode all differ from each other by a lot, there may be
outliers in the dataset that you should look into.
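As a quick screening sketch, assuming the data sit in a pandas DataFrame with an illustrative numeric column, you might combine summary statistics with a boxplot:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"response_time": [12, 14, 13, 15, 11, 95]})   # toy data; 95 looks extreme

print(df["response_time"].describe())                 # mean, quartiles, min and max
print(df["response_time"].agg(["mean", "median"]))    # compare central tendencies

df.boxplot(column="response_time")                    # extreme values appear as isolated points
plt.show()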
Data diagnosing
After a general overview, you can start getting into the nitty-gritty of your dataset.
You’ll need to create a standard procedure for detecting and treating different types
of data.
Without proper planning, you might end up cherry-picking only some data points to
clean, leading to a biased dataset.
Here we’ll focus on ways to deal with common problems in dirty data:
Duplicate data
Invalid data
Missing values
Outliers
De-duplication
De-duplication means detecting and removing any identical copies of data, leaving
only unique cases or participants in your dataset.
Example: De-duplication
You compile your data in a spreadsheet where the columns are the questions and the rows are the participants. Each row contains one participant’s data. You sort the data by a column and review the data row by row to check whether there are any identical rows. You remove identical copies of a row.
If duplicate data are left in the dataset, they will bias your results. Some participants’
data will be weighted more heavily than others’.
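If the data are in a pandas DataFrame, de-duplication can be a one-line operation. The sketch below is illustrative, with made-up column names.

import pandas as pd

df = pd.DataFrame({
    "participant": ["P01", "P02", "P02", "P03"],
    "score":       [5,     3,     3,     6],
})

deduped = df.drop_duplicates()                                   # remove exact duplicate rows
# Or key on an identifier if repeated submissions can differ slightly:
deduped_by_id = df.drop_duplicates(subset="participant", keep="first")

print(deduped)
print(deduped_by_id)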
Invalid data
Using data standardisation, you can identify and convert data from varying formats
into a uniform format.
Unlike data validation, you can apply standardisation techniques to your data after
you’ve collected it. This involves developing codes to convert your dirty data into
consistent and valid formats.
Data standardisation is helpful if you don’t have data constraints at data entry or if
your data have inconsistent formats.
Example: Invalid data
Using an open-ended question, you ask participants to report their age. Your responses contain a mix of numbers and text, with some typos:
23
twenty
19
eihgteen
22
String-matching methods
To standardise inconsistent data, you can use strict or fuzzy string-matching
methods to identify exact or close matches between your data and valid values.
A string is a sequence of characters. You compare your data strings to the valid
values you expect to obtain and then remove or transform the strings that don’t
match.
Strict string-matching: Any strings that don’t match the valid values exactly are
considered invalid.
Example: Strict string-matching
Your valid values include numbers between 18 and 45 and any correctly spelled words denoting numbers with the first letter capitalised. In this case, only 3 out of 5 values will be accepted with strict matching:
23 (accepted)
twenty (rejected: not capitalised)
19 (accepted)
eihgteen (rejected: misspelled)
22 (accepted)
Fuzzy string-matching: Strings that closely or approximately match valid values are
recognised and corrected.
Example: Fuzzy string-matching
Your valid values include numbers between 18 and 45 and any words denoting numbers. You use a computer program to allow any values that closely match these valid values in your dataset. For closely matching strings, your program checks how many edits are needed to change the string into a valid value, and if the number of edits is small enough, it makes those changes:
23 (accepted)
twenty (accepted)
19 (accepted)
eihgteen (corrected to 'eighteen' and accepted)
22 (accepted)
After matching, you can transform your text data into numbers so that all values are
consistently formatted.
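One possible implementation of this strict-then-fuzzy approach uses difflib from the Python standard library. The word list, the 18-45 range, and the 0.8 similarity cutoff are assumptions for the sketch.

import difflib

valid_words = ["eighteen", "nineteen", "twenty", "twenty-one", "twenty-two", "twenty-three"]
word_to_number = {"eighteen": 18, "nineteen": 19, "twenty": 20,
                  "twenty-one": 21, "twenty-two": 22, "twenty-three": 23}

def standardise(value: str):
    value = value.strip().lower()
    if value.isdigit() and 18 <= int(value) <= 45:       # strict match on numbers
        return int(value)
    if value in word_to_number:                          # strict match on words
        return word_to_number[value]
    close = difflib.get_close_matches(value, valid_words, n=1, cutoff=0.8)   # fuzzy match
    return word_to_number[close[0]] if close else None   # None means still invalid

print([standardise(v) for v in ["23", "twenty", "19", "eihgteen", "22"]])
# -> [23, 20, 19, 18, 22]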
Missing data
In any dataset, there’s usually some missing data. These cells appear blank in your
spreadsheet.
Random missing data result from data entry errors, inattention errors, or misreading of measures.
Non-random missing data result from confusing, badly designed, or inappropriate
measurements or questions.
Random missing data are usually left alone, while non-random missing data may
need removal or replacement.
With deletion, you remove participants with missing data from your analyses. But
your sample may become smaller than intended, so you might lose statistical power.
Example: Missing data removal
You decide to remove all participants with missing data from your survey dataset. This reduces your sample from 114 participants to 77.
Alternatively, you can use imputation to replace a missing value with another value
based on a reasonable estimate. You use other data to replace the missing value for
a more complete dataset.
It’s important to apply imputation with caution, because there’s a risk of bias or
inaccuracy.
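A minimal sketch of both options in pandas, using a made-up fatigue-score column; median imputation is just one reasonable estimate among several.

import numpy as np
import pandas as pd

df = pd.DataFrame({"fatigue_score": [4.0, np.nan, 5.0, 3.0, np.nan, 4.0]})

dropped = df.dropna()                                    # deletion: smaller sample
imputed = df.fillna(df["fatigue_score"].median())        # imputation: replace with the median

print(len(df), "->", len(dropped), "rows after deletion")
print(imputed)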
Outliers
Outliers are extreme values that differ from most other data points in a dataset.
Outliers can be true values or errors.
True outliers should always be retained, because these just represent natural
variations in your sample. For example, athletes training for a 100-metre Olympic
sprint have much higher speeds than most people in the population. Their sprint
speeds are natural outliers.
Outliers can also result from measurement errors, data entry errors, or
unrepresentative sampling. For example, an extremely low sprint time could be
recorded if you misread the timer.
Detecting outliers
Outliers are always at the extreme ends of any variable dataset. Common ways to detect them include:
Sorting your values from low to high and checking minimum and maximum values
Visualising your data in a boxplot and searching for outliers
Using statistical procedures to identify extreme values
In general, you should try to accept outliers as much as possible unless it’s clear that
they represent errors or bad data.
It’s important to document each outlier you remove and the reasons so that other
researchers can follow your procedures.
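One common statistical procedure for flagging candidate outliers is the interquartile range (IQR) rule, sketched below on made-up sprint times; the 1.5 × IQR threshold is a convention, not a universal rule.

import pandas as pd

sprint_times = pd.Series([12.1, 11.8, 12.4, 12.0, 11.9, 3.2])   # seconds; 3.2 looks like a timer misread

q1, q3 = sprint_times.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sprint_times[(sprint_times < lower) | (sprint_times > upper)]
print(outliers)   # document each flagged value and why it was kept or removed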
Inferential Statistics
Inferential statistics is a branch of statistics that involves making predictions or
inferences about a population based on a sample of data taken from that
population. It is used to analyze the probabilities, assumptions, and outcomes of
a hypothesis.
Commonly used inferential tests and techniques include the following:
T-tests: Used when comparing the means of two groups to see if they’re
significantly different.
Analysis of Variance (ANOVA): Used to compare the means of more than
two groups.
Regression Analysis: Used to predict the value of one variable (dependent)
based on the value of another variable (independent).
Chi-square test for independence: Used to test if there is a significant
association between two categorical variables.
Pearson’s correlation: Used to test if there is a significant linear
relationship between two continuous variables.
Nonparametric Inferential Statistics
These are methods used when the data do not meet the requirements for parametric statistics, for example when the data are not normally distributed. Common nonparametric methods include the Mann-Whitney U test, the Wilcoxon signed-rank test, the Kruskal-Wallis test, and Spearman's rank correlation.
Hypothesis Testing:
Hypothesis testing often involves calculating a test statistic, which is then
compared to a critical value to decide whether to reject the null hypothesis.
Chi-Square Test:
The Chi-Square Test is used when dealing with categorical data. The test statistic is
χ² = Σ [ (Observed − Expected)² / Expected ]
where Observed is the observed frequency and Expected is the expected frequency in each category.
t-test:
The t-test is used to compare the means of two groups. One common form of the independent samples t statistic is
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
where x̄₁ and x̄₂ are the sample means, s₁² and s₂² are the sample variances, and n₁ and n₂ are the sample sizes of the two groups.
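As an illustration of how these tests are run in practice, here is a short sketch using scipy.stats; the group values and the contingency table are made up for the example.

import numpy as np
from scipy import stats

# Independent samples t-test: compare fatigue scores of treatment vs. control groups.
treatment = np.array([3.1, 2.8, 3.5, 2.9, 3.0])
control   = np.array([4.2, 3.9, 4.5, 4.1, 4.0])
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Chi-square test of independence on a 2x2 contingency table of observed counts.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")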
Inferential statistics, despite its many benefits, has some limitations: conclusions are only as reliable as the sample they are based on, results are probabilistic rather than certain (there is always a risk of Type I or Type II errors), and violated test assumptions can invalidate the results.
Retail Analytics
With many retail businesses struggling to adapt to the rise of ecommerce and shifts in consumer preferences, a data-driven approach enabled by retail analytics has become a competitive necessity.
Retailers can use data insights to better understand customers, optimize operations, and strategically position themselves to thrive in an increasingly complex retail landscape.
However, many retailers still face challenges with harnessing the true potential of their data.
Research conducted by the Retail Industry Leaders Association reveals that a mere 20% of
retailers fully leverage the capabilities of data analytics.
Common problems include data silos, poor data quality, lack of skills, and limited ability to
derive actionable insights. Legacy systems and fragmented architecture also hinder advanced
analytics initiatives.
Despite these challenges, leading retailers are pushing the boundaries of what's possible with
analytics. From predictive modeling to experimentation at scale, techniques like
reinforcement learning and graph analytics are helping leading retailers create hyper-
personalized customer experiences.
Still, most retailers have yet to tap into the full opportunity. Success requires strong executive
backing, organizational change management, updated data infrastructure, and teams
with specialized analytics skills.
With the right vision and investment, analytics can help transform retail operations, e-
commerce performance, customer experience, and overall profitability.
Retail organizations need to aggregate and analyze data from across all these sources to
enable impactful retail analytics programs. Integrating disparate data silos provides a single
source of truth and 360-degree customer view.
Here are some of the most important KPIs for retail analytics:
Sales
Total sales revenue is, of course, a crucial metric. But more granular sales data is hugely
beneficial too. Look at sales by channel, store, product category/SKU, brand, campaign, and
more.
Identify high and low performers to optimize marketing, merchandising, and inventory.
Traffic
In-store and website traffic reveal how many customers you attract and how engagement
differs by location and marketing efforts. Traffic KPIs include:
Website sessions and users
Store foot traffic
Traffic source (on-site search, referrals, social, email, etc.)
Conversion Rate
The percentage of visitors that convert into customers is a powerful measure of sales
effectiveness. Conversion
rate KPIs include:
Website conversion rate
Sales conversion rate per store
Email campaign conversion rates
Call center conversion rates
Basket Size
For online and in-store purchases, average basket size shows the number of items purchased
per transaction. Track this metric by traffic source, store location, and other segments.
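As a small illustration, the sketch below computes a conversion rate and average basket size from toy transaction data; the column names and figures are assumptions.

import pandas as pd

sessions = 12_000                           # website sessions in the period (assumed)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "items":    [2, 5, 1, 3],
    "revenue":  [40.0, 120.0, 15.0, 60.0],
})

conversion_rate = len(orders) / sessions    # orders divided by sessions
avg_basket_size = orders["items"].mean()    # items per transaction
avg_order_value = orders["revenue"].mean()

print(f"Conversion rate: {conversion_rate:.2%}")
print(f"Average basket size: {avg_basket_size:.1f} items")
print(f"Average order value: {avg_order_value:.2f}")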
Monitoring these essential KPIs provides insight into sales, traffic, customer behavior,
and more to enhance decision-making across retail organizations. Leveraging retail analytics
platforms makes accessing and acting on KPIs easier than ever.
Data Visualization Best Practices
Data visualization can make insights accessible and engaging for decision-makers
across the organization.
However, not all visualizations are created equal in effectively communicating data stories
and driving key actions. Keep these best practices in mind when designing your retail
analytics dashboards and reports:
Start with the why. Define what insights are needed and what decisions the visual
should inform before selecting the type of visualization. Pick the type of chart that will best
highlight the relationships and patterns in your data. Example types: bar charts, line
graphs, interactive
maps, scatter plots, etc.
Simplify layouts. Avoid clutter and remove elements not critical to conveying the key
information. Use titles, legends, and annotations strategically, not just as decorations. Every
added data point and design element should serve a purpose. Keep color schemes minimal
and consistent.
Prioritize key indicators. Draw attention to the most significant metrics and
dimensions. Visualize no more than 3-5 KPIs per chart, and use size, labels, and position
deliberately.
Make comparisons clear. If charting changes over time, include context like
benchmarks or goals. Encode metrics consistently across charts. Align axes across multiple
charts for easier visual comparison.
Draw attention to insights. Use highlighting, annotations, reference lines, or visual
elements to direct focus to key takeaways, anomalies, or opportunities hidden in the data.
Don't just make your audience sift through all the data unguided.
Consider functionality. Enable sorting, filtering, or drill-down interaction to allow
users to explore the data from different perspectives. But avoid overly flashy, distracting
animations.
Following visualization best practices allows you to turn your retail data into impactful
stories that drive smarter decisions.
Forecasting
Forecasting models can help retailers predict future sales, inventory needs, and customer
behavior. Time series models like ARIMA are useful for short-term forecasts based on
historical sales patterns.
Causal forecasting looks at predictors like promotions, price changes, and events to estimate
future outcomes. AI-powered forecasting tools can surface nonlinear patterns to improve
accuracy. With better forecasts, retailers can align inventory, supply chain, marketing, and
operations.
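A short illustrative sketch with statsmodels is shown below; the weekly sales series and the (1, 1, 1) order are assumptions, and a real model would need proper order selection and validation.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

weekly_sales = pd.Series(
    [210, 225, 240, 238, 260, 275, 270, 290, 305, 300, 320, 335],
    index=pd.date_range("2024-01-07", periods=12, freq="W"),
)

model = ARIMA(weekly_sales, order=(1, 1, 1))   # (p, d, q) chosen for illustration only
fitted = model.fit()
print(fitted.forecast(steps=4))                # next four weeks of predicted sales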
Segmentation
Dividing customers into segments allows for more personalized marketing and
merchandising. Retailers can segment by demographics, purchase history, channel
preferences, geography, and many other factors.
Advanced segmentation may incorporate machine learning to find hidden patterns.
Dynamic segmentation analyzes behavior in real time. For example, if a customer's web
browsing shows intent for an upcoming trip, they can be targeted with relevant offers.
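As one illustration of machine-learning segmentation, the sketch below clusters customers with k-means on two behavioural features; the features, data, and choice of three segments are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: annual spend, orders per year (made-up values).
customers = np.array([
    [200,  2], [250,  3], [1800, 20], [2100, 24],
    [900, 10], [950, 12], [220,  2], [1950, 22],
])

scaled = StandardScaler().fit_transform(customers)            # put features on a common scale
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)   # segment label per customer, e.g. low / mid / high value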
Predictions
Retail analytics isn't just about reporting on the past. Advanced models can make predictions
about the future.
For example, uplift modeling identifies customer response to specific offers, churn models
predict the likelihood of customer attrition, and recommendation engines forecast which
products each customer will likely purchase.
Predictions enable retailers to take proactive actions to influence future outcomes. The most
accurate models involve a combination of machine-learning algorithms and human insight.
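For illustration, here is a minimal churn-model sketch with scikit-learn; the features, data, and model choice are assumptions, not a recommended production setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: days since last purchase, orders in the last year; 1 = churned.
X = np.array([[200, 1], [15, 12], [180, 2], [10, 15], [90, 5], [240, 1], [5, 20], [120, 3]])
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

new_customers = np.array([[150, 2], [20, 9]])
print(model.predict_proba(new_customers)[:, 1])   # predicted probability of churn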
With the right data foundation in place, advanced analytics takes retail intelligence to the next
level. Techniques like forecasting, segmentation, and predictions move retailers from reactive
to proactive.
Prescriptive analytics can even recommend the optimal actions to achieve desired
outcomes. However, advanced analytics requires investment in skilled analysts, modeling
tools, and change management. The high-impact insights uncovered are well worth the effort
for forward-thinking retailers.
Data Silos
Retailers often have data spread across various systems and databases like POS, e-commerce, CRM, inventory management, etc.
Bringing this disparate data together into a unified view is critical for getting a complete
picture of customers and performance. Lack of data integration leads to blindspots and lost
opportunities.
Dirty Data
Bad data quality hampers the accuracy and reliability of retail analytics. Issues like
incomplete data, errors, outdated information, duplication, etc., can undermine insights.
Retailers need to invest in data governance, cleansing, and management to trust the
output of analytics.
Not Actionable
Retail analytics that does not lead to any action loses its purpose. Reports and dashboards full
of vanity metrics fail to impact decisions or outcomes. Analytics should be connected with
operational workflows and key business priorities to drive real value.