Unit 2 More Notes
The significance of data quality in today’s data-driven environment is impossible to overstate. As organizations, businesses, and individuals depend more and more on data for their work, its quality becomes ever more crucial.
This article gives a detailed idea of what data quality is, why it matters, and the essential procedures for making sure the data we use is correct, reliable, and appropriate for the purpose for which it was collected.
Data helps people and organizations make more informed decisions, significantly increasing
the likelihood of success. By all accounts, that seems to indicate that large amounts of data
are a good thing. However, that’s not always the case. Sometimes data is incomplete,
incorrect, redundant, or not applicable to the user’s needs.
But fortunately, we have the concept of data quality to help make the job easier. So let’s explore what data quality is, including its characteristics and best practices, and how we can use it to make data better.
In simple terms, data quality tells us how reliable a particular set of data is and whether it will be good enough for a user to employ in decision-making. This quality is typically measured in degrees rather than as a simple pass or fail.
Data quality measures the condition of data, relying on factors such as how useful it is to the
specific purpose, completeness, accuracy, timeliness (e.g., is it up to date?), consistency,
validity, and uniqueness.
Data quality analysts are responsible for conducting data quality assessments, which involve assessing and interpreting each data quality metric. The analyst then creates an aggregate score reflecting the data’s overall quality and gives the organization a percentage rating that shows how accurate the data is.
To put the definition in more direct terms, data quality indicates how good the data is and
how useful it is for the task at hand. But the term also refers to planning, implementing, and
controlling the activities that apply the needed quality management practices and techniques
required to ensure the data is actionable and valuable to the data consumers.
Now that you better understand what data quality is, let us look at its dimensions.
There are six primary, or core, dimensions to data quality. These are the metrics analysts use
to determine the data’s viability and its usefulness to the people who need it.
Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects and
events. Analysts should use verifiable sources to confirm accuracy, which is determined by how closely the values agree with verified, correct information sources.
Completeness
Completeness measures whether the data successfully delivers all of the values that are mandatory for its intended use.
Consistency
Data consistency describes the data’s uniformity as it moves across applications and networks
and when it comes from multiple sources. Consistency also means that the same datasets
stored in different locations should be the same and not conflict. Note that consistent data can
still be wrong.
Timeliness
Timely data is information that is readily available whenever it’s needed. This dimension also
covers keeping the data current; data should undergo real-time updates to ensure that it is
always available and accessible.
Uniqueness
Uniqueness means that no duplicated or redundant information overlaps across the datasets: no record in the dataset exists multiple times. Analysts use data cleansing and deduplication to address a low uniqueness score.
Validity
Data must be collected according to the organization’s defined business rules and parameters.
The information should also conform to the correct, accepted formats, and all dataset values
should fall within the proper range.
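For a concrete sense of how these dimensions can be quantified, here is a minimal Python sketch that scores completeness, uniqueness, and validity for a small table. The pandas library is assumed, and the column names and the 0-120 age rule are illustrative choices for the example, not a standard.

import pandas as pd

# Toy customer table; row 3 duplicates row 2 and the last age is implausible.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
    "age": [34, 29, 29, 250],
})

# Completeness: share of non-missing values across the whole table.
completeness = df.notna().mean().mean()

# Uniqueness: share of rows that are not duplicates of an earlier row.
uniqueness = 1 - df.duplicated().mean()

# Validity: share of ages satisfying an assumed business rule (0-120 years).
validity = df["age"].between(0, 120).mean()

print(f"Completeness {completeness:.0%}, uniqueness {uniqueness:.0%}, validity {validity:.0%}")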
People looking for ideas on how to improve data quality turn to data quality management for
answers. Data quality management aims to leverage a balanced set of solutions to prevent
future data quality issues and clean (and ideally eventually remove) data that fails to meet
data quality KPIs (Key Performance Indicators). These actions help businesses meet their
current and future objectives.
There is more to data quality than just data cleaning. With that in mind, here are the eight
mandatory disciplines used to prevent data quality problems and improve data quality by
cleansing the information of all bad data:
Data Governance
Data governance spells out the data policies and standards that determine the required data
quality KPIs and which data elements should be focused on. These standards also include
what business rules must be followed to ensure data quality.
Data Profiling
Data profiling is a methodology employed to understand all data assets that are part of data
quality management. Data profiling is crucial because many of the assets in question have
been populated by many different people over the years, adhering to different standards.
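A profiling pass can be as simple as summarising the types, missing values, and distinct values of each column. The sketch below assumes the data asset can be loaded into a pandas DataFrame; the file name is hypothetical.

import pandas as pd

df = pd.read_csv("customers.csv")        # hypothetical file name

print(df.dtypes)                         # data type of each column
print(df.isna().sum())                   # missing values per column
print(df.nunique())                      # distinct values per column
print(df.describe(include="all"))        # summary statistics for every column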
Data Matching
Data matching technology is based on match codes used to determine if two or more bits of
data describe the same real-world thing. For instance, say there’s a man named Michael
Jones. A customer dataset may have separate entries for Mike Jones, Mickey Jones, Jonesy,
Big Mike Jones, and Michael Jones, but they’re all describing one individual.
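As an illustration of the match-code idea (not a production matching algorithm), the sketch below derives a simplified key so that several of the name variants above collapse to the same value. The normalisation rules and alias list are assumptions chosen for this example.

import re

def match_code(name: str) -> str:
    name = name.lower()
    name = re.sub(r"\b(big|mr|mrs)\b", "", name)             # drop assumed nickname/title words
    name = re.sub(r"\bmike\b|\bmickey\b", "michael", name)   # map assumed aliases to a canonical form
    return re.sub(r"[^a-z]", "", name)                       # keep letters only as the match code

names = ["Mike Jones", "Mickey Jones", "Big Mike Jones", "Michael Jones"]
print({n: match_code(n) for n in names})
# All four variants produce the same code, flagging them as candidate duplicates.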
Data Quality Reporting
Information gathered from data profiling and data matching can be used to measure data quality KPIs. Reporting also involves operating a quality issue log, which documents known data issues and any follow-up data cleansing and prevention efforts.
Master Data Management (MDM)
Master Data Management frameworks are great resources for preventing data quality issues. MDM frameworks deal with product master data, location master data, and party master data.
Customer Data Integration (CDI)
Customer Data Integration involves compiling customer master data gathered via CRM applications and self-service registration sites. This information must be compiled into one source of truth.
Product Information Management (PIM)
Manufacturers and sellers of goods need to align their data quality KPIs with each other so that when customers order a product, it will be the same item at all stages of the supply chain. Thus, much of PIM involves creating a standardized way to receive and present product data.
Digital Asset Management (DAM)
Digital assets cover items like videos, text documents, images, and similar files used alongside product data. This discipline involves ensuring that all tags are relevant and that the digital assets themselves are of adequate quality.
Data analysts who strive to improve data quality need to follow best practices to meet their
objectives. Here are ten critical best practices to follow:
Make sure that top-level management is involved. Data analysts can resolve many data
quality issues through cross-departmental participation.
Include data quality activity management as part of your data governance framework. The
framework sets data policies and data standards, the required roles and offers a business
glossary.
Each data quality issue raised must begin with a root cause analysis. If you don’t address
the root cause of a data issue, the problem will inevitably appear again. Don’t just address
the symptoms of the disease; you need to cure the disease itself.
Maintain a data quality issue log. Each issue needs an entry, complete with information
regarding the assigned data owner, the involved data steward, the issue’s impact, the final
resolution, and the timing of any necessary proceedings.
Fill data owner and data steward roles from your company’s business side, and fill data custodian roles from either business or IT, wherever possible and wherever it makes the most sense.
Use examples of data quality disasters to raise awareness about the importance of data
quality. However, while anecdotes are great for illustrative purposes, you should rely on
fact-based impact and risk analysis to justify your solutions and their required funding.
Your organization’s business glossary must serve as the foundation for metadata
management.
Avoid typing in data where possible. Instead, explore cost-effective solutions for data
onboarding that employ third-party data sources that provide publicly available data. This
data includes items such as names, locations in general, company addresses and IDs, and
in some cases, individual people. When dealing with product data, use second-party data
from trading partners whenever you can.
When resolving data issues, make every effort to implement relevant processes and
technology that stops the problems from arising as close as possible to the data onboarding
point instead of depending on downstream data cleansing.
Establish data quality KPIs that work in tandem with the general KPIs for business
performance. Data quality KPIs, sometimes called Data Quality Indicators (DQIs), can
often be associated with data quality dimensions like uniqueness, completeness, and
consistency.
Data Cleaning
Data cleaning involves spotting and resolving potential data inconsistencies or
errors to improve your data quality. An error is any value (e.g., recorded weight) that
doesn’t reflect the true value (e.g., actual weight) of whatever is being measured.
Example: Quantitative research
You investigate whether a new drug reduces the effects of fatigue. You survey participants before and at the end of the drug treatment. Using closed-ended questions, you ask Likert-scale questions about participants’ experiences and symptoms on a 1-to-7 scale.
Errors are often inevitable, but cleaning your data helps you minimise them. If you
don’t remove or resolve these errors, you could end up with a false or invalid study
conclusion.
Example: Data errors
Most of the questions are framed positively, but some questions have negative frames to engage the participants.
Question: Please rate the extent to which you agree or disagree with these statements from 1 to 7.
Both questions measure the same thing: how respondents feel after waking up in the
morning. But the answers to negatively worded questions need to be reverse-coded
before analysis so that all answers are consistently in the same direction.
Reverse coding means flipping the number scale in the opposite direction so that an
extreme value (e.g., 1 or 7) means the same thing for each question.
If you forget to reverse-code these answers before analysis, you may end up with an
invalid conclusion because of data errors.
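A minimal sketch of reverse-coding in Python, assuming a 1-to-7 scale stored in a pandas DataFrame; the column names and values are illustrative.

import pandas as pd

scale_max = 7
df = pd.DataFrame({
    "q1_positive": [6, 2, 7],
    "q2_negative": [2, 6, 1],   # negatively framed item, needs reverse-coding
})

# Flip the scale: 1 becomes 7, 2 becomes 6, and so on.
df["q2_negative_reversed"] = (scale_max + 1) - df["q2_negative"]
print(df)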
With inaccurate or invalid data, you might make a Type I or II error in your
conclusion. These types of erroneous conclusions can be practically significant with
important consequences, because they lead to misplaced investments or missed
opportunities.
Example: Type I error
Based on the results, you make a Type I error. You conclude that the drug is effective when it’s not. Your organisation decides to invest in this new drug and people are prescribed the drug instead of effective therapies.
Clean data meet some requirements for high quality while dirty data are flawed in
one or more ways. Let’s compare dirty with clean data.
Dirty data              Clean data
Invalid                 Valid
Inaccurate              Accurate
Incomplete              Complete
Inconsistent            Consistent
Incorrectly formatted   Uniform
Valid data
Valid data conform to certain requirements for specific types of information (e.g.,
whole numbers, text, dates). Invalid data don’t match up with the possible values
accepted for that observation.
Example: Data validation
If you use data validation techniques, a date of birth on a form may only be recognised if it’s formatted a certain way, for example as dd-mm-yyyy. The day field will allow numbers up to 31, the month field up to 12, and the year field up to 2021. If any numbers exceed those values, the form won’t be submitted.
Without valid data, your data analysis procedures may not make sense. It’s best to
use data validation techniques to make sure your data are in the right formats before
you analyse them.
Accurate data
In measurement, accuracy refers to how close your observed value is to the true
value. While data validity is about the form of an observation, data accuracy is about
the actual content.
Example: Inaccurate data
In a survey, you ask respondents how often they do a particular activity, choosing from these response options:
Every day
Once a week
Biweekly
Once a month
Less than once a month
Never
Some of the respondents select ‘biweekly’ as their answer. But this word can mean
either twice a week or once every two weeks, and these are fairly different
frequencies.
You have no idea how each person interpreted this word, so your data are
inaccurate because of inadequate response items.
Complete data
Complete data are measured and recorded thoroughly. Incomplete data are
statements or records with missing information.
Reconstructing missing data isn’t easy to do. Sometimes, you might be able to
contact a participant and ask them to redo a survey or an interview, but you might
not get the answer that you would have otherwise.
Consistent data
Clean data are consistent across a dataset. For each member of your sample, the
data for different variables should line up to make sense logically.
Example: Inconsistent data
In your survey, you collect information about demographic variables, including age, ethnicity, education level, and socioeconomic status. One participant enters ’13’ for their age and PhD-level education as their highest attained degree.
These data are inconsistent because it’s highly unlikely for a 13-year-old to hold a
doctorate degree in your specific sample. It’s more likely that an incorrect age was
entered.
Unique data
In data collection, you may accidentally record data from the same participant twice.
Example: Duplicate entries
In an online survey, a participant fills in the questionnaire and hits enter twice to submit it. The data gets reported twice on your end.
It’s important to review your data for identical entries and remove any duplicate
entries in data cleaning. Otherwise, your data might be skewed.
Uniform data
Uniform data are reported using the same units of measure. If data aren’t all in the
same units, they need to be converted to a standard measure.
Example: Nonuniform data
In a survey, you ask participants to enter their gross salary in pounds. Some participants respond with their monthly salary, while others report their annual salary.
Unless you provide a time unit, participants may answer this question using different
time frames. You won’t know for sure whether they’re reporting their monthly or
annual salary.
Data cleaning is a difficult process because errors are hard to pinpoint once the data
are collected. You’ll often have no way of knowing if a data point reflects the actual
value of something accurately and precisely.
In practice, you may focus instead on finding and resolving data points that don’t
agree or fit with the rest of your dataset in more obvious ways. These data might be
missing values, outliers, incorrectly formatted, or irrelevant.
You can choose a few techniques for cleaning data based on what’s appropriate.
What you want to end up with is a valid, consistent, unique, and uniform dataset
that’s as complete as possible.
Not all of these steps will be relevant to every dataset. You can carefully apply data
cleaning techniques where necessary, with clear documentation of your processes
for transparency.
By documenting your workflow, you ensure that other people can review and
replicate your procedures.
Data validation
Data validation involves applying constraints to make sure you have valid and
consistent data. It’s usually applied even before you collect data, when designing
questionnaires or other measurement materials requiring manual data entry.
Different data validation constraints help you minimise the amount of data cleaning
you’ll need to do.
Data-type constraints: Values can only be accepted if they are of a certain type,
such as numbers or text.
Example: Data-type constraint
If a date is entered with both text and numbers (e.g., 20 March 2021), instead of just numbers (e.g., 20-03-2021), it will not be accepted.
Example: Range constraint
You design a questionnaire for a target population with ages ranging from 18 to 45. When reporting age, participants can only enter a value between 18 and 45 to proceed with the form.
Example: Mandatory constraint
Participants filling in a form must select a button that says ‘I consent’ to begin.
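The sketch below shows how the three constraints illustrated above (data type, range, and mandatory consent) might be enforced at entry time. The function name, field names, and error messages are illustrative assumptions.

from datetime import datetime

def validate_response(age: str, date_of_birth: str, consent: bool) -> list[str]:
    errors = []
    if not consent:                                    # mandatory constraint
        errors.append("Consent is required.")
    if not age.isdigit():                              # data-type constraint
        errors.append("Age must be a whole number.")
    elif not 18 <= int(age) <= 45:                     # range constraint
        errors.append("Age must be between 18 and 45.")
    try:
        datetime.strptime(date_of_birth, "%d-%m-%Y")   # dd-mm-yyyy format
    except ValueError:
        errors.append("Date of birth must be in dd-mm-yyyy format.")
    return errors

# One invalid submission: under-age, wrongly formatted date, no consent given.
print(validate_response("17", "20 March 2021", False))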
Data screening
Once you’ve collected your data, it’s best to create a backup of your original dataset
and store it safely. If you make any mistakes in your workflow, you can always start
afresh by duplicating the backup and working from the new copy of your dataset.
Data screening involves reviewing your dataset for inconsistent, invalid, missing, or
outlier data. You can do this manually or with statistical methods.
Turn each variable (measure) into a column and each case (participant) into a row.
Give your columns unique and logical names.
Remove any empty rows from your dataset.
Make note of these issues and consider how you’ll address them in your data
cleaning procedure.
You can get a rough idea of how your quantitative variable data are distributed by
visualising them. Boxplots and scatterplots can show how your data are distributed
and whether you have any extreme values. It’s important to check whether your
variables are normally distributed so that you can select appropriate statistical
tests for your research.
If your mean, median, and mode all differ from each other by a lot, there may be
outliers in the dataset that you should look into.
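As a quick screening sketch, assuming the data sit in a pandas DataFrame with an illustrative numeric column, you might combine summary statistics with a boxplot:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"response_time": [12, 14, 13, 15, 11, 95]})   # toy data; 95 looks extreme

print(df["response_time"].describe())                 # mean, quartiles, min and max
print(df["response_time"].agg(["mean", "median"]))    # compare central tendencies

df.boxplot(column="response_time")                    # extreme values appear as isolated points
plt.show()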
Data diagnosing
After a general overview, you can start getting into the nitty-gritty of your dataset.
You’ll need to create a standard procedure for detecting and treating different types
of data.
Without proper planning, you might end up cherry-picking only some data points to
clean, leading to a biased dataset.
Here we’ll focus on ways to deal with common problems in dirty data:
Duplicate data
Invalid data
Missing values
Outliers
De-duplication
De-duplication means detecting and removing any identical copies of data, leaving
only unique cases or participants in your dataset.
Example: De-duplication
You compile your data in a spreadsheet where the columns are the questions and the rows are the participants. Each row contains one participant’s data. You sort the data by a column and review the data row by row to check whether there are any identical rows. You remove identical copies of a row.
If duplicate data are left in the dataset, they will bias your results. Some participants’
data will be weighted more heavily than others’.
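If the data are in a pandas DataFrame, de-duplication can be a one-line operation. The sketch below is illustrative, with made-up column names.

import pandas as pd

df = pd.DataFrame({
    "participant": ["P01", "P02", "P02", "P03"],
    "score":       [5,     3,     3,     6],
})

deduped = df.drop_duplicates()                                   # remove exact duplicate rows
# Or key on an identifier if repeated submissions can differ slightly:
deduped_by_id = df.drop_duplicates(subset="participant", keep="first")

print(deduped)
print(deduped_by_id)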
Invalid data
Using data standardisation, you can identify and convert data from varying formats
into a uniform format.
Unlike data validation, you can apply standardisation techniques to your data after
you’ve collected it. This involves developing codes to convert your dirty data into
consistent and valid formats.
Data standardisation is helpful if you don’t have data constraints at data entry or if
your data have inconsistent formats.
Example: Invalid data
Using an open-ended question, you ask participants to report their age. Your responses contain a mix of numbers and text, with some typos:
23
twenty
19
eihgteen
22
String-matching methods
To standardise inconsistent data, you can use strict or fuzzy string-matching
methods to identify exact or close matches between your data and valid values.
A string is a sequence of characters. You compare your data strings to the valid
values you expect to obtain and then remove or transform the strings that don’t
match.
Strict string-matching: Any strings that don’t match the valid values exactly are
considered invalid.
Example: Strict string-matching
Your valid values include numbers between 18 and 45 and any correctly spelled words denoting numbers with the first letter capitalised. In this case, only 3 out of 5 values will be accepted with strict matching:
23 (accepted)
twenty (rejected: not capitalised)
19 (accepted)
eihgteen (rejected: misspelled)
22 (accepted)
Fuzzy string-matching: Strings that closely or approximately match valid values are
recognised and corrected.
Example: Fuzzy string-matching
Your valid values include numbers between 18 and 45 and any words denoting numbers. You use a computer program to allow any values that closely match these valid values in your dataset. For closely matching strings, your program checks how many edits are needed to change the string into a valid value, and if the number of edits is small enough, it makes those changes:
23 (accepted)
twenty (accepted)
19 (accepted)
eihgteen (corrected to 'eighteen' and accepted)
22 (accepted)
After matching, you can transform your text data into numbers so that all values are
consistently formatted.
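One possible implementation of this strict-then-fuzzy approach uses difflib from the Python standard library. The word list, the 18-45 range, and the 0.8 similarity cutoff are assumptions for the sketch.

import difflib

valid_words = ["eighteen", "nineteen", "twenty", "twenty-one", "twenty-two", "twenty-three"]
word_to_number = {"eighteen": 18, "nineteen": 19, "twenty": 20,
                  "twenty-one": 21, "twenty-two": 22, "twenty-three": 23}

def standardise(value: str):
    value = value.strip().lower()
    if value.isdigit() and 18 <= int(value) <= 45:       # strict match on numbers
        return int(value)
    if value in word_to_number:                          # strict match on words
        return word_to_number[value]
    close = difflib.get_close_matches(value, valid_words, n=1, cutoff=0.8)   # fuzzy match
    return word_to_number[close[0]] if close else None   # None means still invalid

print([standardise(v) for v in ["23", "twenty", "19", "eihgteen", "22"]])
# -> [23, 20, 19, 18, 22]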
Missing data
In any dataset, there’s usually some missing data. These cells appear blank in your
spreadsheet.
Random missing data result from data entry errors, inattention errors, or misreading of measures.
Non-random missing data result from confusing, badly designed, or inappropriate
measurements or questions.
Random missing data are usually left alone, while non-random missing data may
need removal or replacement.
With deletion, you remove participants with missing data from your analyses. But
your sample may become smaller than intended, so you might lose statistical power.
Example: Missing data removal
You decide to remove all participants with missing data from your survey dataset. This reduces your sample from 114 participants to 77.
Alternatively, you can use imputation to replace a missing value with another value
based on a reasonable estimate. You use other data to replace the missing value for
a more complete dataset.
It’s important to apply imputation with caution, because there’s a risk of bias or
inaccuracy.
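A minimal sketch of both options in pandas, using a made-up fatigue-score column; median imputation is just one reasonable estimate among several.

import numpy as np
import pandas as pd

df = pd.DataFrame({"fatigue_score": [4.0, np.nan, 5.0, 3.0, np.nan, 4.0]})

dropped = df.dropna()                                    # deletion: smaller sample
imputed = df.fillna(df["fatigue_score"].median())        # imputation: replace with the median

print(len(df), "->", len(dropped), "rows after deletion")
print(imputed)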
Outliers
Outliers are extreme values that differ from most other data points in a dataset.
Outliers can be true values or errors.
True outliers should always be retained, because these just represent natural
variations in your sample. For example, athletes training for a 100-metre Olympic
sprint have much higher speeds than most people in the population. Their sprint
speeds are natural outliers.
Outliers can also result from measurement errors, data entry errors, or
unrepresentative sampling. For example, an extremely low sprint time could be
recorded if you misread the timer.
Detecting outliers
Outliers are always at the extreme ends of any variable dataset. Common ways to detect them include:
Sorting your values from low to high and checking minimum and maximum values
Visualising your data in a boxplot and searching for outliers
Using statistical procedures to identify extreme values
In general, you should try to accept outliers as much as possible unless it’s clear that
they represent errors or bad data.
It’s important to document each outlier you remove and the reasons so that other
researchers can follow your procedures.
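One common statistical procedure for flagging candidate outliers is the interquartile range (IQR) rule, sketched below on made-up sprint times; the 1.5 × IQR threshold is a convention, not a universal rule.

import pandas as pd

sprint_times = pd.Series([12.1, 11.8, 12.4, 12.0, 11.9, 3.2])   # seconds; 3.2 looks like a timer misread

q1, q3 = sprint_times.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sprint_times[(sprint_times < lower) | (sprint_times > upper)]
print(outliers)   # document each flagged value and why it was kept or removed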
Inferential Statistics
Inferential statistics is a branch of statistics that involves making predictions or
inferences about a population based on a sample of data taken from that
population. It is used to analyze the probabilities, assumptions, and outcomes of
a hypothesis.
Commonly used inferential tests and techniques include the following:
T-tests: Used when comparing the means of two groups to see if they’re
significantly different.
Analysis of Variance (ANOVA): Used to compare the means of more than
two groups.
Regression Analysis: Used to predict the value of one variable (dependent)
based on the value of another variable (independent).
Chi-square test for independence: Used to test if there is a significant
association between two categorical variables.
Pearson’s correlation: Used to test if there is a significant linear
relationship between two continuous variables.
Nonparametric Inferential Statistics
These are methods used when the data do not meet the requirements for parametric statistics, for example when the data are not normally distributed. Common nonparametric methods include the Mann-Whitney U test, the Wilcoxon signed-rank test, the Kruskal-Wallis test, and Spearman's rank correlation.
Hypothesis Testing:
Hypothesis testing often involves calculating a test statistic, which is then
compared to a critical value to decide whether to reject the null hypothesis.
Chi-Square Test:
The Chi-Square Test is used when dealing with categorical data. The test statistic is
χ² = Σ [ (Observed − Expected)² / Expected ]
where Observed is the observed frequency and Expected is the expected frequency in each category.
t-test:
The t-test is used to compare the means of two groups. One common form of the independent samples t statistic is
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
where x̄₁ and x̄₂ are the sample means, s₁² and s₂² are the sample variances, and n₁ and n₂ are the sample sizes of the two groups.
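As an illustration of how these tests are run in practice, here is a short sketch using scipy.stats; the group values and the contingency table are made up for the example.

import numpy as np
from scipy import stats

# Independent samples t-test: compare fatigue scores of treatment vs. control groups.
treatment = np.array([3.1, 2.8, 3.5, 2.9, 3.0])
control   = np.array([4.2, 3.9, 4.5, 4.1, 4.0])
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Chi-square test of independence on a 2x2 contingency table of observed counts.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")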
Inferential statistics, despite its many benefits, has some limitations: conclusions are only as reliable as the sample they are based on, results are probabilistic rather than certain (there is always a risk of Type I or Type II errors), and violated test assumptions can invalidate the results.
Retail Analytics
With many retail businesses struggling to adapt to the rise of ecommerce and shifts in consumer preferences, a data-driven approach enabled by retail analytics has become a competitive necessity.
Retailers can use data insights to better understand customers, optimize operations, and strategically position themselves to thrive in an increasingly complex retail landscape.
However, many retailers still face challenges with harnessing the true potential of their data.
Research conducted by the Retail Industry Leaders Association reveals that a mere 20% of
retailers fully leverage the capabilities of data analytics.
Common problems include data silos, poor data quality, lack of skills, and limited ability to
derive actionable insights. Legacy systems and fragmented architecture also hinder advanced
analytics initiatives.
Despite these challenges, leading retailers are pushing the boundaries of what's possible with
analytics. From predictive modeling to experimentation at scale, techniques like
reinforcement learning and graph analytics are helping leading retailers create hyper-
personalized customer experiences.
Still, most retailers have yet to tap into the full opportunity. Success requires strong executive
backing, organizational change management, updated data infrastructure, and teams
with specialized analytics skills.
With the right vision and investment, analytics can help transform retail operations, e-
commerce performance, customer experience, and overall profitability.
Retail organizations need to aggregate and analyze data from across all these sources to
enable impactful retail analytics programs. Integrating disparate data silos provides a single
source of truth and 360-degree customer view.
Here are some of the most important KPIs for retail analytics:
Sales
Total sales revenue is, of course, a crucial metric. But more granular sales data is hugely
beneficial too. Look at sales by channel, store, product category/SKU, brand, campaign, and
more.
Identify high and low performers to optimize marketing, merchandising, and inventory.
Traffic
In-store and website traffic reveal how many customers you attract and how engagement
differs by location and marketing efforts. Traffic KPIs include:
Website sessions and users
Store foot traffic
Traffic source (on-site search, referrals, social, email, etc.)
Conversion Rate
The percentage of visitors that convert into customers is a powerful measure of sales
effectiveness. Conversion
rate KPIs include:
Website conversion rate
Sales conversion rate per store
Email campaign conversion rates
Call center conversion rates
Basket Size
For online and in-store purchases, average basket size shows the number of items purchased
per transaction. Track this metric by traffic source, store location, and other segments.
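As a small illustration, the sketch below computes a conversion rate and average basket size from toy transaction data; the column names and figures are assumptions.

import pandas as pd

sessions = 12_000                           # website sessions in the period (assumed)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "items":    [2, 5, 1, 3],
    "revenue":  [40.0, 120.0, 15.0, 60.0],
})

conversion_rate = len(orders) / sessions    # orders divided by sessions
avg_basket_size = orders["items"].mean()    # items per transaction
avg_order_value = orders["revenue"].mean()

print(f"Conversion rate: {conversion_rate:.2%}")
print(f"Average basket size: {avg_basket_size:.1f} items")
print(f"Average order value: {avg_order_value:.2f}")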
Monitoring these essential KPIs provides insight into sales, traffic, customer behavior,
and more to enhance decision-making across retail organizations. Leveraging retail analytics
platforms makes accessing and acting on KPIs easier than ever.
Data Visualization Best Practices
Data visualization can make insights accessible and engaging for decision-makers
across the organization.
However, not all visualizations are created equal in effectively communicating data stories
and driving key actions. Keep these best practices in mind when designing your retail
analytics dashboards and reports:
Start with the why. Define what insights are needed and what decisions the visual
should inform before selecting the type of visualization. Pick the type of chart that will best
highlight the relationships and patterns in your data. Example types: bar charts, line
graphs, interactive
maps, scatter plots, etc.
Simplify layouts. Avoid clutter and remove elements not critical to conveying the key
information. Use titles, legends, and annotations strategically, not just as decorations. Every
added data point and design element should serve a purpose. Keep color schemes minimal
and consistent.
Prioritize key indicators. Draw attention to the most significant metrics and
dimensions. Visualize no more than 3-5 KPIs per chart, and use size, labels, and position
deliberately.
Make comparisons clear. If charting changes over time, include context like
benchmarks or goals. Encode metrics consistently across charts. Align axes across multiple
charts for easier visual comparison.
Draw attention to insights. Use highlighting, annotations, reference lines, or visual
elements to direct focus to key takeaways, anomalies, or opportunities hidden in the data.
Don't just make your audience sift through all the data unguided.
Consider functionality. Enable sorting, filtering, or drill-down interaction to allow
users to explore the data from different perspectives. But avoid overly flashy, distracting
animations.
Following visualization best practices allows you to turn your retail data into impactful
stories that drive smarter decisions.
Forecasting
Forecasting models can help retailers predict future sales, inventory needs, and customer
behavior. Time series models like ARIMA are useful for short-term forecasts based on
historical sales patterns.
Causal forecasting looks at predictors like promotions, price changes, and events to estimate
future outcomes. AI-powered forecasting tools can surface nonlinear patterns to improve
accuracy. With better forecasts, retailers can align inventory, supply chain, marketing, and
operations.
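A short illustrative sketch with statsmodels is shown below; the weekly sales series and the (1, 1, 1) order are assumptions, and a real model would need proper order selection and validation.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

weekly_sales = pd.Series(
    [210, 225, 240, 238, 260, 275, 270, 290, 305, 300, 320, 335],
    index=pd.date_range("2024-01-07", periods=12, freq="W"),
)

model = ARIMA(weekly_sales, order=(1, 1, 1))   # (p, d, q) chosen for illustration only
fitted = model.fit()
print(fitted.forecast(steps=4))                # next four weeks of predicted sales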
Segmentation
Dividing customers into segments allows for more personalized marketing and
merchandising. Retailers can segment by demographics, purchase history, channel
preferences, geography, and many other factors.
Advanced segmentation may incorporate machine learning to find hidden patterns.
Dynamic segmentation analyzes behavior in real time. For example, if a customer's web
browsing shows intent for an upcoming trip, they can be targeted with relevant offers.
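As one illustration of machine-learning segmentation, the sketch below clusters customers with k-means on two behavioural features; the features, data, and choice of three segments are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: annual spend, orders per year (made-up values).
customers = np.array([
    [200,  2], [250,  3], [1800, 20], [2100, 24],
    [900, 10], [950, 12], [220,  2], [1950, 22],
])

scaled = StandardScaler().fit_transform(customers)            # put features on a common scale
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)   # segment label per customer, e.g. low / mid / high value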
Predictions
Retail analytics isn't just about reporting on the past. Advanced models can make predictions
about the future.
For example, uplift modeling identifies customer response to specific offers, churn models
predict the likelihood of customer attrition, and recommendation engines forecast which
products each customer will likely purchase.
Predictions enable retailers to take proactive actions to influence future outcomes. The most
accurate models involve a combination of machine-learning algorithms and human insight.
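For illustration, here is a minimal churn-model sketch with scikit-learn; the features, data, and model choice are assumptions, not a recommended production setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: days since last purchase, orders in the last year; 1 = churned.
X = np.array([[200, 1], [15, 12], [180, 2], [10, 15], [90, 5], [240, 1], [5, 20], [120, 3]])
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

new_customers = np.array([[150, 2], [20, 9]])
print(model.predict_proba(new_customers)[:, 1])   # predicted probability of churn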
With the right data foundation in place, advanced analytics takes retail intelligence to the next
level. Techniques like forecasting, segmentation, and predictions move retailers from reactive
to proactive.
Prescriptive analytics can even recommend the optimal actions to achieve desired
outcomes. However, advanced analytics requires investment in skilled analysts, modeling
tools, and change management. The high-impact insights uncovered are well worth the effort
for forward-thinking retailers.
Data Silos
Retailers often have data spread across various systems and databases like POS, e-commerce, CRM, inventory management, etc.
Bringing this disparate data together into a unified view is critical for getting a complete
picture of customers and performance. Lack of data integration leads to blindspots and lost
opportunities.
Dirty Data
Bad data quality hampers the accuracy and reliability of retail analytics. Issues like
incomplete data, errors, outdated information, duplication, etc., can undermine insights.
Retailers need to invest in data governance, cleansing, and management to trust the
output of analytics.
Not Actionable
Retail analytics that does not lead to any action loses its purpose. Reports and dashboards full
of vanity metrics fail to impact decisions or outcomes. Analytics should be connected with
operational workflows and key business priorities to drive real value.