Module 10
Enforcing Data Quality
Module Overview
Introduction to Data Quality
Using Data Quality Services to Cleanse Data
• Using Data Quality Services to Match Data
Lesson 1: Introduction to Data Quality
What Is Data Quality and Why Do You Need It?
Data Quality Services Overview
What Is a Knowledge Base?
What Is a Domain?
What Is a Reference Data Service?
Creating a Knowledge Base
• Demonstration: Creating a Knowledge Base
What Is Data Quality and Why Do You Need It?
• Business decisions should be made on trusted
data
• Data quality issues in sources can be propagated
into the data warehouse:
• Invalid data values
• Inconsistencies
• Duplicate business entities
Data Quality Services Overview
• DQS is a knowledge-based solution for:
• Data cleansing
• Data matching
• DQS Components:
• Server
• Client
• Data cleansing SSIS transformation
What Is a Knowledge Base?
• Repository of knowledge about data:
• Domains define values and rules for each field
• Matching policies define rules for identifying duplicate
records
• Determine data for a DQS knowledge base:
• Analyze source databases and data warehouses for
inconsistencies, inaccuracies, and incompleteness
• Audit website and software forms used for data entry
to find free-form fields prone to creating low quality
data
• Look at dependent reporting systems and find
incorrect results
What Is a Domain?
• Domains:
• Are specific to a data field
• Contain the rules for the data
• Can be individual or composite
What Is a Reference Data Service?
• The Azure Marketplace hosts specialist data
cleansing providers, where you can:
• Set up an account
• Subscribe to a reference service
• Map your domain to the reference service
Creating a Knowledge Base
• Creating a knowledge base is an iterative
process:
1. Knowledge discovery
2. Domain management
Demonstration: Creating a Knowledge Base
In this demonstration, you will see how to:
• Create a knowledge base
• Perform knowledge discovery
• Perform domain management
Lesson 2: Using Data Quality Services to Cleanse
Data
Creating a Data Cleansing Project
Viewing Cleansed Data
Demonstration: Cleansing Data
• Using the Data Cleansing Data Flow
Transformation
Creating a Data Cleansing Project
1. Select a knowledge base
2. Map columns to domains
3. Review suggestions and corrections
4. Export results
Viewing Cleansed Data
• Output: the values for all fields after data
cleansing
• Source: the original value for fields that were
mapped to domains and cleansed
• Reason: the reason the output value was
selected by the cleansing operation
• Confidence: an indication of the confidence Data
Quality Services estimates for corrected values
• Status: the status of the output column (correct
or corrected)
Demonstration: Cleansing Data
In this demonstration, you will see how to:
• Create a data cleansing project
• View cleansed data
Using the Data Cleansing Data Flow
Transformation
• Input data to be cleansed
• Select knowledge base and map columns to
domains
• Output cleansed columns
Lab A: Cleansing Data
Exercise 1: Creating a DQS Knowledge Base
Exercise 2: Using a DQS Project to Cleanse Data
• Exercise 3: Using DQS in an SSIS Package
Logon Information
Virtual machine: 20767C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa55w.rd
Estimated Time: 30 minutes.
Lab Scenario
You have created an ETL solution for the
Adventure Works data warehouse, and invited
some data stewards to validate the process before
putting it into production.
The data stewards have noticed some data quality
issues in the staged customer data, and have
asked you to provide a way for them to cleanse
data, so that the data warehouse is based on
consistent and reliable data. The data stewards
have given you an Excel workbook containing
some examples of the issues found in the data.
Lab Review
Having completed this lab, you will now be able to:
• Create a DQS knowledge base
• Use DQS to cleanse data
• Incorporate data cleansing into an SSIS data flow
Lesson 3: Using Data Quality Services to Match
Data
Creating a Matching Policy
Creating a Data Matching Project
Viewing Data Matching Results
• Demonstration: Matching Data
Creating a Matching Policy
• Define matching rules for business entities
• Rules match entities based on domains:
Similarity:similar or exact match
Weight: percentage to apply if match succeeds
Prerequisite: mandatory domain match for rule to
succeed
• If the combined weight of all matches meets or
exceeds the rule’s minimum matching score, the
entities are duplicates
Creating a Data Matching Project
1. Select a knowledge base
2. Map columns to domains
3. Review match clusters
4. Export matches and survivors
• Select survivorship rule:
Pivot record
Most complete and longest record
Most complete record
Longest record
Viewing Data Matching Results
• Cluster ID: identifier for a cluster of matched
records
• Record ID: identifier for a matched record
• Matching Rule: the rule that produced the match
• Score: combined weighting of match criteria
• Pivot Mark: a matched record arbitrarily chosen
by Data Quality Services as the pivot record for a
cluster
Demonstration: Matching Data
In this demonstration, you will see how to:
• Create a matching policy
• Create a data matching project
• View data matching results
Lab B: Deduplicating Data
Exercise 1: Creating a Matching Policy
• Exercise 2: Using a DQS Project to Match Data
Logon Information
Virtual machine: 20767C-MIA-SQL
User name: ADVENTUREWORKS\Student
Password: Pa55w.rd
Estimated Time: 30 minutes
Lab Scenario
You have created a DQS knowledge base and
used it to cleanse customer data. However, data
stewards are concerned that the staged customer
data might include duplicate entries. For records
to be considered a match, the following criteria
must be true:
The Country/Region column must be an exact
match.
A total matching score of 80 or higher must be
achieved, based on the following weightings:
o An exact match of the Gender column has a
weighting of 10.
Lab Scenario (Continued)
o An exact match of the City column has a
weighting of 20.
o An exact match of the EmailAddress column
has a weighting of 30.
o A similar FirstName column value has a
weighting of 10.
o A similar LastName column value has a
weighting of 10.
o A similar AddressLine1 column value has a
weighting of 20.
Lab Review
Having completed this lab, you will now be able to:
• Add a matching policy to a DQS knowledge base
• Use DQS to match data
Module Review and Takeaways
• Review Question(s)