Data Quality for Data Lakes
Reference Architecture

Avoid creating a data swamp by taking logical steps to enhance data quality in the data lake. The iterative process will ensure gradual improvement in the quality of data during data engineering. A collaborative approach across data users such as data engineers, data scientists and data analysts is key to success.

[Diagram: Informatica Data Quality applied to a Data Lake. Sources: Streaming (machine data, IoT, log files, social, mobile, apps), On-Premises (Mainframe, Application Servers, Databases, Documents, Data Warehouse, Hadoop) and SaaS (ERP, CRM). Data is ingested into the Data Lake, which is divided into a Landing Zone, an Enrichment Zone and an Enterprise Zone. Steps 1 to 3 (Profile, Build Rules, Measure Initial) appear under the Landing Zone; steps 4 to 6 (Set Dictionaries, Cleanse Data, Handle Exceptions) under the Enrichment Zone; steps 7 and 8 (Measure Final, Publish Certified Data) under the Enterprise Zone. Flow labels: Ingest, Harmonize.]

1 Profile helps you understand data anomalies and discover data patterns.
2 Build Rules to validate whether data is fit for business needs.
3 Measure Initial KPIs to establish a baseline for data quality and to track historical trends.
4 Set Dictionaries to help standardize data across multiple systems.
5 Cleanse Data using business rules to help improve analytics and reduce time spent on data remediation.
6 Handle Exceptions as part of your daily load. Automate correction of data as much as possible and involve data owners.
7 Measure Final KPIs at the consumption layer to establish trust in the data being published for consumption.
8 Certified Data is the process of validating that the data is ready for business consumption; it also provides a mechanism to provision it.
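
As a rough illustration of how steps 1 to 7 relate to one another, the sketch below runs them over a handful of in-memory records using only the Python standard library. It is not Informatica Data Quality code and does not use any Informatica API. The sample records, the country dictionary, the two rules and all function names are hypothetical, chosen only to make the flow concrete.

```python
from collections import Counter

# Hypothetical sample records standing in for a landing-zone data set.
records = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "usa", "email": ""},
    {"id": 3, "country": "DE", "email": "c@example.com"},
    {"id": 4, "country": None, "email": "not-an-email"},
]

# Step 4 (Set Dictionaries): hypothetical mapping of variants to a standard code.
COUNTRY_DICTIONARY = {"us": "US", "usa": "US", "de": "DE"}
STANDARD_COUNTRIES = set(COUNTRY_DICTIONARY.values())


def profile(rows, column):
    """Step 1 (Profile): count missing values and value frequencies for a column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {"missing": len(values) - len(present), "frequencies": Counter(present)}


# Step 2 (Build Rules): each rule returns True when a row is fit for use.
RULES = {
    "country_is_standard": lambda row: row.get("country") in STANDARD_COUNTRIES,
    "email_has_at_sign": lambda row: "@" in (row.get("email") or ""),
}


def measure_kpis(rows):
    """Steps 3 and 7 (Measure KPIs): fraction of rows that pass each rule."""
    return {
        name: sum(1 for row in rows if rule(row)) / len(rows)
        for name, rule in RULES.items()
    }


def cleanse(rows):
    """Step 5 (Cleanse Data): standardize country values via the dictionary."""
    cleaned = []
    for row in rows:
        country = row.get("country")
        key = country.lower() if isinstance(country, str) else None
        # Values not found in the dictionary are left unchanged.
        cleaned.append({**row, "country": COUNTRY_DICTIONARY.get(key, country)})
    return cleaned


def split_exceptions(rows):
    """Step 6 (Handle Exceptions): separate rows that fail any rule."""
    passed, failed = [], []
    for row in rows:
        ok = all(rule(row) for rule in RULES.values())
        (passed if ok else failed).append(row)
    return passed, failed


country_profile = profile(records, "country")      # step 1
initial_kpis = measure_kpis(records)               # step 3: baseline
cleaned = cleanse(records)                         # steps 4 and 5
certified, exceptions = split_exceptions(cleaned)  # step 6
final_kpis = measure_kpis(cleaned)                 # step 7: after cleansing

print("initial:", initial_kpis)
print("final:  ", final_kpis)
print("exceptions for data owners:", [row["id"] for row in exceptions])
```

Comparing `initial_kpis` with `final_kpis` shows what cleansing changed for each rule. In this sketch, the rows left in `exceptions` are the ones that automated correction could not fix and that would be routed to data owners.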