1. Introduction to Data Management for Analytics
(1) Data Management Frameworks
Note: Review PDF 1.0, then do a Google search on the frameworks
- DAMA-DMBOK
(2) Types of Data Storage in Cloud Computing
2. Data Modelling & Design
- Data Modelling
● Process of learning about the data and constructing a visual representation of its parts
● Goal is to show the relationships between structures, data points, data groupings and attributes of the data
- Functional Dependencies
● Definition: Constraint that determines the relation of one attribute to another attribute in a database
- Denoted by an arrow →
- Example: X → Y
- In the example below, Employee Name/Salary/City are all functionally dependent on Employee Number, so we can say Employee Number → Employee Name/Salary/City
● Multivalued Dependency
- Definition: Occurs when two or more independent attributes in a table are each dependent on another attribute
- Example: maf_year and color are independent of each other
- However, both are dependent on car_model
- Therefore, both columns/attributes are ‘multivalued dependent’ on car_model
- We can denote the r/s as car_model →→ maf_year and car_model →→ color
● Trivial Functional Dependency
- Definition: Occurs when the attribute that has the dependency is a subset of the attribute it is dependent on.
- Example: X → Y is a trivial functional dependency if Y is a subset of X
- Example: (Emp_id, Emp_name) → Emp_id is a trivial functional dependency as Emp_id is a subset of (Emp_id, Emp_name)
● Non-Trivial Functional Dependency
- Definition: Basically where the attribute that has the dependency is not a subset of the attribute it is dependent on
- Example: Company → CEO. CEO is not a subset of Company. We must know the company before we know the CEO
- Similarly, CEO → Age. We must know who the CEO is before we
can tell his/her age
● Transitive Dependency
- Definition: Occurs between 3 or more attributes. Essentially, it’s an indirect non-trivial dependency
- Example: Company → Age is a transitive dependency
- We know Company → CEO and CEO → Age
- Therefore, Company → Age; we must know the company before we know the CEO, and we must know who the CEO is before we know the age (see the SQL sketch below)
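A minimal SQL sketch of the dependencies above, using hypothetical Employee and CompanyInfo tables (names and types are illustrative, not from the course data set):

```sql
-- Hypothetical Employee table: EmpNumber → EmpName, Salary, City
CREATE TABLE Employee (
    EmpNumber INT PRIMARY KEY,   -- determinant: fixes every other column
    EmpName   VARCHAR(100),
    Salary    DECIMAL(10, 2),
    City      VARCHAR(50)
);

-- Transitive dependency: Company → CEO and CEO → Age, therefore Company → Age
CREATE TABLE CompanyInfo (
    Company VARCHAR(100) PRIMARY KEY,
    CEO     VARCHAR(100),   -- Company → CEO
    CEOAge  INT             -- CEO → Age; normalization would move CEO/Age into their own table
);
```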
- Normalization
● Refers to the process of restructuring tables to remove redundant data and undesirable dependencies (partial and transitive)
● Purpose is to avoid data redundancy and insertion, update & deletion anomalies
- Process of Normalization
● From 0NF to 1NF
- MatrixNum is the unique key for all the records
- We can say that the rest of the columns are ‘functionally dependent’ on MatrixNum
- From 0NF to 1NF, we remove the nesting/grouping so that the unique key associates 1-to-1 with the rest of the columns
● From 1NF to 2NF
- Essentially you want to remove partial functional dependencies in the table
- A partial functional dependency arises when the table needs more than one key column working together (a composite key) to uniquely identify a record, but some attributes depend on only part of that key
- See example below → MatrixNum relates to Name/Programme. EnrollNum relates to Semester, AcadYear, Course, CourseName.
- MatrixNum + EnrollNum together relate to Result
- To normalize the data → You need to split it into separate tables
● From 2NF to 3NF
- Idea here is that you want to remove transitive dependencies from the table
- In the 2NF table below, EnrollNum is associated with Course and Course is associated with CourseName
- So the transitive dependency is EnrollNum → Course → CourseName
- To normalize the data → You need to split Course & CourseName into a separate mapping table (see the SQL sketch below)
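A sketch of the 2NF and 3NF splits in SQL, assuming the column names used in the lecture example (MatrixNum, EnrollNum, etc.); data types are illustrative:

```sql
-- 2NF: split the partial dependencies into their own tables
CREATE TABLE Student (
    MatrixNum VARCHAR(10) PRIMARY KEY,
    Name      VARCHAR(100),
    Programme VARCHAR(100)
);

CREATE TABLE Enrolment (
    EnrollNum  VARCHAR(10) PRIMARY KEY,
    Semester   VARCHAR(10),
    AcadYear   VARCHAR(10),
    Course     VARCHAR(10),
    CourseName VARCHAR(100)   -- still transitively dependent: EnrollNum → Course → CourseName
);

CREATE TABLE Result (
    MatrixNum VARCHAR(10) REFERENCES Student(MatrixNum),
    EnrollNum VARCHAR(10) REFERENCES Enrolment(EnrollNum),
    Result    VARCHAR(5),
    PRIMARY KEY (MatrixNum, EnrollNum)   -- Result depends on the pair of keys
);

-- 3NF: remove the transitive dependency by splitting Course/CourseName into a mapping table
CREATE TABLE Course (
    Course     VARCHAR(10) PRIMARY KEY,
    CourseName VARCHAR(100)
);
-- Enrolment then keeps only Course as a foreign key and drops CourseName.
```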
- Data Model & Data Modelling Notation
● Chen Notation
- Lecturer Comments: Very old and overly technical
● Crow’s Feet Notation
- Lecturer’s Comments: This is the industry standard. We will use this for purposes of the course as
well
● Unified Modelling Language (UML) Notation
- Lecturer Comments: Usually used by software developers
- Designing a Database (Phases of Database Design)
● Data Model: Plan/blueprint for a database design; More generalized and abstract than a database
design
● Phases of a Database Design:
- The actual implementation is (1) You come up w the data model, (2) You write SQL scripts to
represent the data model & create the tables, (3) You run the SQL scripts and insert the actual data
● Conceptual → You think of / come up w the tables/fields you require
● Logical → You determine the r/s between the different tables
● Physical → You come up w the details for the different fields/tables (e.g. set character limits/data types, etc.); see the SQL sketch below
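A minimal sketch of steps (2) and (3) above, where the physical design pins down data types and character limits; the Customer table and its fields are hypothetical:

```sql
-- Step (2): SQL script representing the physical design (types, sizes, constraints)
CREATE TABLE Customer (
    CustomerId   INT PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL,   -- character limit decided at the physical phase
    JoinedOn     DATE
);

-- Step (3): run the script, then insert the actual data
INSERT INTO Customer (CustomerId, CustomerName, JoinedOn)
VALUES (1, 'Alice Tan', '2024-01-15');
```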
- Entity Relationship Diagram (ER Diagram)
● Basically a structural diagram used in database design; contains entity and maps out the relationships
between entities
- Aspects of the ER Model
● Entity
- Basically the ‘table’ in a database
- Represented by a name and a rectangle, with its attributes listed in the body of
the rectangle
● Entity Attributes
- Refers to the property/characteristic of the entity/table
- For databases, it has a name and the data type/size of the attribute
● Primary Key
- Refers to a special entity attribute that uniquely defines a record in a table
- This means the values in the PK column/field must not repeat within the table
- Example: the id field of a given table
● Foreign Key
- Essentially a PK of another table, referenced to link the two tables
- A FK need not be unique in the table where it is not the PK (see the SQL sketch below)
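A short sketch of a PK/FK pair, using hypothetical Team and Player tables (this also previews the one-to-many cardinality example in the next section):

```sql
CREATE TABLE Team (
    TeamId   INT PRIMARY KEY,              -- PK: value cannot repeat within Team
    TeamName VARCHAR(100)
);

CREATE TABLE Player (
    PlayerId INT PRIMARY KEY,
    TeamId   INT REFERENCES Team(TeamId),  -- FK: the PK of Team; may repeat here (many players per team)
    Name     VARCHAR(100)
);
```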
- Concept of Cardinality for ER-Model
● ER Model relationships are classified by their cardinality
● Cardinality refers to the possible number of occurrences in one entity which is associated with the
number of occurrences in another
- E.g. ONE team has MANY players, we can then say Team has a one-to-many cardinality with Player
● Notations:
● Reading/Interpreting a ER Diagram/Relation
- Between Customer & Pizza, to establish the relationship from Customer → Pizza, we look at the crow’s feet notation attached to Pizza
- In this case, it’s 0/Many attached to Pizza → means a customer can order 0 pizzas or many pizzas
- For the converse, it’s also 0/Many attached to Customer → means a pizza can be ordered by 0 customers or many customers
- Weak Entity
● Defined as an entity that does not have attributes/columns that can identify its records uniquely
- I.e. no primary key of its own
- Example: Intermediate tables → where the table consists of the PKs of two other tables
- Database Design Misc Example
● In this example, COMPANY & PART tables can’t be joined to each other
- Solution is to create an intermediary table mapping CompanyName & PartNumber to allow both tables to join
- Lecturer: Would be good to create a PK in the COMPANY_PART_INT table for ease of reference (see the SQL sketch below)
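A sketch of the intermediary (junction) table, with the surrogate PK the lecturer suggests; column types are assumptions:

```sql
CREATE TABLE COMPANY (
    CompanyName VARCHAR(100) PRIMARY KEY
);

CREATE TABLE PART (
    PartNumber VARCHAR(20) PRIMARY KEY
);

-- Intermediary table holding the PKs of both tables, so COMPANY and PART can be joined through it
CREATE TABLE COMPANY_PART_INT (
    CompanyPartId INT PRIMARY KEY,                              -- surrogate PK for ease of reference
    CompanyName   VARCHAR(100) REFERENCES COMPANY(CompanyName),
    PartNumber    VARCHAR(20)  REFERENCES PART(PartNumber)
);
```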
- Database Property → ACID: Atomicity, Consistency, Isolation, Durability
● Properties that all transactions should possess
○ Atomicity
- Relates to the ‘all or nothing’ property
- The transaction must be an indivisible unit that is either performed in its entirety or not performed at all (see the transaction sketch below)
○ Consistency
- DB transaction must transform the database from one consistent state to another consistent
state
○ Isolation
- Transactions must be able to execute independently from one another
- I.e. One incomplete transaction must not affect another transaction
○ Durability
- Effects of a successful transaction must be permanently recorded in the database and not lost
in a subsequent failure
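A minimal transaction sketch illustrating the ‘all or nothing’ property; the Account table and amounts are hypothetical:

```sql
-- Transfer $100 between two accounts: either both updates happen, or neither does
BEGIN;

UPDATE Account SET Balance = Balance - 100 WHERE AccountId = 1;
UPDATE Account SET Balance = Balance + 100 WHERE AccountId = 2;

COMMIT;      -- durability: once committed, the change survives subsequent failures
-- ROLLBACK; -- would undo any partial work if something went wrong before the commit
```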
- Types of Data
● Transactional Data
- Refers to the data that is captured from transactions
- Example: Time of transaction, place, price, payment method employed, etc…
- Usually captured at point of sale through a POS system
● Analytical Data
- Transaction data that is transformed via calculations/analysis
● Master Data
- Refers to the actual critical business objects upon which transactions are performed
- Data Warehouse
● The general idea for data storage in Data Warehouse is to provide information & knowledge to support
decision making in your org
● Keeping the data in normalized/OLTP form is usually not ideal for analytics, as it can be computationally expensive to join data together to perform analysis
● Therefore, data in a Data Warehouse is often stored in a de-normalized form
- OLTP vs OLAP
- Dimensionality Modelling (Converting data from OLTP to OLAP)
● Basically, you are de-normalizing the data for high performance access (fewer joins)
● You can represent the denormalized data via 2 schemas:
○ Star Schema
- Structure that contains a fact table in the center (fact tables are tables that contain transactional data, e.g. Sales); see the SQL sketch below
- The fact table is surrounded by dimension tables containing reference data (dimension tables are tables that contain reference information, e.g. Store_id to Store Name)
○ Snowflake Schema
- Variant of the star schema where dimension tables do not contain denormalized data (they are further normalized)
● Terminology
- Dimension Tables: Tables connected to fact table containing reference/static data
- Attribute: Non-key fields in Dimension tables
- Fact Table: Central table in a dimensional model containing facts/transactional data
- Facts: Business measures/metrics
- Grain/Granularity: Level of detail/frequency at which data in Fact table is recorded
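A sketch of a small star schema for sales data; the table and column names and the daily grain are illustrative:

```sql
-- Dimension tables: reference/static data surrounding the fact table
CREATE TABLE DimStore (
    StoreId   INT PRIMARY KEY,
    StoreName VARCHAR(100),
    City      VARCHAR(50)
);

CREATE TABLE DimProduct (
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Category    VARCHAR(50)
);

-- Fact table: transactional data; grain = one row per store, product and day
CREATE TABLE FactSales (
    SaleDate  DATE,
    StoreId   INT REFERENCES DimStore(StoreId),
    ProductId INT REFERENCES DimProduct(ProductId),
    SalesAmt  DECIMAL(12, 2),   -- facts / business measures
    UnitsSold INT
);
```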
3. Data Integration and Interoperability
- Data Integration
● Process of bringing data from disparate sources together to provide users w a unified view
● Purpose: To make data more easily available and easier to consume by systems/end-users
● Benefits: Free-up resources, improve data quality, improved operational efficiency, can gain valuable
insight through data
- Data Integration Tools
- Set Theory for Data Joins (see the SQL sketch below)
1. Outer Join
2. Left/Right Join
3. Inner Join
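A sketch of the three join types against hypothetical Customer and Orders tables (FULL OUTER JOIN is standard SQL but not supported by every engine, e.g. MySQL):

```sql
-- Inner join: only customers that have at least one matching order
SELECT c.CustomerId, o.OrderId
FROM Customer c
INNER JOIN Orders o ON o.CustomerId = c.CustomerId;

-- Left join: every customer, with NULL order columns where no match exists
SELECT c.CustomerId, o.OrderId
FROM Customer c
LEFT JOIN Orders o ON o.CustomerId = c.CustomerId;

-- Full outer join: rows from both sides, NULLs filled in wherever either side is missing
SELECT c.CustomerId, o.OrderId
FROM Customer c
FULL OUTER JOIN Orders o ON o.CustomerId = c.CustomerId;
```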
- Data Acquisition and Extraction
● Data Acquisition
- Process of capturing, integrating, transforming, aggregating and loading the data into the data warehouse after assuring data quality
- This process is more inclusive/comprehensive than ETL (Extract Transform Load) / ELT (Extract Load Transform)
● ETL/ELT
○ ETL
○ ELT
● ETL vs ELT
- If you transform first, you are in practice fixing the schema in advance, and the stored data might be too inflexible for later use
- If you load first, the risk is that untransformed data piles up and becomes ‘rubbish’
● Data Warehouse
- Structured data is loaded into the data warehouse for analytical use
● Data Lake
- Data lake is the place where all sorts of data is stored
- Structured, textual, unstructured data
● Data Lakehouse
- Similar to Data Lake but with data management architecture baked in to index/cache all forms of data
stored in the data lake
- Data Consolidation
● Basically consolidating data from different silos to a single place
- Data Virtualization
● Bring all data from different sources/places to one platform
● One platform to access/combine/analyze the dataset; reduce access cost
- Data Federation
● A software/platform that allows multiple databases to function as one
● Data from multiple sources are combined into a common model; i.e. can query/join using a common
platform/schema
- Data Replication
● Data is intentionally stored in > 1 site/server
● Purpose is to allow data to be available in case of downtime/heavy traffic; idea of improving data
accessibility/uptime
- Data Harmonization
- Data Pipeline
- Data Engineering
- Data Fabric
4. Data Project Implementation
5. Data Governance - Data Quality, Security & Privacy
- Types of Data for Data Governance
- Data Governance Implementation: Sales Analytics
- Data Quality
● Quality is assessed with respect to the data’s fitness for the purpose it was intended for
- High quality means it accurately represents the real-world constructs it describes
- Bad data will result in low information quality; as it moves up the management hierarchy, it leads to bad business decisions
- Measuring/Assessing Data Quality - Data Quality Checks
1. Data Sampling
(i) Random
(ii) Sampling with Fixed Criteria
2. Data Profiling
- Process where data is examined & analyzed to generate summary statistics
- Purpose: Give an overview of the data so any discrepancies/risks/trends are spotted (see the SQL sketch below)
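A sketch of both checks against a hypothetical Sales table; the random-sampling syntax is the PostgreSQL/SQLite form (other engines use RAND() or TABLESAMPLE):

```sql
-- 1. Data sampling (random): pull a random subset of rows to inspect
SELECT *
FROM Sales
ORDER BY RANDOM()
LIMIT 100;

-- 2. Data profiling: summary statistics to surface discrepancies, risks and trends
SELECT COUNT(*)                                              AS row_count,
       COUNT(DISTINCT customer_id)                           AS distinct_customers,
       MIN(sale_amount)                                      AS min_amount,
       MAX(sale_amount)                                      AS max_amount,
       AVG(sale_amount)                                      AS avg_amount,
       SUM(CASE WHEN sale_amount IS NULL THEN 1 ELSE 0 END)  AS missing_amounts
FROM Sales;
```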
- Data Dictionary
● Specification/description of the data structures in a database/data model/data source
● Contains a list of entities/tables/datasets and their fields/columns/data elements
● Information may include: data type, description, relationships, aliases, constraints, sources, etc… (see the SQL sketch below)
● Data Catalog - Distinct from Data Dictionary - basically an inventory of the data objects in your organization
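Most relational databases expose their own data dictionary through system views; a sketch using the standard information_schema (the 'public' schema name is an assumption, adjust for your database):

```sql
-- Basic data dictionary: every table and column with its data type, length and nullability
SELECT table_name,
       column_name,
       data_type,
       character_maximum_length,
       is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'   -- assumed schema name
ORDER BY table_name, ordinal_position;
```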
- Data Mapping
● Definition: R/s between 2/more datasets and matching/connecting fields from one dataset to another
● Purpose: Link data fields across areas to create standardized accurate data
- Data Privacy: Data Confidentiality, Anonymization, Masking
1. Data Masking
- Definition: Technique that scrambles data to create an inauthentic copy for non-production purposes
- After masking, data retains the characteristics & integrity of production data
- Masked data usually used for analytics/training/testing
2. Data Redaction
- Definition: Data masking technique that replaces data with chosen redaction characters
- E.g. S9300000J → XXXXX000J
- Purpose: Used as a secrecy control/privacy control, usually used for hiding personal identifiable information
(PII)
3. Data Encryption
- Definition: Translate data into another form → Only people w access to a secret key/password can read it
4. Data Masking/Redaction vs Encryption
- Data Masking/Redaction is used more frequently as it allows the organization to maintain the usability of customer data; usually used as the standard solution for pseudonymisation (see the SQL sketch below)
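A sketch of redaction and simple masking in SQL, reusing the NRIC example above; the table/column names are hypothetical and the functions used (CONCAT, REPEAT, RIGHT, LENGTH) hold in MySQL/PostgreSQL:

```sql
-- Redaction: replace all but the last 4 characters with a redaction character
-- e.g. S9300000J -> XXXXX000J
SELECT CONCAT(REPEAT('X', LENGTH(nric) - 4), RIGHT(nric, 4)) AS redacted_nric
FROM Customer;

-- Masking for a non-production copy: overwrite real names with inauthentic but realistic values
UPDATE Customer_Staging
SET customer_name = CONCAT('Customer_', customer_id);
```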
- Aspects of Data Security
1. Data Access
- Authentication/authorization of access
- Data access is recorded and will be audited
- Data access must necessarily relate to location where data is stored; on-prem vs cloud
2. Data Classification (User Role)
3. Data Lineage
- Need to understand how changes upstream may affect downstream sources
- E.g. Upstream data source gets an update, downstream data might be impacted
4. Data Encryption
- Whether the data is encrypted at rest or encrypted while in-transit
- Data Classification
● Definition: Process of organizing information/data assets using an agreed-upon categorization logic
- Result usually is a large repository of metadata useful to make further decision/to facilitate use and
governance of data
- E.g. Can make decisions on the value/security/access rights/usage rights/privacy/storage
location/quality/retention period of the data
● Example - GDPR Classification Tags
- Data Lineage