Index
1) Foundations: Data, Data, Everywhere
   1.1) Module 1 – Introducing data analytics & analytical thinking
        Roadmap
        Google Data Analytics Certificate glossary
   1.2) Module 2 – The wonderful world of data
        Variations of the data life cycle
        Phases of data analysis
        Key data analyst tools
        Difference between Spreadsheets & Databases
2) Ask Questions to Make Data-Driven Decisions
   2.1) Module 1 – Ask effective questions
        Common problem types
        SMART questions
   2.2) Module 2 – Make data-driven decisions
        Qualitative and quantitative data in business
        Big and small data
   2.3) Module 3 – Work with spreadsheets
        Spreadsheet errors and fixes
        Pro Tip: Spotting errors in spreadsheets with conditional formatting
        Difference between formulas and functions
        Context can turn raw data into meaningful information
3) Prepare Data for Exploration
   3.1) Module 1 – Data types and structures
        Differentiate data types and structures
        Data modeling levels and techniques
        Use Boolean logic
        Transforming Data
   3.2) Module 2 – Data Responsibility
        Identify good data sources
        Essential data ethics
        Data anonymization
        The Open Data Debate
   3.3) Module 3 – Work with databases
        Metadata is as important as the data itself
        Metadata and Metadata Repositories
   3.4) Module 4 – Organize and protect data
        File organization guidelines
        Balancing Data Security and Analytics
4) Process Data from Dirty to Clean
   4.1) Module 1 – The importance of integrity
        More about data integrity and compliance
        Calculating Sample Size
        Proxy Data & Open Datasets
        Margin of Error
        Dirty Data
5) Analyze Data to Answer Questions
   5.3) Module 3 – VLOOKUP core concepts
        VLOOKUP and data aggregation
        How JOINs work
Useful Links
1) Foundations: Data, Data, Everywhere
1.1) Module 1 – Introducing data analytics & analytical thinking
Roadmap
1) Foundations
What you will learn:
Real-life roles and responsibilities of a junior data analyst
How businesses transform data into actionable insights
Spreadsheet basics
Database and query basics
Data visualization basics
Skill sets you will build:
Using data in everyday life
Thinking analytically
Applying tools from the data analytics toolkit
Showing trends and patterns with data visualizations
Ensuring your data analysis is fair
2) Ask
What you will learn:
How data analysts solve problems with data
The use of analytics for making data-driven decisions
Spreadsheet formulas and functions
Dashboard basics, including an introduction to Tableau
Data reporting basics
Skill sets you will build:
Asking SMART and effective questions
Structuring how you think
Summarizing data
Putting things into context
Managing team and stakeholder expectations
Problem-solving and conflict-resolution
3) Prepare
What you will learn:
How data is generated
Features of different data types, fields, and values
Database structures
The function of metadata in data analytics
Structured Query Language (SQL) functions
Skill sets you will build:
Ensuring ethical data analysis practices
Addressing issues of bias and credibility
Accessing databases and importing data
Writing simple queries
Organizing and protecting data
Connecting with the data community (optional)
4) Process
What you will learn:
Data integrity and the importance of clean data
The tools and processes used by data analysts to clean data
Data-cleaning verification and reports
Statistics, hypothesis testing, and margin of error
Resume building and interpretation of job postings (optional)
Skill sets you will build:
Connecting business objectives to data analysis
Identifying clean and dirty data
Cleaning small datasets using spreadsheet tools
Cleaning large datasets by writing SQL queries
Documenting data-cleaning processes
5) Analyze
What you will learn:
Steps data analysts take to organize data
How to combine data from multiple sources
Spreadsheet calculations and pivot tables
SQL calculations
Temporary tables
Data validation
Skill sets you will build:
Sorting data in spreadsheets and by writing SQL queries
Filtering data in spreadsheets and by writing SQL queries
Converting data
Formatting data
Substantiating data analysis processes
Seeking feedback and support from others during data analysis
6) Share
What you will learn:
Design thinking
How data analysts use visualizations to communicate about data
The benefits of Tableau for presenting data analysis findings
Data-driven storytelling
Dashboards and dashboard filters
Strategies for creating an effective data presentation
Skill sets you will build:
Creating visualizations and dashboards in Tableau
Addressing accessibility issues when communicating about data
Understanding the purpose of different business communication
tools
Telling a data-driven story
Presenting to others about data
Answering questions about data
7) Act
What you will learn:
Programming languages and environments
R packages
R functions, variables, data types, pipes, and vectors
R data frames
Bias and credibility in R
R visualization tools
R Markdown for documentation, creating structure, and emphasis
Skill sets you will build:
Coding in R
Writing functions in R
Accessing data in R
Cleaning data in R
Generating data visualizations in R
Reporting on data analysis to stakeholders
8) Capstone
What you will learn:
How a data analytics portfolio distinguishes you from other
candidates
Practical, real-world problem-solving
Strategies for extracting insights from data
Clear presentation of data findings
Motivation and ability to take initiative
Skill sets you will build:
Building a portfolio
Increasing your employability
Showcasing your data analytics knowledge, skill, and technical
expertise
Sharing your work during an interview
Communicating your unique value proposition to a potential
employer
Google Data Analytics Certificate glossary
1.2) Module 2 – The wonderful world of data
Variations of the data life cycle
There are six stages to the data life cycle. Here's a recap:
1. Plan: Decide what kind of data is needed, how it will be
managed, and who will be responsible for it.
2. Capture: Collect or bring in data from a variety of different sources.
3. Manage: Care for and maintain the data. This includes determining
how and where it is stored and the tools used to do so.
4. Analyze: Use the data to solve problems, make decisions, and
support business goals.
5. Archive: Keep relevant data stored for long-term and future
reference.
6. Destroy: Remove data from storage and delete any shared copies
of the data.
Note: Be careful not to confuse the six stages of the data life cycle (plan,
capture, manage, analyze, archive, and destroy) with the six phases of the
data analysis process (ask, prepare, process, analyze, share, and act).
They are not interchangeable.
One data management principle is universal: Govern how data is handled
so that it is accurate, secure, and available to meet your organization's
needs.
Phases of data analysis
1. Ask Phase: Understand stakeholder expectations, define the
problem, and decide on key questions to solve the problem. Focus
on who the stakeholders are, what they want, and how to
communicate with them.
2. Prepare Phase: Identify and locate relevant data. Ensure the data
is objective and unbiased to support fair decision-making.
3. Process Phase: Clean, transform, and combine datasets to
eliminate errors, inaccuracies, and outliers, ensuring the data is
accurate and complete.
4. Analyze Phase: Analyze the prepared data using tools like
spreadsheets, SQL, and programming languages (e.g., R) to turn
data into actionable insights.
5. Share Phase: Share findings with stakeholders using data
visualizations to communicate insights effectively, aiding data-
driven decision-making.
6. Act Phase: Apply insights by completing a case study project and
preparing for the job search, demonstrating your skills and adding
to your portfolio.
Key data analyst tools
Spreadsheets:
Used for collecting, organizing, sorting, and storing data
Helps identify patterns and structure data for specific projects
Enables creation of data visualizations (graphs, charts)
Popular tools: Microsoft Excel, Google Sheets
Databases and Query Languages:
Databases store structured data (e.g., MySQL, Microsoft SQL Server,
BigQuery)
Query languages (SQL) allow analysts to:
o Isolate specific data
o Understand database requests
o Select, create, add, or download data for analysis
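For instance, a basic SQL query to isolate specific data might look like the sketch below (the table and column names are made up for illustration):

-- Return only the customers who spent more than 100
SELECT customer_name, total_spend
FROM sales_data
WHERE total_spend > 100;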
Visualization Tools:
Turn complex data into understandable visuals (graphs, charts,
maps, etc.)
Help stakeholders make informed decisions and business strategies
Popular tools: Tableau, Looker
o Tableau: Drag-and-drop feature for creating interactive
graphs and dashboards
o Looker: Connects directly to databases for real-time data
visualization
Programming Languages:
R and Python used for statistical analysis, visualization, and
advanced data analysis
Key Takeaway: Data analysts have many tools to choose from, which will
be explored in depth throughout the program.
Difference between Spreadsheets & Databases
Spreadsheets are accessed through a software application; databases are accessed using a query language.
Spreadsheets structure data in a row-and-column format; databases structure data using rules and relationships.
Spreadsheets organize information in cells; databases organize information in complex collections.
Spreadsheets provide access to a limited amount of data; databases provide access to huge amounts of data.
Spreadsheets rely on manual data entry; databases enforce strict and consistent data entry.
Spreadsheets generally support one user at a time; databases support multiple users.
Spreadsheets are controlled by the user; databases are controlled by a database management system.
_________________________________________________________________________________
2) Ask Questions to Make Data-Driven
Decisions
2.1) Module 1 – Ask effective questions
Common problem types
Data analysts work with a variety of problems. Six common types include:
1. Making Predictions: Analysts use data on past ads (location,
media type, customer acquisition) to predict the best advertising
methods for reaching target audiences, though past data can't
guarantee future results.
2. Categorizing Things: Analysts classify customer service calls
using keywords or scores to identify top-performing representatives
or correlate actions with customer satisfaction.
3. Spotting Something Unusual: Analysts help develop software for
smartwatches to detect unusual health data patterns and set off
alarms when trends deviate from the norm.
4. Identifying Themes: Analysts group categories into broader
themes to prioritize product features for improvement, such as user
beliefs, practices, and needs in UX studies.
5. Discovering Connections: Analysts analyze shipping hub wait
times to determine schedule adjustments that improve on-time
deliveries in logistics.
6. Finding Patterns: Analysts examine maintenance data to discover
patterns, such as the impact of delayed maintenance on machine
failure.
Key Takeaway: Developing an analytical mindset helps in identifying
the type of problem and creating solutions that address the needs of
stakeholders.
SMART questions
SMART questions are Specific, Measurable, Action-oriented, Relevant, and Time-bound.
Note: Open-ended questions are recommended to gather more insightful
data, such as asking about the importance of four-wheel drive or the most
desired car features. This method helps ensure that responses directly
address the problem and provide valuable insights.
2.2) Module 2 – Make data-driven decisions
Qualitative and quantitative data in business
Key Takeaways:
Data analysts use both quantitative and qualitative data in their
work.
Quantitative data provides the "what" (e.g., attendance, profit,
showtimes).
Qualitative data provides the "why" (e.g., reasons behind
decisions or preferences).
Combining both types of data gives a fuller understanding of trends
and behaviors.
Example: Understanding why people prefer certain theaters, like
those with reclining chairs or unique offerings (e.g., root beer).
Qualitative insights can help make decisions, like purchasing more
recliners or adjusting showtimes.
Without qualitative data, analysts would miss key insights behind
the numbers.
Big and small data
Small data describes a dataset made up of specific metrics over a short, well-defined time period; big data describes large, less-specific datasets that cover a long time period.
Small data is usually organized and analyzed in spreadsheets; big data is usually kept in a database and queried.
Small data is likely to be used by small and midsize businesses; big data is likely to be used by large organizations.
Small data is simple to collect, store, manage, sort, and visually represent; big data takes a lot of effort to collect, store, manage, sort, and visually represent.
Small data is usually already a manageable size for analysis; big data usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making.
The three (or four) V words for big data:
Volume: Refers to the sheer amount of data.
Variety: Describes the different types of data.
Velocity: Refers to the speed at which data is processed.
Veracity (optional fourth V): Focuses on the quality and reliability of
the data.
These factors are crucial when working with large, complex datasets.
2.3) Module 3 - Work with spreadsheets
Spreadsheet errors and fixes
#DIV/0! – A formula is trying to divide a value in a cell by 0 (or by an empty cell with no value). Example: =B2/B3, when cell B3 contains the value 0.
#ERROR! – (Google Sheets only) Something can't be interpreted as it has been input; this is also known as a parsing error. Example: =COUNT(B1:D1 C1:C10) is invalid because the cell ranges aren't separated by a comma.
#N/A – A formula can't find the data. Example: the cell being referenced can't be found.
#NAME? – The name of a formula or function used isn't recognized. Example: the name of a function is misspelled.
#NUM! – The spreadsheet can't perform a formula calculation because a cell has an invalid numeric value. Example: =DATEDIF(A4, B4, "M") is unable to calculate the number of months between two dates because the date in cell A4 falls after the date in cell B4.
#REF! – A formula is referencing a cell that isn't valid. Example: a cell used in a formula was in a column that was deleted.
#VALUE! – A general error indicating a problem with a formula or with referenced cells. Example: there could be problems with spaces or text, or with referenced cells in a formula; you may have additional work to find the source of the problem.
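Many of these errors can also be handled gracefully by wrapping a formula in the IFERROR function, which is available in both Excel and Google Sheets (cell references here are illustrative): =IFERROR(B2/B3, "Check divisor") returns the fallback text instead of #DIV/0! when B3 is 0 or empty.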
Pro Tip: Spotting errors in spreadsheets with conditional
formatting
Conditional formatting can be used to highlight cells a different color
based on their contents. This feature can be extremely helpful when you
want to locate all errors in a large spreadsheet.
Conditional formatting in Microsoft Excel
To set up conditional formatting in Microsoft Excel to highlight all cells in a
spreadsheet that contain errors, do the following:
1. Click the gray triangle above row number 1 and to the left of
Column A to select all cells in the spreadsheet.
2. From the main menu, click Home, and then click Conditional
Formatting to select Highlight Cell Rules > More Rules.
3. For Select a Rule Type, choose Use a formula to determine
which cells to format.
4. For Format values where this formula is true, enter =ISERROR(A1).
5. Click the Format button, select the Fill tab, select yellow (or any
other color), and then click OK.
6. Click OK to close the format rule window.
To remove conditional formatting, click Home and select Conditional
Formatting, and then click Manage Rules. Locate the format rule in the
list, click Delete Rule, and then click OK.
Conditional formatting in Google Sheets
To set up conditional formatting in Google Sheets to highlight all cells in a
spreadsheet that contain errors, do the following:
1. Click the empty rectangle above row number 1 and to the left of
Column A to select all cells in the spreadsheet. In the Step-by-step
in spreadsheets video, this was called the Select All button.
2. From the main menu, click Format and select Conditional
Formatting to open the Conditional format rules pane on the right.
3. While in the Single Color tab, under Format rules, use the drop-
down to select Custom formula is, enter =ISERROR(A1), select
yellow (or any other color) for the formatting style, and then click
Done.
To remove conditional formatting, click Format and select Conditional
Formatting, and then click the Trash icon for the format rule.
Difference between formulas and functions
A formula is a set of instructions used to perform a calculation using
the data in a spreadsheet.
A function is a preset command that automatically performs a
specific process or task using the data in a spreadsheet.
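For example (cell references are illustrative): =A2+A3+A4 is a formula you write yourself, while =SUM(A2:A4) uses the preset SUM function to perform the same calculation.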
Context can turn raw data into meaningful information.
It is very important for data analysts to contextualize their data. This
means giving the data perspective by defining it. To do this, you need to
identify:
Who: The person or organization that created, collected, and/or
funded the data collection
What: The things in the world that data could have an impact on
Where: The origin of the data
When: The time when the data was created or collected
Why: The motivation behind the creation or collection
How: The method used to create or collect it
_________________________________________________________________________________
3) Prepare Data for Exploration
3.1) Module 1 - Data types and structures
Differentiate data types and structures
Primary vs. Secondary Data
Primary Data: Collected firsthand by a researcher (e.g., interview
data, survey responses, questionnaire data).
Secondary Data: Gathered by others or from other research (e.g.,
purchased customer profiles, demographic data, census data).
Internal vs. External Data
Internal Data: Stored within a company’s own systems (e.g.,
employee wages, sales data, product inventory levels).
External Data: Stored outside the company or organization (e.g.,
national wage averages, customer credit reports).
Continuous vs. Discrete Data
Continuous Data: Measured with almost any numeric value (e.g.,
height, video runtime, temperature).
Discrete Data: Counted and has limited values (e.g., daily hospital
visitors, room capacity, tickets sold).
Qualitative vs. Quantitative Data
Qualitative Data: Subjective and explanatory (e.g., favorite
exercise activity, customer service brands, fashion preferences).
Quantitative Data: Objective and measurable (e.g., percentage of
women doctors, elephant population, distance to Mars).
Nominal vs. Ordinal Data
Nominal Data: Categorized without order (e.g., customer type, job
applicant type, property status).
Ordinal Data: Categorized with a set order or scale (e.g., movie
ratings, ranked-choice voting, satisfaction levels).
Structured vs. Unstructured Data
Structured Data: Organized in rows and columns (e.g., expense
reports, tax returns, inventory).
Unstructured Data: Cannot be organized in a relational database
(e.g., social media posts, emails, videos).
Data modeling levels and techniques
Data Modeling: The process of creating visual diagrams to
represent how data is organized and structured.
Analogy: Like a house blueprint—different users (analysts,
engineers, etc.) use the model to understand the data structure.
Purpose: Helps stakeholders understand and use data efficiently.
Levels of Data Modeling
Conceptual Data Modeling:
o High-level view of data structure.
o Focuses on how data interacts across the organization.
o Does not include technical details.
o Example: Defines business requirements for a new database.
Logical Data Modeling:
o Focuses on technical details (e.g., relationships, attributes,
and entities).
o Defines how records are uniquely identified but not the exact
table names.
Physical Data Modeling:
o Focuses on how the database operates.
o Includes specific details like table names, column names, and
data types.
Data Modeling Techniques
Entity Relationship Diagram (ERD): Visualizes relationships
between entities in a data model.
Unified Modeling Language (UML): Detailed diagram that
describes system entities, attributes, operations, and relationships.
Note for Junior Analysts: You'll likely use your organization's
existing modeling technique, but understanding the various
methods is important.
Data Modeling and Data Analysis
Data Modeling's Role: Helps explore high-level data details and
relationships within an organization’s systems.
Data Analysis: Often required to understand how data is structured
and to create effective data models.
Collaboration: Data models help facilitate understanding and
collaboration across teams, enhancing overall communication.
Use Boolean logic
Boolean Logic: A system of logic used in data analysis and programming
to filter results based on conditions.
Operators:
AND: Both conditions must be true for the result to be true.
OR: At least one condition must be true for the result to be true.
NOT: Excludes a specific condition, making the result true only
when the condition is false.
Truth Tables: Used to show how conditions work together in Boolean
logic.
Power of Multiple Conditions: Combine different conditions using
operators and parentheses to filter results, e.g., IF ((Color = "Grey") OR
(Color = "Pink")) AND (Waterproof="True").
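The same condition written as a SQL filter might look like the following sketch (the products table and its columns are hypothetical):

-- Boolean operators combined with parentheses in a WHERE clause
SELECT item_name, color, waterproof
FROM products
WHERE (color = 'Grey' OR color = 'Pink')
  AND waterproof = TRUE;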
Key Takeaways:
Boolean operators help filter data by combining multiple conditions.
Used widely in data queries and programming for more refined
results.
Transforming Data
Why Transform Data?:
Data Organization: Improves usability
Data Compatibility: Enables use across different systems
Data Migration: Ensures compatibility when moving data between
systems
Data Merging: Combines data from multiple sources
Data Enhancement: Adds more detailed fields
Data Comparison: Allows apples-to-apples comparisons
Long Data: Each row represents a single data point for a specific item,
often with multiple rows for the same item over time or across different
variables (e.g., multiple stock prices for a company on different dates).
Wide Data: Each row contains multiple data points for a specific item,
where each variable (e.g., stock prices for different companies) is
represented as a separate column. This format is often used for easier
comparison across items.
Wide Data is preferred for:
Creating tables and charts with few variables per subject
Comparing straightforward line graphs
Long Data is preferred for:
Storing many variables about each subject (e.g., years of interest
rates)
Performing advanced statistical analysis or graphing
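As a sketch of what reshaping data can look like, the query below pivots a hypothetical long-format stock_prices table (columns: date, company, price) into wide format with one column per company:

-- Long to wide: one row per date, one price column per company
SELECT
  date,
  MAX(CASE WHEN company = 'Company A' THEN price END) AS price_company_a,
  MAX(CASE WHEN company = 'Company B' THEN price END) AS price_company_b
FROM stock_prices
GROUP BY date;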
3.2) Module 2 - Data Responsibility
Identify good data sources
The ROCCC Method for Identifying Good Data:
o R – Reliable: Trustworthy, accurate, complete, unbiased, and
vetted data.
o O – Original: Always validate data with the original source,
even if discovered via a third party.
o C – Comprehensive: Data should contain all critical
information needed for analysis, like a full company review,
before making decisions.
o C – Current: The data should be up-to-date and relevant to
the task (e.g., not using outdated client lists).
o C – Cited: Ensure data is credible by checking the original
sources and citations.
Three Key Questions to Evaluate Data:
o Who created the data set?
o Is the source part of a credible organization?
o When was the data last refreshed?
Reliable Data Sources: Public data sets, academic papers, financial
data, and governmental agency data are great places to find vetted,
reliable data.
Essential data ethics
Ethics Definition: A set of principles to live by, helping individuals
navigate moral decisions in life. As people grow older, their personal
code of ethics becomes more rational and helps them face challenges
and opportunities.
Data Ethics: An extension of general ethics focused on data, guiding
how data is collected, shared, and used. It addresses issues like
privacy, transparency, and accountability.
Key Aspects of Data Ethics (6 main topics):
1. Ownership: Individuals own the raw data they provide, not
the organizations that collect and process it. Individuals have
control over how their data is used and shared.
2. Transaction Transparency: Data processing and algorithms
should be understandable and explainable to the individuals
providing the data. This ensures fairness and helps avoid
biased results.
3. Consent: Individuals should have clear, explicit information
about how their data will be used before agreeing to provide
it. Consent should prevent unfair targeting, especially for
marginalized groups.
4. Currency: Individuals should be informed about financial
transactions resulting from their data, including the scale and
purpose of these transactions, and should be able to opt out.
5. Privacy and Openness: These aspects are essential and are
explored in more depth later in the course.
Data anonymization
What is Data Anonymization? : The process of protecting private or
sensitive data by removing personally identifiable information (PII).
This is done by techniques like blanking, hashing, or masking personal
data.
Methods:
o Blanking: Removing certain parts of the data.
o Hashing: Converting information into a fixed-length code.
o Masking: Altering the values of the data.
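A rough sketch of what these techniques can look like in SQL (the customers table and its columns are hypothetical; SHA256 and TO_HEX are BigQuery functions):

SELECT
  customer_id,
  TO_HEX(SHA256(email)) AS email_hash,                  -- hashing
  CONCAT('XXX-XX-', SUBSTR(ssn, 8, 4)) AS ssn_masked,   -- masking
  NULL AS phone_number                                  -- blanking
FROM customers;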
Sensitive Data: Healthcare and financial data are some of the most
sensitive and require anonymization techniques, such as de-identification
(removal of PII).
Common Types of Data That Should Be Anonymized:
Telephone numbers, Names, License plates and license numbers, Social
Security numbers, IP addresses. Medical records, Email addresses,
Photographs, Account numbers
Importance of Anonymization:
Protecting privacy and ensuring safety by keeping sensitive data
secure.
Prevents data from being used to track or identify individuals, thus
preserving privacy.
The Open Data Debate
Open data refers to data that is freely available, accessible, and
shareable, allowing for reuse and redistribution.
To be considered open data, it must meet three key criteria:
o Available and Accessible: The data must be publicly
available as a complete dataset.
o Reusability: The data must be provided under terms that
allow it to be reused and redistributed.
o Universal Participation: Anyone can use, reuse, and
redistribute the data.
The Open Data Debate: What Data Should Be Publicly Available?
Benefits of Open Data:
o Wider usage of credible databases can drive scientific
collaboration, research, and decision-making.
o Combining open data with other datasets can expand
analytical capacity.
Concerns About Privacy:
o Open data must balance the benefits of sharing with the
protection of individual privacy.
o Third-party Data: Collected by entities that do not have a
direct relationship with the individuals the data describes. For
example, third parties might track website visitors to create
audience profiles for targeted advertising.
o Personally Identifiable Information (PII): Data that can
identify individuals (e.g., addresses, social security numbers,
medical records). This data must be kept secure to protect
privacy.
Key Takeaway: While open data can benefit research and decision-
making, it's crucial to ensure the privacy of individuals is protected by
limiting exposure of PII.
3.3) Module 3 – Work with databases
Relational Databases:
Contain tables that can be connected through relationships.
Help data analysts organize and link data based on common
attributes.
Simplify data analysis by making data easier to search and use.
Non-Relational Databases: Group all variables together, making data
harder to analyze.
Normalization: A process of organizing data to reduce redundancy,
increase integrity, and simplify complexity in databases. Involves creating
tables and establishing relationships between them.
Key Elements of Relational Databases:
Primary Key: A unique identifier for each record in a table (e.g.,
customer_id).
Foreign Key: A field in one table that links to the primary key of
another table.
Composite Key: A primary key made from multiple columns (e.g.,
customer_id and location_id together).
The primary key in the Books Table is book_id.
The primary key in the Borrowers Table is borrower_id.
The composite key in the Loans Table is a combination of
borrower_id and book_id.
Both borrower_id and book_id in the Loans Table are foreign keys
linking to the respective primary keys in the Borrowers Table and
Books Table.
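A minimal sketch of the library example as table definitions (generic SQL; column names beyond the keys are assumed):

CREATE TABLE borrowers (
  borrower_id INT PRIMARY KEY,
  borrower_name VARCHAR(100)
);

CREATE TABLE books (
  book_id INT PRIMARY KEY,
  title VARCHAR(200)
);

CREATE TABLE loans (
  borrower_id INT,
  book_id INT,
  loan_date DATE,
  PRIMARY KEY (borrower_id, book_id),                            -- composite key
  FOREIGN KEY (borrower_id) REFERENCES borrowers (borrower_id),  -- foreign keys
  FOREIGN KEY (book_id) REFERENCES books (book_id)
);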
Tables in Relational Databases: Can include tables like Customer,
Revenue, Branch, Date, and Product.
SQL (Structured Query Language): A tool used by data analysts to
communicate with and query relational databases. Enables users to
extract specific data from related tables in a database.
Metadata is as important as the data itself
What is Metadata? Metadata is data about data. It provides context,
structure, and details about a data file, helping data analysts understand
the content and usage of the data. It answers the who, what, when,
where, why, and how of data.
Elements of Metadata:
File Type: What type of file is it?
Date & Time: When was it created or modified? Who created it?
Title & Description: What is it? What does it contain?
Geolocation: Where was it created (e.g., photo)?
Tags & Categories: How is it indexed or described?
Modification Details: Who last modified it and when?
Access Permissions: Who can access or update it?
Examples of Metadata:
Photos: Metadata includes filename, date, time, geolocation, and
device info.
Emails: Metadata includes sender, recipient, subject line, date, and
time. Hidden metadata includes server names, IP addresses, and
software details.
Spreadsheets/Docs: Metadata includes title, author, creation date,
page count, and modification history.
Websites: Metadata includes site creator, page title, description,
and tags.
Books & Audiobooks: Metadata includes title, author, publisher,
and for audiobooks, narrator and length.
Key Takeaways:
Metadata is essential for organizing and understanding data.
It helps analysts interpret, reuse, and manage data effectively.
Accurate metadata management ensures better data retrieval,
usage, and preservation.
Metadata and Metadata Repositories
Benefits of Metadata:
Reliability: Ensures data is accurate, precise, relevant, and timely, helping
analysts make reliable decisions.
Consistency: Metadata promotes uniformity, making data easier to
categorize, clean, store, and access. It ensures data can be compared and
analyzed across different sources.
Metadata Repositories:
Specialized databases that store and manage metadata, either physically
or in the cloud. Help analysts quickly access metadata without manually
searching each file. Describe the source, location, structure, and access
logs of data, making it easier to analyze multiple data sources together.
Metadata of External Databases:
Data analysts use second-party (data directly collected and sold by a
group) and third-party (data provided by external sources) data to gain
insights.
It’s important to confirm the reliability and accessibility of external data,
as well as obtain proper permissions.
Key Takeaways:
Metadata aids data-driven decision-making by ensuring data
reliability and consistency.
Metadata repositories store essential information about data
sources, helping analysts use data efficiently and correctly.
Kinds of metadata:
Descriptive Metadata: Describes the content to help identify and
locate data, such as a book's title, author, and keywords.
Structural Metadata: Defines how data is organized and related,
like database tables and file formats (e.g., CSV, JSON).
Administrative Metadata: Provides information on data creation,
management, and access, such as file creation date and access
permissions.
Statistical Metadata: Describes the statistical aspects of data
collection and processing, like survey sampling methods and
measurement units.
Reference Metadata: Explains the context and definitions of data,
such as a glossary of terms or variable definitions.
Provenance Metadata: Tracks the history and origin of data, like
the data source and transformation steps.
Geospatial Metadata: Provides information about the geographic
characteristics of data, such as GPS coordinates and map
projections.
3.4) Module 4 – Organize and protect data
File organization guidelines
Best Practices for Naming Files
Meaningful and Consistent Names: Use names that describe the
file’s contents and follow a consistent naming structure.
Include Key Information:
o Project Name: Clearly describe the project (e.g.,
SalesReport).
o Creation Date: Use the YYYYMMDD format to indicate when
the file was created (e.g., 20231125 for November 25, 2023).
o Revision Version: Include versioning with a clear numbering
system (e.g., v02).
Keep It Short and Simple: Make sure file names are concise and
easily readable.
File Name Example: SalesReport_20231125_v02.
File-Naming Guidelines
Avoid Special Characters: Don’t use spaces or special characters.
Use underscores or hyphens instead (e.g.,
SalesReport_2023_11_25_v02).
Consistent Style and Order: Maintain the same order (e.g.,
ProjectName_Date_Version) for all files.
Ensure Team Consistency: Create a sample text file outlining
naming conventions to help all team members follow the same
format.
File Organization
Create Logical Folders: Organize files in a hierarchical structure
with broader-topic folders at the top and specific files/subfolders
below.
Separate Completed and In-Progress Files: Store finished files
apart from ongoing ones to avoid confusion.
Archive Old Files: Move outdated files to a separate folder or
external storage to keep the workspace uncluttered.
Key Takeaways
Use consistent, meaningful file-naming conventions to make
data easier to find and organize.
Agree on a file-naming structure before starting the project.
Document naming conventions for easy reference by team
members.
Balancing Data Security and Analytics
Data Security: Protects data from unauthorized access and
corruption to keep sensitive information safe.
Analytics Needs: Analysts need timely access to data to make
meaningful observations, which requires a balance between security
and access.
Security Measures
Encryption: Alters data using a unique algorithm, making it
unreadable to unauthorized users. The algorithm key is used to
revert the data to its original form.
Tokenization: Replaces sensitive data with randomly generated
tokens. The original data is stored separately, and to access it, users
must have permission to use the tokenized data and the token
mapping.
Other Security Measures
Companies may also use additional tools like authentication devices
for AI technology or hire third-party security teams to manage
systems.
Role of Data Analysts
As a junior data analyst, you may not be responsible for
implementing security systems, but understanding their importance
and systems like encryption and tokenization is key.
Version Control
Version Control: Helps track changes in collaborative files, ensuring
no one accidentally overwrites others’ work. It’s essential for
effective teamwork, allowing analysts to experiment and track
progress without losing work.
Key Takeaways
Data security and accessibility need to be balanced.
Encryption and tokenization are standard security methods.
Version control is crucial for collaboration and preventing errors in
shared files.
4) Process Data from Dirty to Clean
4.1) Module 1 – The importance of integrity
More about data integrity and compliance
Data constraints and examples
As you progress in your data journey, you'll come across many types of
data constraints (or criteria that determine validity). The table below
offers definitions and examples of data constraint terms you might come
across.
Data type – Values must be of a certain type: date, number, percentage, Boolean, etc. Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range – Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory – Values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
Unique – Values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns – Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
Cross-field validation – Certain conditions for multiple fields must be satisfied. Example: values are percentages and values from multiple fields must add up to 100%.
Primary-key – (Databases only) Value must be unique per column. Example: a database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
Set-membership – (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
Foreign-key – (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table.
Accuracy – The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness – The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color, and both are collected, the data is complete.
Consistency – The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
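Several of these constraints can be expressed directly in a table definition; here is a sketch using a hypothetical subscribers table and a hypothetical states lookup table (generic SQL):

CREATE TABLE subscribers (
  subscriber_id INT PRIMARY KEY,                                  -- primary-key constraint
  age INT NOT NULL CHECK (age BETWEEN 18 AND 120),                -- mandatory + data range
  phone VARCHAR(12) UNIQUE CHECK (phone LIKE '___-___-____'),     -- unique + simplified pattern check
  opted_in VARCHAR(14) CHECK (opted_in IN ('Yes', 'No', 'Not Applicable')),  -- set-membership
  state_code CHAR(2) REFERENCES states (state_code)               -- foreign-key
);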
Calculating Sample Size
Key Terminology
Population: Entire group of interest (e.g., all employees in a
company).
Sample: Subset of the population (e.g., surveyed employees).
Margin of Error: Expected difference between sample and population
results. Smaller margin = closer alignment.
Confidence Level: Likelihood (e.g., 95%) that repeated studies yield
similar results.
Confidence Interval: Range of values (sample result ± margin of
error) where the population result likely falls.
Statistical Significance: Indicates if results are due to chance (higher
significance = less random).
Guidelines for Sample Size
Minimum 30: Based on the Central Limit Theorem (CLT), ensuring
sample averages approximate a normal distribution.
Common Confidence Levels: 95% (standard) or 90% (context-
dependent).
Increase Sample Size When:
o Higher confidence level required.
o Smaller margin of error needed.
o Greater statistical significance desired.
Sample Size Considerations
Business Problem Context:
o Example 1: Surveying 200 residents about a new library’s
design may suffice.
o Example 2: Voting intent on library funding may require a
larger sample for accuracy.
Accuracy vs. Cost:
o Larger samples = higher cost but greater precision (critical for
high-stakes studies, e.g., drug trials).
o Smaller samples = cost-effective for low-stakes decisions
(e.g., consumer preferences).
Tools & Best Practices
Sample Size Calculators: Input desired confidence level, margin of
error, and population size to determine optimal sample size.
Validate Representativeness: Ensure samples reflect population
diversity (e.g., geographic, demographic).
Key Takeaways
Always use a minimum sample size of 30 to leverage the CLT.
Adjust sample size based on confidence level, margin of error, and
significance needs.
Balance cost (time, resources) against accuracy requirements.
Use sample size calculators to streamline decisions.
Question undersized samples in critical studies (e.g., policy
decisions, medical research).
Example: Population of 200,000: A 200-person sample may work for
general sentiment but not for precise voting predictions.
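For reference, the formula most sample-size calculators use for a proportion (standard statistics, not specific to these notes) is n = (z² × p × (1 − p)) / E², where z is the z-score for the chosen confidence level (1.96 for 95%), p is the expected proportion (0.5 gives the most conservative estimate), and E is the margin of error. For example, at 95% confidence with a 5% margin of error: n = (1.96² × 0.25) / 0.05² ≈ 385.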
Proxy Data & Open Datasets
Proxy Data Examples
Business Scenarios & Proxy Use Cases:
o New car model sales projections: Use website clicks on
car specs as a proxy for early sales estimates.
o Plant-based meat demand forecast: Proxy historical sales
of tofu-based turkey substitutes.
o Tourism campaign impact: Use historical airline bookings
post-similar campaigns as a proxy.
o Vaccine contraindications: Use open trial data from
injection-version vaccines to estimate risks for a nasal
vaccine.
Open (Public) Datasets
Sources: Platforms like Kaggle host datasets in multiple formats:
o CSV: Credit card customer data (age, salary, credit limits).
o JSON: Trending YouTube video statistics.
o SQLite: U.S. wildfire records (24 years).
o BigQuery: Google Merchandise Store analytics.
Use Case Example: A clinic uses public trial data from an injected
vaccine to predict contraindications for a nasal version.
Key Takeaways
Proxy Data:
o Use when primary data is unavailable or too new (e.g.,
product launches, campaigns).
o Ensure proxy aligns closely with the target scenario (e.g.,
similar demographics, behavior).
Open Datasets:
o Leverage platforms like Kaggle for diverse, publicly available
datasets.
o Verify data quality: check for duplicates and interpret Null
values (could mean missing data or zero).
Cautions:
o Validate proxy relevance to avoid misleading conclusions.
o Clean datasets before analysis (address duplicates, missing
values).
Margin of Error
Margin of Error (MoE): The maximum expected difference between
sample results and the true population value. Defines a range (confidence
interval) where the population’s true average is likely to lie.
Examples
1. Baseball:
o A batter’s swing timing (e.g., missing a 90mph fastball by
10ms) illustrates the margin of error needed to hit the ball.
o MoE represents how close the swing timing is to the “ideal” for
success.
2. Marketing (A/B Testing):
o Testing two email subject lines:
   - Subject Line A: 5% open rate with a 2% MoE →
     confidence interval = 3%–7%.
   - Subject Line B: 3% open rate, which falls inside A's
     confidence interval (3%–7%).
o Conclusion: The difference between A and B is not
statistically significant because the ranges overlap.
Calculation Components
1. Confidence Level: Likelihood (e.g., 90%, 95%, 99%) that the
sample reflects the population.
2. Population Size: Total group being studied.
3. Sample Size: Subset of the population analyzed.
4. Margin of Error: Derived from the above using calculators
(e.g., Good Calculators, CheckMarket).
Key Takeaways
Purpose: Quantifies uncertainty in sample data to estimate
population trends.
Critical in:
o Surveys: Interpreting voter polls, market research.
o A/B Testing: Determining if differences in results are
meaningful.
Statistical Significance:
o Overlapping confidence intervals → No significant difference.
o Non-overlapping intervals → Likely significant difference.
Important Notes
Confidence Levels:
o 95%: Most common (balance between precision and cost).
o 99%: Used in high-stakes fields (e.g., pharmaceuticals).
Sample Size Impact: Larger samples reduce MoE (greater
accuracy).
Practical Use: Always report MoE with survey/test results to
contextualize findings.
Example Formula: MoE = z × √(p × (1 − p) / n), where z is the z-score for
the chosen confidence level, p is the observed proportion, and n is the
sample size.
(For quick results, use online calculators with population size, confidence
level, and sample data.)
Dirty Data
Dirty data is incomplete, incorrect, or irrelevant to the problem you’re
solving. It undermines analysis, decision-making, and business outcomes.
Types of Dirty Data
1. Duplicate Data
o Description: Records appearing multiple times.
o Causes: Manual entry errors, batch imports, data migration.
o Harm: Skewed metrics, inflated counts, confusion in
reporting.
2. Outdated Data
o Description: Old, unupdated information.
o Causes: Role/company changes, obsolete systems.
o Harm: Inaccurate insights, poor decision-making.
3. Incomplete Data
o Description: Missing critical fields (e.g., empty customer
addresses).
o Causes: Faulty data collection, entry errors.
o Harm: Reduced productivity, inability to deliver services.
4. Incorrect/Inaccurate Data
o Description: Complete but wrong (e.g., fake emails, typos).
o Causes: Human error, mock/fake data.
o Harm: Revenue loss, flawed strategies.
5. Inconsistent Data
o Description: Same data in different formats (e.g., "USA" vs.
"United States").
o Causes: Transfer errors, storage issues.
o Harm: Conflicting insights, customer segmentation failures.
Business Impact
Banking: 15–25% revenue loss due to inaccuracies.
Healthcare: 10–20% duplicate EHRs (electronic health records).
B2B Commerce: 25% database inaccuracies.
Marketing/Sales: 99% of companies prioritize data quality.
Key Takeaways
Dirty data leads to inaccurate insights, poor decisions,
and revenue loss.
Causes: Human error, system obsolescence, improper data
practices.
Mitigation: Implement data quality checks, automate cleaning, and
standardize processes.
Example: A hospital with duplicate EHRs risks misdiagnosis, while a bank
with outdated customer data may approve loans to ineligible applicants.
Pro Tip: Regular audits and validation protocols are critical to maintaining
clean data.
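As one concrete example of such a check, a duplicate scan in SQL might look like this (table and column names are hypothetical):

-- Find email addresses that appear more than once
SELECT email, COUNT(*) AS copies
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;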
5) Analyze Data to Answer Questions
5.3) Module 3 – VLOOKUP core concepts
VLOOKUP and data aggregation
Core Concept
VLOOKUP (Vertical Lookup) searches a column for a specific value
(search_key) and returns corresponding data from another column in
the same row.
Only the first match is returned (even if multiple matches exist).
Key Use Cases
1. Populating data: Example: A store manager uses VLOOKUP to
fetch product details (e.g., name, price) from a product ID.
2. Merging data: Example: A teacher combines attendance records
with grades by looking up student names.
Syntax
VLOOKUP(search_key, range, index, is_sorted)
1. search_key: Value to search for (text, number, or cell reference).
2. range: Cell range to search (first column must contain
the search_key).
o The column to return (index) must be to the right of the
search column.
3. index: Column number (within range) to return data from (e.g., 3 for
the 3rd column in the range).
4. is_sorted:
o FALSE: Exact match (recommended).
o TRUE: Approximate match (requires ascending sort).
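A quick usage example (the sheet layout is hypothetical): if products are listed in A2:C50 with IDs in column A and prices in column C, then =VLOOKUP(E2, A2:C50, 3, FALSE) returns the price for the product ID typed in E2, or #N/A if no exact match exists.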
Common Errors
#VALUE!: Invalid index (e.g., column number outside the range).
#N/A: No match found for search_key.
Key Takeaways
VLOOKUP is ideal for combining data or fetching details from
large datasets.
Limitations: Search column must be leftmost in the range. Only
returns the first match.
Use FALSE for exact matches to avoid errors.
How JOINs work
JOINs in SQL: Key Concepts
1. Purpose: Combine data from multiple tables using related columns
(e.g., department_id).
2. Dataset: Uses employees and departments tables in BigQuery.
JOIN Types Explained
INNER JOIN – Returns only rows with matches in both tables. Use case: find employees with assigned departments.
LEFT JOIN – Returns all left-table rows plus matches from the right table. Use case: list all employees (even those unassigned to departments).
RIGHT JOIN – Returns all right-table rows plus matches from the left table. Use case: list all departments (even if empty).
FULL OUTER JOIN – Returns all rows from both tables (matches plus unmatched). Use case: a comprehensive view of employees and departments.
Query Structure
SELECT
employees.name AS employee_name,
employees.role AS employee_role,
departments.name AS department_name
FROM
`project_id.employee_data.employees` AS employees
[JOIN_TYPE]
`project_id.employee_data.departments` AS departments
ON employees.department_id = departments.department_id
Replace [JOIN_TYPE] with INNER JOIN, LEFT JOIN, etc.
Use aliases (employees, departments) for readability.
Key Takeaways
INNER JOIN: Most common (matches only).
LEFT/RIGHT JOIN: Preserve all rows from one table.
FULL OUTER JOIN: Rare but useful for complete data audits.
Always specify the join condition (ON ...) to avoid Cartesian
products.
Useful Links
Understanding the data analytics project life cycle - https://pingax.com/Data%20Analyst/understanding-data-analytics-project-life-cycle/
Keyboard shortcuts in Excel - https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-excel-1798d9d5-842a-42b8-9c99-9b7213f0040f
Keyboard shortcuts for Google Sheets - https://support.google.com/docs/answer/181110
Explore public datasets - https://www.coursera.org/learn/data-preparation/supplement/8yrhM/explore-public-datasets
A Gentle Introduction to Statistical Power and Power Analysis in Python - https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/