Google Data Analytics Notes

Index
1) Foundations: Data, Data, Everywhere
   1.1) Module 1 – Introducing data analytics & analytical thinking
      Roadmap
      Google Data Analytics Certificate glossary
   1.2) Module 2 – The wonderful world of data
      Variations of the data life cycle
      Phases of data analysis
      Key data analyst tools
      Difference between spreadsheets & databases
2) Ask Questions to Make Data-Driven Decisions
   2.1) Module 1 – Ask effective questions
      Common problem types
      SMART questions
   2.2) Module 2 – Make data-driven decisions
      Qualitative and quantitative data in business
      Big and small data
   2.3) Module 3 – Work with spreadsheets
      Spreadsheet errors and fixes
      Pro Tip: Spotting errors in spreadsheets with conditional formatting
      Difference between formulas and functions
      Context can turn raw data into meaningful information
3) Prepare data for Exploration
   3.1) Module 1 – Data types and structures
      Differentiate data types and structures
      Data modeling levels and techniques
      Use Boolean logic
      Transforming data
   3.2) Module 2 – Data responsibility
      Identify good data sources
      Essential data ethics
      Data anonymization
      The open data debate
   3.3) Module 3 – Work with databases
      Metadata is as important as the data itself
      Metadata and metadata repositories
   3.4) Module 4 – Organize and protect data
4) Process data from dirty to clean
   4.1) Module 1 – The importance of integrity
5) Analyze data to answer questions
   5.1) Module 3 – VLOOKUP core concepts
Useful Links

1) Foundations: Data, Data, Everywhere


1.1) Module 1 – Introducing data analytics & analytical thinking
Roadmap
1) Foundation
What you will learn:

 Real-life roles and responsibilities of a junior data analyst

 How businesses transform data into actionable insights

 Spreadsheet basics

 Database and query basics

 Data visualization basics

Skill sets you will build:

 Using data in everyday life

 Thinking analytically

 Applying tools from the data analytics toolkit

 Showing trends and patterns with data visualizations

 Ensuring your data analysis is fair

2) Ask
What you will learn:

 How data analysts solve problems with data

 The use of analytics for making data-driven decisions

 Spreadsheet formulas and functions



 Dashboard basics, including an introduction to Tableau

 Data reporting basics

Skill sets you will build:

 Asking SMART and effective questions

 Structuring how you think

 Summarizing data

 Putting things into context

 Managing team and stakeholder expectations

 Problem-solving and conflict-resolution

3) Prepare
What you will learn:

 How data is generated

 Features of different data types, fields, and values

 Database structures

 The function of metadata in data analytics

 Structured Query Language (SQL) functions

Skill sets you will build:

 Ensuring ethical data analysis practices

 Addressing issues of bias and credibility

 Accessing databases and importing data

 Writing simple queries

 Organizing and protecting data

 Connecting with the data community (optional)

4) Process
What you will learn:

 Data integrity and the importance of clean data

 The tools and processes used by data analysts to clean data

 Data-cleaning verification and reports



 Statistics, hypothesis testing, and margin of error

 Resume building and interpretation of job postings (optional)

Skill sets you will build:

 Connecting business objectives to data analysis

 Identifying clean and dirty data

 Cleaning small datasets using spreadsheet tools

 Cleaning large datasets by writing SQL queries

 Documenting data-cleaning processes

5) Analyze
What you will learn:

 Steps data analysts take to organize data

 How to combine data from multiple sources

 Spreadsheet calculations and pivot tables

 SQL calculations

 Temporary tables

 Data validation

Skill sets you will build:

 Sorting data in spreadsheets and by writing SQL queries

 Filtering data in spreadsheets and by writing SQL queries

 Converting data

 Formatting data

 Substantiating data analysis processes

 Seeking feedback and support from others during data analysis

6) Share
What you will learn:

 Design thinking

 How data analysts use visualizations to communicate about data



 The benefits of Tableau for presenting data analysis findings

 Data-driven storytelling

 Dashboards and dashboard filters

 Strategies for creating an effective data presentation

Skill sets you will build:

 Creating visualizations and dashboards in Tableau

 Addressing accessibility issues when communicating about data

 Understanding the purpose of different business communication tools

 Telling a data-driven story

 Presenting to others about data

 Answering questions about data

7) Act
What you will learn:

 Programming languages and environments

 R packages

 R functions, variables, data types, pipes, and vectors

 R data frames

 Bias and credibility in R

 R visualization tools

 R Markdown for documentation, creating structure, and emphasis

Skill sets you will build:

 Coding in R

 Writing functions in R

 Accessing data in R

 Cleaning data in R

 Generating data visualizations in R

 Reporting on data analysis to stakeholders



8) Capstone
What you will learn:

 How a data analytics portfolio distinguishes you from other candidates

 Practical, real-world problem-solving

 Strategies for extracting insights from data

 Clear presentation of data findings

 Motivation and ability to take initiative

Skill sets you will build:

 Building a portfolio

 Increasing your employability

 Showcasing your data analytics knowledge, skill, and technical expertise

 Sharing your work during an interview

 Communicating your unique value proposition to a potential employer

Google Data Analytics Certificate glossary


(Saved locally as Google Data Analytics Certificate glossary.docx)

1.2) Module 2 – The wonderful world of data


Variations of the data life cycle
There are six stages to the data life cycle. Here's a recap:

1. Plan: Decide what kind of data is needed, how it will be managed, and who will be responsible for it.

2. Capture: Collect or bring in data from a variety of different sources.

3. Manage: Care for and maintain the data. This includes determining
how and where it is stored and the tools used to do so.

4. Analyze: Use the data to solve problems, make decisions, and support business goals.

5. Archive: Keep relevant data stored for long-term and future reference.

6. Destroy: Remove data from storage and delete any shared copies
of the data.

Note: Be careful not to confuse the six stages of the data life cycle (plan,
capture, manage, analyze, archive, and destroy) with the six phases of the
data analysis process (ask, prepare, process, analyze, share, and act).
They are not interchangeable.

One data management principle is universal: govern how data is handled so that it is accurate, secure, and available to meet your organization's needs.

Phases of data analysis

1. Ask Phase: Understand stakeholder expectations, define the problem, and decide on key questions to solve the problem. Focus on who the stakeholders are, what they want, and how to communicate with them.
2. Prepare Phase: Identify and locate relevant data. Ensure the data
is objective and unbiased to support fair decision-making.
3. Process Phase: Clean, transform, and combine datasets to
eliminate errors, inaccuracies, and outliers, ensuring the data is
accurate and complete.
4. Analyze Phase: Analyze the prepared data using tools like
spreadsheets, SQL, and programming languages (e.g., R) to turn
data into actionable insights.
5. Share Phase: Share findings with stakeholders using data
visualizations to communicate insights effectively, aiding data-
driven decision-making.
6. Act Phase: Apply insights by completing a case study project and
preparing for the job search, demonstrating your skills and adding
to your portfolio.

Key data analyst tools


Spreadsheets:

 Used for collecting, organizing, sorting, and storing data



 Helps identify patterns and structure data for specific projects

 Enables creation of data visualizations (graphs, charts)

 Popular tools: Microsoft Excel, Google Sheets

Databases and Query Languages:

 Databases store structured data (e.g., MySQL, Microsoft SQL Server, BigQuery)

 Query languages (SQL) allow analysts to:

o Isolate specific data

o Understand database requests

o Select, create, add, or download data for analysis

Visualization Tools:

 Turn complex data into understandable visuals (graphs, charts, maps, etc.)

 Help stakeholders make informed decisions and business strategies

 Popular tools: Tableau, Looker

o Tableau: Drag-and-drop feature for creating interactive graphs and dashboards

o Looker: Connects directly to databases for real-time data visualization

Programming Languages:

 R and Python are used for statistical analysis, visualization, and advanced data analysis

Key Takeaway: Data analysts have many tools to choose from, which will
be explored in depth throughout the program.

Difference between Spreadsheets & Databases


Spreadsheets | Databases
Accessed through a software application | Accessed using a query language
Structured data in a row and column format | Structured data using rules and relationships
Organizes information in cells | Organizes information in complex collections
Provides access to a limited amount of data | Provides access to huge amounts of data
Manual data entry | Strict and consistent data entry
Generally one user at a time | Multiple users
Controlled by the user | Controlled by a database management system

___________________________________________________________________________

2) Ask Questions to Make Data-Driven Decisions

2.1) Module 1 – Ask effective questions
Common problem types
Data analysts work with a variety of problems. Six common types include:

1. Making Predictions: Analysts use data on past ads (location, media type, customer acquisition) to predict the best advertising methods for reaching target audiences, though past data can't guarantee future results.
2. Categorizing Things: Analysts classify customer service calls
using keywords or scores to identify top-performing representatives
or correlate actions with customer satisfaction.
3. Spotting Something Unusual: Analysts help develop software for
smartwatches to detect unusual health data patterns and set off
alarms when trends deviate from the norm.
4. Identifying Themes: Analysts group categories into broader
themes to prioritize product features for improvement, such as user
beliefs, practices, and needs in UX studies.
5. Discovering Connections: Analysts analyze shipping hub wait
times to determine schedule adjustments that improve on-time
deliveries in logistics.
6. Finding Patterns: Analysts examine maintenance data to discover
patterns, such as the impact of delayed maintenance on machine
failure.

Key Takeaway: Developing an analytical mindset helps in identifying the type of problem and creating solutions that address the needs of stakeholders.

SMART questions

SMART questions are Specific, Measurable, Action-oriented, Relevant, and Time-bound.

Note: Open-ended questions are recommended to gather more insightful data, such as asking about the importance of four-wheel drive or the most desired car features. This method helps ensure that responses directly address the problem and provide valuable insights.

2.2) Module 2 – Make data-driven decisions


Qualitative and quantitative data in business
Key Takeaways:

 Data analysts use both quantitative and qualitative data in their work.
 Quantitative data provides the "what" (e.g., attendance, profit,
showtimes).
 Qualitative data provides the "why" (e.g., reasons behind
decisions or preferences).
 Combining both types of data gives a fuller understanding of trends
and behaviors.
 Example: Understanding why people prefer certain theaters, like
those with reclining chairs or unique offerings (e.g., root beer).

 Qualitative insights can help make decisions, like purchasing more recliners or adjusting showtimes.
 Without qualitative data, analysts would miss key insights behind
the numbers.

Big and small data

Small data | Big data
Describes a dataset made up of specific metrics over a short, well-defined time period | Describes large, less-specific datasets that cover a long time period
Usually organized and analyzed in spreadsheets | Usually kept in a database and queried
Likely to be used by small and midsize businesses | Likely to be used by large organizations
Simple to collect, store, manage, sort, and visually represent | Takes a lot of effort to collect, store, manage, sort, and visually represent
Usually already a manageable size for analysis | Usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making

The three (or four) V words for big data:

 Volume: Refers to the sheer amount of data.


 Variety: Describes the different types of data.
 Velocity: Refers to the speed at which data is processed.
 Veracity (optional fourth V): Focuses on the quality and reliability of
the data.
 These factors are crucial when working with large, complex datasets.

2.3) Module 3 - Work with spreadsheets


Spreadsheet errors and fixes
#DIV/0!: A formula is trying to divide a value in a cell by 0 (or by an empty cell with no value). Example: =B2/B3, when the cell B3 contains the value 0.

#ERROR!: (Google Sheets only) Something can't be interpreted as it has been input; also known as a parsing error. Example: =COUNT(B1:D1 C1:C10) is invalid because the cell ranges aren't separated by a comma.

#N/A: A formula can't find the data. Example: the cell being referenced can't be found.

#NAME?: The name of a formula or function used isn't recognized. Example: the name of a function is misspelled.

#NUM!: The spreadsheet can't perform a formula calculation because a cell has an invalid numeric value. Example: =DATEDIF(A4, B4, "M") is unable to calculate the number of months between two dates because the date in cell A4 falls after the date in cell B4.

#REF!: A formula is referencing a cell that isn't valid. Example: a cell used in a formula was in a column that was deleted.

#VALUE!: A general error indicating a problem with a formula or with referenced cells. Example: there could be problems with spaces or text, or with referenced cells in a formula; you may have additional work to find the source of the problem.
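
When an error is expected and harmless, both Excel and Google Sheets also let you wrap a formula in IFERROR to substitute a fallback value (a small illustration, not from the course):

=IFERROR(B2/B3, 0)

This returns 0 instead of #DIV/0! when B3 is empty or contains 0.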

Pro Tip: Spotting errors in spreadsheets with conditional formatting
Conditional formatting can be used to highlight cells a different color
based on their contents. This feature can be extremely helpful when you
want to locate all errors in a large spreadsheet.
Conditional formatting in Microsoft Excel
To set up conditional formatting in Microsoft Excel to highlight all cells in a
spreadsheet that contain errors, do the following:
1. Click the gray triangle above row number 1 and to the left of
Column A to select all cells in the spreadsheet.
2. From the main menu, click Home, and then click Conditional
Formatting to select Highlight Cell Rules > More Rules.
3. For Select a Rule Type, choose Use a formula to determine which cells to format.
4. For Format values where this formula is true, enter =ISERROR(A1).
5. Click the Format button, select the Fill tab, select yellow (or any
other color), and then click OK.
6. Click OK to close the format rule window.
To remove conditional formatting, click Home and select Conditional
Formatting, and then click Manage Rules. Locate the format rule in the
list, click Delete Rule, and then click OK.
Conditional formatting in Google Sheets
To set up conditional formatting in Google Sheets to highlight all cells in a
spreadsheet that contain errors, do the following:
1. Click the empty rectangle above row number 1 and to the left of
Column A to select all cells in the spreadsheet. In the Step-by-step
in spreadsheets video, this was called the Select All button.
2. From the main menu, click Format and select Conditional
Formatting to open the Conditional format rules pane on the right.
3. While in the Single Color tab, under Format rules, use the drop-
down to select Custom formula is, enter =ISERROR(A1), select
yellow (or any other color) for the formatting style, and then click
Done.
To remove conditional formatting, click Format and select Conditional
Formatting, and then click the Trash icon for the format rule.
Difference between formulas and functions
 A formula is a set of instructions used to perform a calculation using
the data in a spreadsheet.

 A function is a preset command that automatically performs a


specific process or task using the data in a spreadsheet.

Context can turn raw data into meaningful information.


It is very important for data analysts to contextualize their data. This
means giving the data perspective by defining it. To do this, you need to
identify:

 Who: The person or organization that created, collected, and/or funded the data collection

 What: The things in the world that data could have an impact on

 Where: The origin of the data

 When: The time when the data was created or collected



 Why: The motivation behind the creation or collection

 How: The method used to create or collect it

___________________________________________________________________________

3) Prepare data for Exploration


3.1) Module 1 - Data types and structures
Differentiate data types and structures
Primary vs. Secondary Data

 Primary Data: Collected firsthand by a researcher (e.g., interview data, survey responses, questionnaire data).

 Secondary Data: Gathered by others or from other research (e.g., purchased customer profiles, demographic data, census data).

Internal vs. External Data

 Internal Data: Stored within a company’s own systems (e.g., employee wages, sales data, product inventory levels).

 External Data: Stored outside the company or organization (e.g., national wage averages, customer credit reports).

Continuous vs. Discrete Data

 Continuous Data: Measured with almost any numeric value (e.g., height, video runtime, temperature).

 Discrete Data: Counted and has limited values (e.g., daily hospital
visitors, room capacity, tickets sold).

Qualitative vs. Quantitative Data

 Qualitative Data: Subjective and explanatory (e.g., favorite exercise activity, customer service brands, fashion preferences).

 Quantitative Data: Objective and measurable (e.g., percentage of women doctors, elephant population, distance to Mars).

Nominal vs. Ordinal Data

 Nominal Data: Categorized without order (e.g., customer type, job applicant type, property status).

 Ordinal Data: Categorized with a set order or scale (e.g., movie ratings, ranked-choice voting, satisfaction levels).

Structured vs. Unstructured Data

 Structured Data: Organized in rows and columns (e.g., expense reports, tax returns, inventory).

 Unstructured Data: Cannot be organized in a relational database (e.g., social media posts, emails, videos).

Data modeling levels and techniques


 Data Modeling: The process of creating visual diagrams to
represent how data is organized and structured.

 Analogy: Like a house blueprint, different users (analysts, engineers, etc.) use the model to understand the data structure.

 Purpose: Helps stakeholders understand and use data efficiently.

Levels of Data Modeling

 Conceptual Data Modeling:

o High-level view of data structure.

o Focuses on how data interacts across the organization.



o Does not include technical details.

o Example: Defines business requirements for a new database.

 Logical Data Modeling:

o Focuses on technical details (e.g., relationships, attributes, and entities).

o Defines how records are uniquely identified but not the exact
table names.

 Physical Data Modeling:

o Focuses on how the database operates.

o Includes specific details like table names, column names, and data types.

Data Modeling Techniques

 Entity Relationship Diagram (ERD): Visualizes relationships between entities in a data model.

 Unified Modeling Language (UML): Detailed diagram that describes system entities, attributes, operations, and relationships.

 Note for Junior Analysts: You'll likely use your organization's existing modeling technique, but understanding the various methods is important.

Data Modeling and Data Analysis

 Data Modeling's Role: Helps explore high-level data details and relationships within an organization’s systems.

 Data Analysis: Often required to understand how data is structured and to create effective data models.

 Collaboration: Data models help facilitate understanding and collaboration across teams, enhancing overall communication.

Use Boolean logic

Boolean Logic: A system of logic used in data analysis and programming to filter results based on conditions.

Operators:

 AND: Both conditions must be true for the result to be true.
 OR: At least one condition must be true for the result to be true.
 NOT: Excludes a specific condition, making the result true only when the condition is false.

Truth Tables: Used to show how conditions work together in Boolean logic.

Power of Multiple Conditions: Combine different conditions using operators and parentheses to filter results, e.g., IF ((Color = "Grey") OR (Color = "Pink")) AND (Waterproof = "True").

Key Takeaways:

 Boolean operators help filter data by combining multiple conditions.
 Used widely in data queries and programming for more refined results.
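
As a sketch, the same grey-or-pink waterproof filter could be written as a WHERE clause in SQL (the products table and its columns are hypothetical):

SELECT product_name, color, waterproof
FROM products
WHERE (color = 'Grey' OR color = 'Pink')  -- OR: either color qualifies
  AND waterproof = TRUE                   -- AND: must also be waterproof
  AND NOT discontinued;                   -- NOT: exclude discontinued items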

Transforming Data
Why Transform Data?:

 Data Organization: Improves usability
 Data Compatibility: Enables use across different systems
 Data Migration: Ensures compatibility when moving data between
systems
 Data Merging: Combines data from multiple sources
 Data Enhancement: Adds more detailed fields
 Data Comparison: Allows apples-to-apples comparisons

Long Data: Each row represents a single data point for a specific item,
often with multiple rows for the same item over time or across different
variables (e.g., multiple stock prices for a company on different dates).
Wide Data: Each row contains multiple data points for a specific item,
where each variable (e.g., stock prices for different companies) is
represented as a separate column. This format is often used for easier
comparison across items.
Wide Data is preferred for:
 Creating tables and charts with few variables per subject
 Comparing straightforward line graphs
Long Data is preferred for:

 Storing many variables about each subject (e.g., years of interest rates)
 Performing advanced statistical analysis or graphing
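
A quick illustration with made-up stock prices, first long, then wide:

Long format (one row per company per date):
date        company   price
2024-01-02  AlphaCo   10
2024-01-02  BetaCorp  20
2024-01-03  AlphaCo   11
2024-01-03  BetaCorp  21

Wide format (one row per date, one column per company):
date        AlphaCo   BetaCorp
2024-01-02  10        20
2024-01-03  11        21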

3.2) Module 2 - Data Responsibility


Identify good data sources
The ROCCC Method for Identifying Good Data:

o R – Reliable: Trustworthy, accurate, complete, unbiased, and vetted data.

o O – Original: Always validate data with the original source, even if discovered via a third party.

o C – Comprehensive: Data should contain all critical information needed for analysis, like a full company review, before making decisions.

o C – Current: The data should be up-to-date and relevant to the task (e.g., not using outdated client lists).

o C – Cited: Ensure data is credible by checking the original sources and citations.

Three Key Questions to Evaluate Data:

o Who created the data set?

o Is the source part of a credible organization?

o When was the data last refreshed?

Reliable Data Sources: Public data sets, academic papers, financial data, and governmental agency data are great places to find vetted, reliable data.

Essential data ethics


Ethics Definition: A set of principles to live by, helping individuals
navigate moral decisions in life. As people grow older, their personal
code of ethics becomes more rational and helps them face challenges
and opportunities.

Data Ethics: An extension of general ethics focused on data, guiding how data is collected, shared, and used. It addresses issues like privacy, transparency, and accountability.

Key Aspects of Data Ethics (6 main topics):



1. Ownership: Individuals own the raw data they provide, not the organizations that collect and process it. Individuals have control over how their data is used and shared.

2. Transaction Transparency: Data processing and algorithms should be understandable and explainable to the individuals providing the data. This ensures fairness and helps avoid biased results.

3. Consent: Individuals should have clear, explicit information about how their data will be used before agreeing to provide it. Consent should prevent unfair targeting, especially for marginalized groups.

4. Currency: Individuals should be informed about financial transactions resulting from their data, including the scale and purpose of these transactions, and should be able to opt out.

5. Privacy and Openness: These aspects are essential and will be explored further later in the video.

Data anonymization
What is Data Anonymization?: The process of protecting private or sensitive data by removing personally identifiable information (PII). This is done by techniques like blanking, hashing, or masking personal data.

Methods:

o Blanking: Removing certain parts of the data.

o Hashing: Converting information into a fixed-length code.

o Masking: Altering the values of the data.
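
A sketch of these methods in SQL, using BigQuery functions (the users table and its columns are hypothetical):

SELECT
  TO_HEX(SHA256(email)) AS email_hash,                       -- hashing: fixed-length code
  CONCAT(SUBSTR(phone, 1, 3), '-XXX-XXXX') AS phone_masked,  -- masking: alter the values
  NULL AS ssn                                                -- blanking: remove the data
FROM users;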

Sensitive Data: Healthcare and financial data are some of the most
sensitive and require anonymization techniques, such as de-identification
(removal of PII).

Common Types of Data That Should Be Anonymized:

Telephone numbers, names, license plates and license numbers, Social Security numbers, IP addresses, medical records, email addresses, photographs, account numbers.

Importance of Anonymization:

 Protecting privacy and ensuring safety by keeping sensitive data secure.

 Prevents data from being used to track or identify individuals, thus preserving privacy.

The Open Data Debate


Open data refers to data that is freely available, accessible, and
shareable, allowing for reuse and redistribution.

To be considered open data, it must meet three key criteria:

o Available and Accessible: The data must be publicly available as a complete dataset.

o Reusability: The data must be provided under terms that allow it to be reused and redistributed.

o Universal Participation: Anyone can use, reuse, and redistribute the data.

The Open Data Debate: What Data Should Be Publicly Available?

 Benefits of Open Data:

o Wider usage of credible databases can drive scientific collaboration, research, and decision-making.

o Combining open data with other datasets can expand analytical capacity.

 Concerns About Privacy:

o Open data must balance the benefits of sharing with the protection of individual privacy.

o Third-party Data: Collected by entities that do not have a direct relationship with the data. For example, third parties might track website visitors to create audience profiles for targeted advertising.

o Personal Identifiable Information (PII): Data that can identify individuals (e.g., addresses, social security numbers, medical records). This data must be kept secure to protect privacy.

Key Takeaway: While open data can benefit research and decision-
making, it's crucial to ensure the privacy of individuals is protected by
limiting exposure of PII.

3.3) Module 3 – Work with databases


Relational Databases:

 Contain tables that can be connected through relationships.

 Help data analysts organize and link data based on common attributes.

 Simplify data analysis by making data easier to search and use.

Non-Relational Databases: Group all variables together, making data harder to analyze.

Normalization: A process of organizing data to reduce redundancy, increase integrity, and simplify complexity in databases. It involves creating tables and establishing relationships between them.

Key Elements of Relational Databases:

 Primary Key: A unique identifier for each record in a table (e.g., customer_id).

 Foreign Key: A field in one table that links to the primary key of
another table.

 Composite Key: A primary key made from multiple columns (e.g., customer_id and location_id together).

Example (library database):

 The primary key in the Books Table is book_id.
 The primary key in the Borrowers Table is borrower_id.
 The composite key in the Loans Table is a combination of borrower_id and book_id.
 Both borrower_id and book_id in the Loans Table are foreign keys linking to the respective primary keys in the Borrowers Table and Books Table.
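
A sketch of how those keys could be declared in SQL (column names and types beyond the keys are assumptions):

CREATE TABLE books (
  book_id INT PRIMARY KEY,     -- primary key: unique per book
  title   VARCHAR(200)
);

CREATE TABLE borrowers (
  borrower_id INT PRIMARY KEY, -- primary key: unique per borrower
  name        VARCHAR(100)
);

CREATE TABLE loans (
  borrower_id INT REFERENCES borrowers(borrower_id), -- foreign key
  book_id     INT REFERENCES books(book_id),         -- foreign key
  loan_date   DATE,
  PRIMARY KEY (borrower_id, book_id)                 -- composite key
);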

Tables in Relational Databases: Can include tables like Customer, Revenue, Branch, Date, and Product.

SQL (Structured Query Language): A tool used by data analysts to communicate with and query relational databases. It enables users to extract specific data from related tables in a database.
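
For example, a minimal query against the library schema above (the borrower ID 42 is made up):

SELECT book_id, loan_date
FROM loans
WHERE borrower_id = 42;  -- every loan recorded for borrower 42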

Metadata is as important as the data itself


What is Metadata? Metadata is data about data. It provides context,
structure, and details about a data file, helping data analysts understand
the content and usage of the data. It answers the who, what, when,
where, why, and how of data.

Elements of Metadata:

 File Type: What type of file is it?

 Date & Time: When was it created or modified? Who created it?

 Title & Description: What is it? What does it contain?

 Geolocation: Where was it created (e.g., photo)?

 Tags & Categories: How is it indexed or described?

 Modification Details: Who last modified it and when?

 Access Permissions: Who can access or update it?

Examples of Metadata:

 Photos: Metadata includes filename, date, time, geolocation, and device info.

 Emails: Metadata includes sender, recipient, subject line, date, and time. Hidden metadata includes server names, IP addresses, and software details.

 Spreadsheets/Docs: Metadata includes title, author, creation date, page count, and modification history.

 Websites: Metadata includes site creator, page title, description, and tags.

 Books & Audiobooks: Metadata includes title, author, publisher, and for audiobooks, narrator and length.

Key Takeaways:

 Metadata is essential for organizing and understanding data.

 It helps analysts interpret, reuse, and manage data effectively.

 Accurate metadata management ensures better data retrieval, usage, and preservation.

Metadata and Metadata Repositories


Benefits of Metadata:

Reliability: Ensures data is accurate, precise, relevant, and timely, helping analysts make reliable decisions.

Consistency: Metadata promotes uniformity, making data easier to categorize, clean, store, and access. It ensures data can be compared and analyzed across different sources.

Metadata Repositories:

Specialized databases that store and manage metadata, either physically or in the cloud. They help analysts quickly access metadata without manually searching each file, and they describe the source, location, structure, and access logs of data, making it easier to analyze multiple data sources together.

Metadata of External Databases:

Data analysts use second-party data (data directly collected and sold by a group) and third-party data (data provided by external sources) to gain insights.

It’s important to confirm the reliability and accessibility of external data, as well as obtain proper permissions.

Key Takeaways:

 Metadata aids data-driven decision-making by ensuring data reliability and consistency.
 Metadata repositories store essential information about data
sources, helping analysts use data efficiently and correctly.

Kinds of metadata:

 Descriptive Metadata: Describes the content to help identify and locate data, such as a book's title, author, and keywords.
 Structural Metadata: Defines how data is organized and related,
like database tables and file formats (e.g., CSV, JSON).
 Administrative Metadata: Provides information on data creation,
management, and access, such as file creation date and access
permissions.
 Statistical Metadata: Describes the statistical aspects of data
collection and processing, like survey sampling methods and
measurement units.
 Reference Metadata: Explains the context and definitions of data,
such as a glossary of terms or variable definitions.
 Provenance Metadata: Tracks the history and origin of data, like
the data source and transformation steps.
 Geospatial Metadata: Provides information about the geographic
characteristics of data, such as GPS coordinates and map
projections.

3.4) Module 4 – Organize and protect data


File organization guidelines
Best Practices for Naming Files

 Meaningful and Consistent Names: Use names that describe the file’s contents and follow a consistent naming structure.

 Include Key Information:

o Project Name: Clearly describe the project (e.g., SalesReport).

o Creation Date: Use the YYYYMMDD format to indicate when the file was created (e.g., 20231125 for November 25, 2023).

o Revision Version: Include versioning with a clear numbering system (e.g., v02).

 Keep It Short and Simple: Make sure file names are concise and
easily readable.

 File Name Example: SalesReport_20231125_v02.

File-Naming Guidelines

 Avoid Special Characters: Don’t use spaces or special characters. Use underscores or hyphens instead (e.g., SalesReport_2023_11_25_v02).

 Consistent Style and Order: Maintain the same order (e.g., ProjectName_Date_Version) for all files.

 Ensure Team Consistency: Create a sample text file outlining naming conventions to help all team members follow the same format.

File Organization

 Create Logical Folders: Organize files in a hierarchical structure with broader-topic folders at the top and specific files/subfolders below.

 Separate Completed and In-Progress Files: Store finished files apart from ongoing ones to avoid confusion.

 Archive Old Files: Move outdated files to a separate folder or external storage to keep the workspace uncluttered.

Key Takeaways

 Use consistent, meaningful file-naming conventions to make data easier to find and organize.

 Agree on a file-naming structure before starting the project.

 Document naming conventions for easy reference by team members.

Balancing Data Security and Analytics


 Data Security: Protects data from unauthorized access and
corruption to keep sensitive information safe.

 Analytics Needs: Analysts need timely access to data to make meaningful observations, which requires a balance between security and access.

Security Measures

 Encryption: Alters data using a unique algorithm, making it unreadable to unauthorized users. The algorithm key is used to revert the data to its original form.

 Tokenization: Replaces sensitive data with randomly generated tokens. The original data is stored separately, and to access it, users must have permission to use the tokenized data and the token mapping.

Other Security Measures



 Companies may also use additional tools like authentication devices for AI technology or hire third-party security teams to manage systems.

Role of Data Analysts

 As a junior data analyst, you may not be responsible for implementing security systems, but understanding their importance and systems like encryption and tokenization is key.

Version Control

 Version Control: Helps track changes in collaborative files, ensuring no one accidentally overwrites others’ work. It’s essential for effective teamwork, allowing analysts to experiment and track progress without losing work.

Key Takeaways

 Data security and accessibility need to be balanced.

 Encryption and tokenization are standard security methods.

 Version control is crucial for collaboration and preventing errors in shared files.

4) Process data from dirty to clean


4.1) Module 1 – The importance of integrity
More about data integrity and compliance
Data constraints and examples

As you progress in your data journey, you'll come across many types of
data constraints (or criteria that determine validity). The table below
offers definitions and examples of data constraint terms you might come
across.

Data type: Values must be of a certain type (date, number, percentage, Boolean, etc.). Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.

Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.

Mandatory: Values can’t be left blank or empty. Example: if age is mandatory, that value must be filled in.

Unique: Values can’t have a duplicate. Example: two people can’t have the same mobile phone number within the same service area.

Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).

Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages, and values from multiple fields must add up to 100%.

Primary-key: (Databases only) Value must be unique per column. Example: a database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.

Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.

Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.

Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.

Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color, and both are collected, the data is complete.

Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
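
Several of these constraints map directly onto SQL column constraints; a minimal sketch (the donations table is hypothetical):

CREATE TABLE donations (
  donation_id INT PRIMARY KEY,                                 -- primary-key: unique per row
  donor_email VARCHAR(100) NOT NULL UNIQUE,                    -- mandatory + unique
  amount      DECIMAL(10,2) CHECK (amount BETWEEN 10 AND 20),  -- data range
  status      VARCHAR(20) CHECK (status IN ('Yes', 'No', 'Not Applicable'))  -- set-membership
);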

Calculating Sample Size


Key Terminology

 Population: Entire group of interest (e.g., all employees in a company).

 Sample: Subset of the population (e.g., surveyed employees).

 Margin of Error: Expected difference between sample and population results. Smaller margin = closer alignment.

 Confidence Level: Likelihood (e.g., 95%) that repeated studies yield similar results.

 Confidence Interval: Range of values (sample result ± margin of error) where the population result likely falls.

 Statistical Significance: Indicates if results are due to chance (higher significance = less random).

Guidelines for Sample Size

 Minimum 30: Based on the Central Limit Theorem (CLT), ensuring sample averages approximate a normal distribution.

 Common Confidence Levels: 95% (standard) or 90% (context-dependent).

 Increase Sample Size When:

o Higher confidence level required.

o Smaller margin of error needed.

o Greater statistical significance desired.
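
For a sense of what sample size calculators compute, a standard formula for the required sample size for a proportion (general statistics background, not from the course) is:

n = z^2 * p * (1 - p) / e^2

where z is the z-score for the chosen confidence level (1.96 for 95%), p is the expected proportion (0.5 when unknown), and e is the desired margin of error. At 95% confidence with a 5% margin of error: n = 1.96^2 * 0.25 / 0.05^2 ≈ 385.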

Sample Size Considerations



 Business Problem Context:

o Example 1: Surveying 200 residents about a new library’s design may suffice.

o Example 2: Voting intent on library funding may require a larger sample for accuracy.

 Accuracy vs. Cost:

o Larger samples = higher cost but greater precision (critical for high-stakes studies, e.g., drug trials).

o Smaller samples = cost-effective for low-stakes decisions (e.g., consumer preferences).

Tools & Best Practices

 Sample Size Calculators: Input desired confidence level, margin of error, and population size to determine optimal sample size.

 Validate Representativeness: Ensure samples reflect population diversity (e.g., geographic, demographic).

Key Takeaways

 Always use a minimum sample size of 30 to leverage the CLT.

 Adjust sample size based on confidence level, margin of error, and significance needs.

 Balance cost (time, resources) against accuracy requirements.

 Use sample size calculators to streamline decisions.

 Question undersized samples in critical studies (e.g., policy decisions, medical research).

Example: For a population of 200,000, a 200-person sample may work for general sentiment but not for precise voting predictions.

Proxy Data & Open Datasets


Proxy Data Examples

 Business Scenarios & Proxy Use Cases:

o New car model sales projections: Use website clicks on car specs as a proxy for early sales estimates.

o Plant-based meat demand forecast: Proxy historical sales of tofu-based turkey substitutes.

o Tourism campaign impact: Use historical airline bookings post-similar campaigns as a proxy.

o Vaccine contraindications: Use open trial data from injection-version vaccines to estimate risks for a nasal vaccine.

Open (Public) Datasets

 Sources: Platforms like Kaggle host datasets in multiple formats:

o CSV: Credit card customer data (age, salary, credit limits).

o JSON: Trending YouTube video statistics.

o SQLite: U.S. wildfire records (24 years).

o BigQuery: Google Merchandise Store analytics.

 Use Case Example: A clinic uses public trial data from an injected
vaccine to predict contraindications for a nasal version.

Key Takeaways

 Proxy Data:

o Use when primary data is unavailable or too new (e.g., product launches, campaigns).

o Ensure the proxy aligns closely with the target scenario (e.g., similar demographics, behavior).

 Open Datasets:

o Leverage platforms like Kaggle for diverse, publicly available datasets.

o Verify data quality: check for duplicates and interpret Null values (could mean missing data or zero).

 Cautions:

o Validate proxy relevance to avoid misleading conclusions.

o Clean datasets before analysis (address duplicates, missing values).

Margin of Error
Margin of Error (MoE): The maximum expected difference between
sample results and the true population value. Defines a range (confidence
interval) where the population’s true average is likely to lie.

Examples

1. Baseball:

o A batter’s swing timing (e.g., missing a 90mph fastball by 10ms) illustrates the margin of error needed to hit the ball.

o MoE represents how close the swing timing is to the “ideal” for
success.

2. Marketing (A/B Testing):

o Testing two email subject lines:

 Subject Line A: 5% open rate with a 2% MoE → confidence interval = 3%–7%.

 Subject Line B: 3% open rate → confidence interval overlaps with A (3%–7% vs. 3%).

o Conclusion: No statistical significance between A and B due to overlapping ranges.

Calculation Components

1. Confidence Level: Likelihood (e.g., 90%, 95%, 99%) that the sample reflects the population.

2. Population Size: Total group being studied.

3. Sample Size: Subset of the population analyzed.

4. Margin of Error: Derived from the above using calculators (e.g., Good Calculators, CheckMarket).

Key Takeaways

 Purpose: Quantifies uncertainty in sample data to estimate population trends.

 Critical in:

o Surveys: Interpreting voter polls, market research.

o A/B Testing: Determining if differences in results are meaningful.

 Statistical Significance:

o Overlapping confidence intervals → No significant difference.

o Non-overlapping intervals → Likely significant difference.

Important Notes

 Confidence Levels:

o 95%: Most common (balance between precision and cost).

o 99%: Used in high-stakes fields (e.g., pharmaceuticals).

 Sample Size Impact: Larger samples reduce MoE (greater accuracy).

 Practical Use: Always report MoE with survey/test results to contextualize findings.

Example Formula:
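
A standard form for a proportion (general statistics background, supplied here as a reference):

MoE = z * sqrt( p * (1 - p) / n )

where z is the z-score for the chosen confidence level (1.96 at 95%), p is the observed proportion, and n is the sample size.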

(For quick results, use online calculators with population size, confidence
level, and sample data.)

Dirty Data
Dirty data is incomplete, incorrect, or irrelevant to the problem you’re
solving. It undermines analysis, decision-making, and business outcomes.

Types of Dirty Data

1. Duplicate Data

o Description: Records appearing multiple times.

o Causes: Manual entry errors, batch imports, data migration.

o Harm: Skewed metrics, inflated counts, confusion in reporting.

2. Outdated Data

o Description: Old, unupdated information.

o Causes: Role/company changes, obsolete systems.

o Harm: Inaccurate insights, poor decision-making.

3. Incomplete Data

o Description: Missing critical fields (e.g., empty customer addresses).

o Causes: Faulty data collection, entry errors.

o Harm: Reduced productivity, inability to deliver services.

4. Incorrect/Inaccurate Data

o Description: Complete but wrong (e.g., fake emails, typos).

o Causes: Human error, mock/fake data.

o Harm: Revenue loss, flawed strategies.

5. Inconsistent Data

o Description: Same data in different formats (e.g., "USA" vs. "United States").

o Causes: Transfer errors, storage issues.

o Harm: Conflicting insights, customer segmentation failures.

Business Impact

 Banking: 15–25% revenue loss due to inaccuracies.

 Healthcare: 10–20% duplicate EHRs (electronic health records).

 B2B Commerce: 25% database inaccuracies.

 Marketing/Sales: 99% of companies prioritize data quality.

Key Takeaways

 Dirty data leads to inaccurate insights, poor decisions, and revenue loss.

 Causes: Human error, system obsolescence, improper data practices.

 Mitigation: Implement data quality checks, automate cleaning, and standardize processes.

Example: A hospital with duplicate EHRs risks misdiagnosis, while a bank with outdated customer data may approve loans to ineligible applicants.

Pro Tip: Regular audits and validation protocols are critical to maintaining
clean data.

5) Analyze data to answer questions


5.1) Module 3 – VLOOKUP core concepts
VLOOKUP and data aggregation
Core Concept

 VLOOKUP (Vertical Lookup) searches a column for a specific value (search_key) and returns corresponding data from another column in the same row.

 Only the first match is returned (even if multiple matches exist).

Key Use Cases

1. Populating data: Example: A store manager uses VLOOKUP to fetch product details (e.g., name, price) from a product ID.

2. Merging data: Example: A teacher combines attendance records with grades by looking up student names.

Syntax

VLOOKUP(search_key, range, index, is_sorted)

1. search_key: Value to search for (text, number, or cell reference).

2. range: Cell range to search (the first column must contain the search_key).

o The column to return (index) must be to the right of the search column.

3. index: Column number (within range) to return data from (e.g., 3 for
the 3rd column in the range).

4. is_sorted:

o FALSE: Exact match (recommended).

o TRUE: Approximate match (requires ascending sort).

Common Errors

 #VALUE!: Invalid index (e.g., column number outside the range).

 #N/A: No match found for search_key.

Key Takeaways

 VLOOKUP is ideal for combining data or fetching details from large datasets.

 Limitations: The search column must be leftmost in the range, and only the first match is returned.

 Use FALSE for exact matches to avoid errors.
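
A worked example, assuming a hypothetical Products sheet with IDs in column A, names in B, and prices in C:

=VLOOKUP(E2, Products!A:C, 3, FALSE)

This searches the first column of Products!A:C for the value in E2 and returns the matching price from the third column of the range; FALSE forces an exact match.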



How JOINs work


JOINs in SQL: Key Concepts

1. Purpose: Combine data from multiple tables using related columns (e.g., department_id).

2. Dataset: Uses employees and departments tables in BigQuery.

JOIN Types Explained

JOIN type | Behavior | Use case
INNER JOIN | Returns only rows with matches in both tables | Find employees with assigned departments
LEFT JOIN | Returns all left-table rows + matches from the right table | List all employees (even those unassigned to departments)
RIGHT JOIN | Returns all right-table rows + matches from the left table | List all departments (even if empty)
FULL OUTER JOIN | Returns all rows from both tables (matched + unmatched) | Comprehensive view of employees/departments

Query Structure

SELECT
  employees.name AS employee_name,
  employees.role AS employee_role,
  departments.name AS department_name
FROM
  `project_id.employee_data.employees` AS employees
[JOIN_TYPE]
  `project_id.employee_data.departments` AS departments
ON
  employees.department_id = departments.department_id

 Replace [JOIN_TYPE] with INNER JOIN, LEFT JOIN, etc.

 Use aliases (employees, departments) for readability.

Key Takeaways

 INNER JOIN: Most common (matches only).

 LEFT/RIGHT JOIN: Preserve all rows from one table.

 FULL OUTER JOIN: Rare but useful for complete data audits.

 Always specify the join condition (ON ...) to avoid Cartesian products.
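
Filling in the template, a LEFT JOIN that lists every employee (department_name is NULL where no department is assigned):

SELECT
  employees.name AS employee_name,
  departments.name AS department_name
FROM
  `project_id.employee_data.employees` AS employees
LEFT JOIN
  `project_id.employee_data.departments` AS departments
ON
  employees.department_id = departments.department_id;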

Useful Links
 Understanding the data analytics project life cycle: https://pingax.com/Data%20Analyst/understanding-data-analytics-project-life-cycle/
 Keyboard shortcuts in Excel: https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-excel-1798d9d5-842a-42b8-9c99-9b7213f0040f
 Keyboard shortcuts for Google Sheets: https://support.google.com/docs/answer/181110
 Explore public datasets: https://www.coursera.org/learn/data-preparation/supplement/8yrhM/explore-public-datasets
 A Gentle Introduction to Statistical Power and Power Analysis in Python: https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/
