BDA Notes-1
We live in a "data revolution" era where enormous amounts of data are generated daily
from activities like social media, online transactions, sports, healthcare, and
entertainment.
"Until a few years ago, there were five exabytes of information created from the dawn of
civilization through 2003. Now that much information is created every two days." — Eric
Schmidt, former Google executive chairman
Medical prescriptions
Retail bills
3. Analyzing Data
Definition: Using statistical and mathematical techniques to identify patterns, trends,
and relationships in the data.
Types of Analysis:
Example:
Correlation: "Users who watch more than 10 videos per week are 30% more likely to
make a purchase."
Visual Example:
| Sales ($) |
| | ■
| | ■ ■
| | ■ ■ ■
|___________|____________
Jan Feb Mar
Tools: Statistical models, data visualization (bar charts, pie charts), sometimes AI for
advanced predictions.
Methods:
Example:
Goal: Make insights understandable to someone who wasn't involved in the analysis and
enable real-world action.
| Step | Purpose | Example |
|------|---------|---------|
| Gathering Data | Collect relevant data from various sources | Downloading sales and web traffic logs |
| Cleaning & Organizing | Remove errors, standardize, group data | Fixing date formats, removing duplicates |
| Analyzing Data | Find patterns, trends, relationships | Calculating average sales per region |
| Interpreting Results | Present insights for decision-making | Creating a dashboard for managers |
Key Takeaways
Data analytics transforms raw data into actionable insights through a systematic
process.
Real-world examples include business sales analysis, sports performance tracking, and
social media engagement optimization.
Next Steps:
In future lessons, you will explore each stage in greater detail and gain hands-on experience
with real datasets and tools.
1. Business
How Data Analytics is Used:
Analyzing past performance (e.g., profits, sales trends)
Making informed strategic decisions (e.g., launching new products, setting targets)
Example:
A company reviews sales data from the last quarter to decide whether to launch a new
ad campaign. They analyze which products sold best and when, helping them choose
the optimal time for the campaign.
2. Manufacturing
How Data Analytics is Used:
Example:
A factory uses analytics to predict when a machine is likely to break down, allowing for
maintenance before a costly failure occurs.
3. Retail
How Data Analytics is Used:
Example:
Amazon uses customer purchase history and browsing patterns to recommend products
and optimize delivery routes.
4. Healthcare
How Data Analytics is Used:
Example:
During the COVID-19 pandemic, analytics was used to monitor and predict outbreaks,
helping hospitals prepare and allocate resources.
[Patient Data] → [Pattern Recognition] → [Outbreak Prediction] → [Resource
Allocation]
5. Education
How Data Analytics is Used:
Example:
Schools use platforms like Intellischool to visualize student performance, identify those
at risk, and tailor interventions.
6. Banking
How Data Analytics is Used:
Example:
Banks use predictive analytics to identify potentially fraudulent transactions in real-time,
reducing losses and increasing security.
Key Takeaways
Data analytics is integral across sectors, enabling better, data-driven decisions.
The process typically involves collecting data, analyzing it for patterns, and using those
insights for targeted actions.
Visual Summary: Data Analytics Across Sectors
Data analytics is the backbone of modern decision-making, transforming raw data into
actionable insights across every major sector of the economy.
Example Visual:
[Stone with tally marks] → [Track days for planting]
Ancient Civilizations:
John Graunt: Analyzed mortality data in London, foundational for public health statistics.
John Snow: Mapped cholera outbreaks in London, identifying contaminated water
sources.
Example Visuals:
[Map with dots showing cholera cases clustered around a water pump]
Example: The US Census Bureau used computers to process census data much faster
than was possible by hand.
Statistical Software: SPSS, SAS, and the spread of spreadsheets (Excel) democratized
data analysis.
Relational Databases & SQL: Efficient storage and retrieval of complex data.
6. The Digital Revolution: 1990s–2000s
Internet Era: Explosion in data collection, storage, and transmission. Data mining and
"big data" concepts emerged.
Big Data: Large, varied, and rapidly updating datasets. Tools like Hadoop and Spark
enabled distributed processing.
Generative AI: Tools like ChatGPT automate and simplify analytics, making advanced
techniques accessible without deep programming knowledge.
| Era | Example Visualizations | Impact |
|-----|------------------------|--------|
| 19th Century | Nightingale’s Rose Chart, Snow’s Map | Public health reforms, epidemiology |
Each era brought new tools and techniques, making data analysis faster, broader, and
more accessible.
The field continues to evolve rapidly—lifelong learning and adaptability are essential for
anyone in data analytics.
Next Steps:
Explore hands-on exercises with spreadsheets, databases, and AI tools to experience the
evolution of data analytics firsthand.
Description: Gathering data from various sources, which may include databases, APIs,
spreadsheets, or web scraping.
Example: A retail analyst collects sales data from the company’s database and customer
reviews from online platforms.
Description: Ensuring data is accurate, complete, and formatted for analysis. This
includes removing duplicates, correcting errors, and handling missing values.
Visual:
Raw Data --> [Remove Duplicates] --> [Fix Errors] --> Clean Data
Description: Structuring and merging data from different sources into a usable format.
Example: Combining sales and marketing data to analyze the impact of promotions on
sales.
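A minimal sketch of this kind of merge, using only the Python standard library. The monthly figures, the `month` key, and the `promotion` flag are made-up illustrations, not data from a real company.

```python
# Hypothetical monthly records from two separate sources.
sales = [
    {"month": "Jan", "revenue": 1200},
    {"month": "Feb", "revenue": 1500},
    {"month": "Mar", "revenue": 900},
]
marketing = [
    {"month": "Jan", "promotion": False},
    {"month": "Feb", "promotion": True},
    {"month": "Mar", "promotion": False},
]

# Index one source by the shared key, then attach its fields to the other.
promo_by_month = {row["month"]: row["promotion"] for row in marketing}
combined = [
    {**row, "promotion": promo_by_month.get(row["month"])}
    for row in sales
]


def average(values):
    return sum(values) / len(values)


with_promo = average([r["revenue"] for r in combined if r["promotion"]])
without_promo = average([r["revenue"] for r in combined if r["promotion"] is False])
print(f"With promotion: {with_promo}, without: {without_promo}")
```

With only three months of invented data, the gap between the two averages is an illustration of the mechanics, not evidence that promotions work.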
Example: Using regression analysis to predict next quarter’s sales based on historical
data.
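As a rough illustration, the sketch below fits a straight line to four quarters of sales and extends it by one quarter. The sales figures are invented, and a simple linear trend is an assumption; real forecasts usually need more data and a check that a line is a reasonable fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical quarterly sales (in thousands) for quarters 1 to 4.
quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_line(quarters, sales)
forecast_q5 = slope * 5 + intercept
print(f"Forecast for quarter 5: {forecast_q5:.1f}")  # 140.0
```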
Description: Keeping detailed records of analysis steps, decisions, and methods for
reproducibility and teamwork.
Example: Documenting how data was cleaned and which variables were used in a
model.
| Tool Type | Examples | Purpose |
|-----------|----------|---------|
| Visualization Tools | Tableau, Power BI, Matplotlib | Creating charts, dashboards, visual storytelling |
| Big Data Tools | Hadoop, Spark | Processing and analyzing very large datasets |
| AI/AutoML Tools | H2O AI, AutoML | Automated machine learning and modeling |
A. Technical Skills
B. Functional Skills
Analytical Thinking: Interpreting data, making sense of patterns, and drawing logical
conclusions.
C. Soft Skills
Curiosity: Asking the right questions and digging deeper into the data.
5. Types of Data Analytics
| Type | Purpose | Example Use Case |
|------|---------|------------------|
| Diagnostic | Explain why something happened | "Why did sales drop in March?" |
| Prescriptive | Suggest actions based on analysis | "Which marketing strategy should we use?" |
2. Clean Data: Remove duplicates, fix errors, and handle missing values.
4. Visualize Results: Create a dashboard showing sales trends by product and region.
Step Role/Responsibility Example Tool Key Skill
8. Real-World Examples
Retail: Predicting which products will be popular during the holiday season using past
sales data.
10. Conclusion
A data analyst is a detective for data, equipped with technical, analytical, and communication
skills. They play a crucial role in transforming raw data into actionable insights that drive
decisions in every industry—from retail and healthcare to finance and beyond.
A. Collecting Data
Description: Gather raw data from various sources (databases, surveys, sensors, web
logs).
Example: Netflix collects data on what users watch, when, on which device, and even
when they pause or stop a show.
B. Cleaning Data
Description: Remove duplicates, fix errors, and handle missing values to ensure
accuracy.
C. Analyzing Data
Example: A retail company analyzes quarterly sales data to identify peak buying times
and popular products.
D. Visualizing Data
Description: Present findings using charts, graphs, and dashboards to make insights
accessible and actionable.
Example: A health agency uses a map to show regions with high vaccination rates.
| Type | Purpose | Example Use Case |
|------|---------|------------------|
| Diagnostic | Explain why something happened | "Why did sales drop in March?" |
| Prescriptive | Suggest actions based on analysis | "Which marketing strategy should we use?" |
Factor Analysis: Reduces many variables into a few underlying factors (e.g., combining
multiple satisfaction measures into one score).
Cohort Analysis: Groups data into cohorts (e.g., customers who joined in the same
month) to study behaviors over time.
Cluster Analysis: Segments data into groups with similar characteristics (e.g., customer
segmentation for marketing).
Time-Series Analysis: Analyzes data points collected over time to identify trends and
forecast future values (e.g., predicting product demand).
Monte Carlo Simulations: Models the probability of different outcomes, often used for
risk assessment.
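To make the Monte Carlo idea concrete, the sketch below estimates the chance that a product loses money. Every number in it (the demand distribution, the margin range, the fixed cost) is an assumption chosen for illustration.

```python
import random


def simulate_profit(rng):
    """One random scenario: profit = units sold * margin - fixed cost."""
    units = rng.gauss(1000, 200)     # assumed demand: mean 1000, sd 200
    margin = rng.uniform(4.0, 6.0)   # assumed profit per unit
    fixed_cost = 4500.0              # assumed fixed cost
    return units * margin - fixed_cost


rng = random.Random(42)  # fixed seed so the run is repeatable
trials = 10_000
profits = [simulate_profit(rng) for _ in range(trials)]

# The share of simulated scenarios with a negative profit estimates the risk.
loss_probability = sum(p < 0 for p in profits) / trials
print(f"Estimated probability of a loss: {loss_probability:.1%}")
```

The estimate is only as good as the assumed input distributions; changing them changes the answer.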
Examples of Visualizations:
Line Chart: Showing sales trends over time.
Visual Example:
| Sales ($) |
| | ■
| | ■ ■
| | ■ ■
|___________|___________
Q1 Q2 Q3
6. Real-World Examples
Netflix: Uses data analytics to personalize viewing recommendations, driving over 75%
of viewer activity and increasing user engagement.
Retail: Analyzes sales data to optimize inventory, identify popular products, and plan
marketing campaigns.
Finance: Uses time-series and Monte Carlo simulations to forecast trends and manage
risk.
Step Description Example Visual Tool
8. Key Takeaways
Data analytics transforms raw data into actionable insights through a systematic
process.
Techniques like regression, clustering, and time-series analysis are widely used to extract
value from data.
Overview
Data analytics is a systematic approach to extracting insights from data. There are several
types of data analytics, each answering different questions and using different methods.
Robust analytics projects often combine multiple types to solve real-world problems and
support decision-making.
Key Steps:
Example:
Suppose you have sales data from different regions, each using different date formats. EDA
involves converting all dates to a standard format, checking for missing entries, and getting
a sense of the data’s distribution.
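The date-standardization step could look roughly like this. The three input formats are assumptions about what the regions might use; a real dataset needs its own list, and ambiguous formats (day-first versus month-first) have to be resolved per source rather than guessed.

```python
from datetime import datetime

# Assumed formats used by different regions; extend as needed.
KNOWN_FORMATS = ["%d-%m-%Y", "%Y/%m/%d", "%m/%d/%Y"]


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if missing or unparseable."""
    if not text or not text.strip():
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


raw_dates = ["15-03-2024", "2024/03/16", "03/17/2024", "", "not a date"]
clean_dates = [to_iso_date(d) for d in raw_dates]
missing = clean_dates.count(None)
print(clean_dates)
print(f"Missing or invalid entries: {missing}")
```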
2. Descriptive Analytics
Purpose:
To answer “What happened?” by summarizing historical data and identifying basic trends
and patterns.
Typical Methods:
Calculating measures like mean, median, mode, max, and min.
Example:
A grocery store owner wants to know the highest bill generated on a given day. Descriptive
analytics finds the maximum transaction value from the day’s data.
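A small sketch of these summary measures using Python's built-in `statistics` module. The bill amounts are hypothetical.

```python
import statistics

# Hypothetical bill amounts for one day.
bills = [250, 120, 980, 120, 430]

summary = {
    "max": max(bills),                      # highest bill of the day
    "min": min(bills),
    "mean": statistics.mean(bills),
    "median": statistics.median(bills),
    "mode": statistics.mode(bills),         # most frequent amount
}
print(summary)
```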
Visual Example:
Bar Chart: Sales by Day
| Sales ($) |
| | ■
| | ■ ■
| | ■■ ■■
|___________|____________
Mon Tue Wed
3. Diagnostic Analytics
Purpose:
To answer “Why did it happen?” by digging deeper into the data to uncover causes and
relationships.
Typical Methods:
Hypothesis testing
Correlation analysis
Regression analysis
Example:
A retail owner notices low sales on Mondays. Diagnostic analytics tests whether this is
because most customers shop on weekends, for example by measuring the correlation between
day of the week and sales or by fitting a regression.
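One way to check such a relationship is the Pearson correlation between a weekend indicator and daily sales, sketched below with invented numbers. A high correlation supports the weekend explanation but does not prove it; one week of data is far too little for a real conclusion.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical week, Monday to Sunday: 1 marks a weekend day, 0 a weekday.
is_weekend = [0, 0, 0, 0, 0, 1, 1]
daily_sales = [200, 260, 250, 270, 300, 520, 560]

r = pearson(is_weekend, daily_sales)
print(f"Correlation between weekend and sales: {r:.2f}")
```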
Visual Example:
Scatter Plot: Sales vs. Day of Week
4. Predictive Analytics
Purpose:
To answer “What could happen?” by using historical data to forecast future outcomes.
Typical Methods:
Regression analysis
Example:
Netflix uses predictive analytics to recommend shows to users based on their past viewing
behavior.
Visual Example:
Line Chart: Forecasted Sales
| Sales ($) |
| | /\
| | / \
| | / \
|___________|____________
Jan Feb Mar Apr (future)
5. Prescriptive Analytics
Purpose:
To answer “What should we do?” by recommending actions and weighing the pros and cons
of different options.
Typical Methods:
Optimization algorithms
Simulation models
Cost-benefit analysis
Example:
A retailer wants to choose between different ad campaigns. Prescriptive analytics estimates
the likely sales and costs of each, helping decide which campaign to run.
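The simplest form of this comparison is a cost-benefit calculation, sketched below. The campaign names and the sales and cost estimates are hypothetical; in practice the estimates would come from a predictive model, and the decision rule might also weigh risk.

```python
# Hypothetical estimates for each candidate campaign.
campaigns = {
    "Social media": {"expected_sales": 50_000, "cost": 12_000},
    "TV spot": {"expected_sales": 80_000, "cost": 55_000},
    "Email": {"expected_sales": 30_000, "cost": 3_000},
}


def net_benefit(estimate):
    """Expected sales minus campaign cost."""
    return estimate["expected_sales"] - estimate["cost"]


# Recommend the campaign with the highest expected net benefit.
best = max(campaigns, key=lambda name: net_benefit(campaigns[name]))
for name, estimate in campaigns.items():
    print(f"{name}: net benefit {net_benefit(estimate)}")
print(f"Recommended campaign: {best}")
```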
Visual Example:
Stacked Area Chart: Impact of Different Campaigns Over Time
| Type | Key Question | Example Scenario | Typical Visuals |
|------|--------------|------------------|-----------------|
| Exploratory | What’s in the data? | Initial review of sales data | Histograms, Boxplots |
| Descriptive | What happened? | Find max daily sales | Bar, Line, Pie Charts |
| Predictive | What could happen? | Forecast next month’s sales | Line, Area Charts |
| Prescriptive | What should we do? | Choose best ad campaign | Stacked Area, Treemap |
Key Takeaways
Exploratory analysis is the foundation—always start here to understand your data.
Predictive analytics uses past data to forecast the future; prescriptive analytics
recommends actions.
Data visualization is essential for communicating findings—choose the right chart for
your question and data type.
Real-world analytics projects often combine several types for robust, actionable insights.
Data Analytics Process
The Data Analytics Process: Detailed Notes with Examples
and Visuals
Example:
A social media content creator wants to increase the number of views on their reels.
Example:
“I want to increase the number of views on my reels by 5% in the next month.”
Visual:
[Metric: Number of Views]
[Target: +5% in 1 month]
Example:
A content creator gathers engagement data (likes, views, comments) from Instagram,
YouTube, and Facebook.
Example:
Visual:
[Raw Data] → [Remove Errors] → [Standardize Formats] → [Clean Data]
Example:
Look for correlations (e.g., do reels with hashtags get more views?).
Visual:
| Views |
| | ■
| | ■ ■
| | ■■ ■■
|_______|____________
Mon Tue Wed
| Views |
| /\ /
| / \/
|_/______ Time
Purpose:
Make sense of the findings. Link the results back to your original question and goal.
Example:
Example:
Create a dashboard showing daily views, best-performing reels, and engagement trends.
Visual:
Dashboard Example:
+-------------------------------+
| [Line Chart: Views Over Time]|
| [Bar Chart: Views by Day] |
| [Table: Top Performing Reels]|
+-------------------------------+
Purpose:
Implement recommendations, then track if your actions lead to the desired outcome.
Example:
| Step | Purpose | Example | Visual |
|------|---------|---------|--------|
| Define Problem/Goal | Set clear question & outcome | Increase reel views | Goal Diagram |
| Set Metric | Choose what/how to measure | +5% views in 1 month | KPI Target |
| Gather Data | Collect from all sources | Instagram, YouTube, Facebook | Data Flow Chart |
| Clean Data | Fix errors, standardize | Remove duplicates, fix dates | Data Pipeline |
| Interpret Results | Link findings to question | Weekends best for posting | Insights Box |
Real-World Example: Netflix Recommendations
Data Collected: What users watch, when, device used, pauses, ratings, searches.
Impact: Over 75% of viewer activity is driven by these data-driven suggestions, boosting
user engagement and business success.
Key Principle: Good visuals tell a clear story and highlight what matters most.
Key Takeaways
The data analytics process is stepwise: define the problem, set measurable goals, gather
and clean data, analyze, interpret, present, and act.
Each step builds on the previous, ensuring reliable and actionable insights.
Real-world examples like Netflix and retail analytics show the power of this process in
action.
Tip for Self-Learning:
Practice each step with a small project: pick a question, collect some data (even from your
own social media or daily activities), and walk through the process, using charts and
dashboards to communicate your findings.
A. Structured Data
Examples:
Survey forms
Features:
Visual:
+--------+-------+-------+
| Name | Age | City |
+--------+-------+-------+
| Alice | 23 | Pune |
| Bob | 35 | Delhi |
+--------+-------+-------+
B. Semi-Structured Data
Definition: Data that has some organizational properties (like tags or hierarchies) but
does not fit rigid tables.
Examples:
Features:
Visual:
{
"id": "P001",
"name": "Gaming Monitor",
"specifications": {
"screen_size": "27 inch",
"refresh_rate": "165Hz"
}
}
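A short sketch of reading such a record in Python with the standard `json` module. The `weight` field is a hypothetical key that is absent on purpose, to show that semi-structured records do not guarantee every field.

```python
import json

raw = """
{
  "id": "P001",
  "name": "Gaming Monitor",
  "specifications": {
    "screen_size": "27 inch",
    "refresh_rate": "165Hz"
  }
}
"""

product = json.loads(raw)

# Nested values are reached by following the keys level by level.
refresh_rate = product["specifications"]["refresh_rate"]

# A field may be missing; .get returns None instead of raising KeyError.
weight = product["specifications"].get("weight")

print(product["name"], refresh_rate, weight)
```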
C. Unstructured Data
Examples:
Features:
Requires advanced processing (NLP for text, image processing for visuals).
Examples:
Age of users
Prices of goods
Sales/revenue data
Stock prices
Temperature, rainfall
Features:
Visual:
+-------+--------+
| Item | Price |
+-------+--------+
| Pen | 10 |
| Book | 150 |
+-------+--------+
B. Text (Qualitative) Data
Examples:
Customer reviews
Emails, memos
Blog posts
Features:
C. Visual Data
Examples:
Product photos
Features:
Visual:
[Image: Product photo]
[Video: Assembly line footage]
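| Type | Description | Examples | Suitable Charts |
|------|-------------|----------|-----------------|
| Nominal | Categories with no order | Red, Blue, Green | Bar chart, Pie chart |
| Ordinal | Categories with a natural order | Low, Medium, High | Bar chart, Ordered bar chart |
| Continuous | Measurable quantities, can take any value | 1.5m, 2.3kg, 37.5°C | Histogram, Boxplot, Line chart |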
Scatterplots: Show relationships between two variables (e.g., height vs. weight).
Area Charts: Show cumulative totals over time (e.g., sales volume by region).
| Sales |
| /\ /
| / \/
|_/______ Time
5. Real-World Scenarios
Structured Data:
Retail sales spreadsheet – Easily filter to find top-selling products.
Semi-Structured Data:
Email dataset – Analyze sender/receiver patterns, but email body needs text processing.
Unstructured Data:
Social media images – Use image recognition to identify brand logos in posts.
6. Key Takeaways
Data can be structured, semi-structured, or unstructured based on organization.
The structure and content of your data determine which analysis and visualization
methods you can use.
Understanding your data type is the first step in any analytics project.
1. What is Data?
Data is any information—facts, statistics, or observations—that can help us make decisions
or uncover insights. In data analytics, understanding the type, structure, and format of your
data is a crucial first step before any analysis.
Definition: Data that represents categories or groups without any inherent order or
ranking.
Examples:
How to Analyze:
Frequency tables
Bar charts
Definition: Data that represents categories with a clear order or ranking, but the
differences between the ranks are not necessarily equal or quantifiable.
Examples:
How to Analyze:
Definition: Numeric data that can only take specific, separate values (usually whole
numbers).
Examples:
How to Analyze:
Visual Example:
| 10 customers
| ■■■■■■■
| 15 customers
Definition: Numeric data that can take any value within a range, including fractions and
decimals.
Examples:
Height: 1.75 m
Temperature: 22.5°C
How to Analyze:
Visual Table:
A. Structured Data
Visual Example:
+--------+-------+--------+
| Name | Age | City |
+--------+-------+--------+
| Alice | 23 | Pune |
| Bob | 35 | Delhi |
+--------+-------+--------+
B. Semi-Structured Data
Definition: Has some structure but not rigid tables (uses tags or keys).
Visual Example:
{
"name": "Alice",
"age": 23,
"city": "Pune"
}
C. Unstructured Data
Example of a CSV file (a structured, tabular text format):
Name,Age,City
Alice,23,Pune
Bob,35,Delhi
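A minimal sketch of loading this CSV text with Python's standard `csv` module. The text is read from a string here so the example is self-contained; with a real file you would pass an open file object instead.

```python
import csv
import io

csv_text = "Name,Age,City\nAlice,23,Pune\nBob,35,Delhi\n"

# DictReader uses the first line as column names for every later row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values are always read as strings, so numbers need explicit conversion.
ages = [int(row["Age"]) for row in rows]
average_age = sum(ages) / len(ages)

print(rows[0]["Name"], rows[0]["City"])
print(f"Average age: {average_age}")
```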
7. Real-World Example: Netflix Data Analytics
Netflix collects a variety of data types:
9. Key Takeaways
Understanding data types (nominal, ordinal, discrete, continuous) is essential for correct
analysis and visualization.
The type and format of data determine which tools and statistical methods you should
use.
Real-world analytics projects, like those at Netflix, combine multiple data types and
formats to generate insights and drive decisions.
Data is no longer just numbers in spreadsheets; it now includes text, images, sensor
readings, and more, reflecting human behavior, choices, and experiences.
A. Relational Databases
Definition: Organized collections of structured data stored in tables (rows and columns)
with defined relationships.
Examples:
Customer transactions
Business activities
Use Case:
Linking customer purchase data with product details and payment methods to
analyze sales or plan promotions.
Examples:
Use Case:
F. Satellite Imagery
Definition: Images and data captured from satellites for various analyses.
Examples:
Weather forecasting
Use Case:
A. First-Party Data
Examples:
Website analytics
Transaction records
B. Third-Party Data
Examples:
Economic indicators
Definition: Data gathered specifically for a project, often via surveys, interviews, or
observations.
Sample Data: A representative subset of the population, used when collecting all data is
impractical.
Importance: Sampling saves time and resources but must be done carefully to avoid
bias and ensure representativeness.
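A simple random sample, where every record has the same chance of selection, is one common way to keep a sample representative. The sketch below draws one from an invented population of order values and compares the two averages; the population, seed, and sample size are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the example is repeatable

# Hypothetical population: order values for 10,000 customers.
population = [rng.uniform(100, 1000) for _ in range(10_000)]

# Simple random sample: every customer is equally likely to be picked.
sample = rng.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"Population mean: {population_mean:.1f}")
print(f"Sample mean:     {sample_mean:.1f}")
```

If the sample were instead taken from, say, only the first 500 customers to sign up, the two averages could differ systematically, which is the bias the note above warns about.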
Types:
Databases (structured)
Data warehouses
Data lakes (for unstructured/semi-structured data)
Importance: Efficient data storage and retrieval streamline the analytics process and
support troubleshooting.
A. Relational Databases
Advantages:
Minimize redundancy
Limitations:
B. Non-Relational Databases
Advantages:
1. Purchases: Customer ID, Product ID, Purchase Date
How It Works:
Link tables via Product ID and Customer ID to analyze sales, customer behavior, and
product performance.
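The linking step can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `purchases` columns follow the list above; the `products` table, its columns, and all rows are hypothetical stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE purchases (customer_id INTEGER, product_id INTEGER, purchase_date TEXT);
INSERT INTO products VALUES (1, 'Pen', 10), (2, 'Book', 150);
INSERT INTO purchases VALUES
    (101, 1, '2024-01-05'),
    (102, 2, '2024-01-06'),
    (101, 2, '2024-01-07');
""")

# Join the two tables on the shared product_id key, then total per product.
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS times_bought, SUM(p.price) AS revenue
    FROM purchases AS pu
    JOIN products AS p ON p.product_id = pu.product_id
    GROUP BY p.name
    ORDER BY revenue DESC
""").fetchall()
conn.close()

for name, times_bought, revenue in rows:
    print(name, times_bought, revenue)
```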
8. Key Takeaways
Data sources are more diverse than ever: databases, files, APIs, web scraping, streams,
satellite imagery, and more.
Relational databases are ideal for structured data; non-relational databases handle
variety and volume.
The right choice of data source and storage depends on your data’s structure, intended
use, and analysis needs.
Self-Learning Tip:
Practice identifying and categorizing data sources in your daily life (e.g., your bank’s app,
online shopping, or social media), and sketch simple diagrams to visualize how data flows
from source to analysis.
Choosing the right repository is crucial for efficient data analysis, reporting, and
decision-making.
A. Relational Databases
Definition: Store data in structured tables with rows and columns; relationships between
tables are defined using keys.
Best for: Structured data with clear relationships (e.g., customer orders, inventory).
Visual:
+-----------+      +-----------+
| Customers |      | Orders    |
+-----------+      +-----------+
| CustID    |<---->| CustID    |
| Name      |      | OrderID   |
+-----------+      +-----------+
Definition: Designed for flexibility, scale, and speed; handle semi-structured or
unstructured data.
Types:
Example: User session data where each user ID (key) maps to session info
(value).
Visual:
{
"order_id": 101,
"items": ["Pen", "Notebook"],
"total": 150
}
Column Store (wide-column): Data organized into column families rather than fixed rows (e.g., Cassandra).
Example: Time-series sensor data, where each column family holds readings for
a device.
Graph Store: Data as nodes and edges to represent relationships (e.g., Neo4j).
Example: Social network where nodes are users and edges are friendships.
Visual:
[Alice]---(friend)---[Bob]
| |
(likes) (follows)
| |
[Post1] [Charlie]
Best for: High-volume, high-velocity, and varied data formats (e.g., social media, IoT,
product catalogs).
C. Data Warehouses
Definition: Large, centralized repositories that consolidate cleaned and structured data
from multiple sources for analysis and reporting.
Features:
Best for: Company-wide analytics, business intelligence, and historical trend analysis.
D. Data Marts
Features:
Example: A sales data mart for the sales team, containing only sales-related data.
Visual:
[Data Warehouse]
|
[Sales Data Mart]
[HR Data Mart]
[Finance Data Mart]
E. Data Lakes
Definition: Storage repositories that can hold vast amounts of raw data in its original
format—structured, semi-structured, or unstructured.
Features:
Best for: Big data, machine learning, and scenarios where you want to retain all data for
future use.
3. Choosing the Right Repository
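| Repository | Data Structure | Best For | Example Use Cases |
|------------|----------------|----------|-------------------|
| NoSQL Database | Flexible | Big data, varied formats, fast access | Social media, IoT, product catalogs |
| Data Warehouse | Structured | Analytics, reporting, historical trends | Company-wide BI, sales trend analysis |
| Data Lake | All formats | Big data, raw/unprocessed data | Machine learning, storing all raw logs |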
4. Visual Summary
5. Key Takeaways
Relational databases are best for structured, linked data.
NoSQL databases offer flexibility for varied, high-volume data.
Data warehouses centralize cleaned data for analysis; data marts focus on specific
needs.
Data lakes store all types of data in their native formats, ideal for big data and advanced
analytics.
The choice of repository depends on your data’s structure, volume, and intended use.
1. Introduction
As data analytics becomes central to business and research, handling data responsibly is
critical. This involves not just technical skills, but also understanding data privacy, data
security, and data ethics. These principles ensure individuals’ rights are protected and
organizations use data in trustworthy, lawful, and ethical ways.
2. Data Privacy
Definition:
Data privacy refers to the principles and regulations governing how personal data is
collected, used, stored, shared, and handled.
Why It Matters:
Ensures compliance with laws (e.g., GDPR, CCPA, DPDP)
When Is It a Concern?
When data can be used to uniquely identify an individual (not just aggregate statistics)
Personally Identifiable Information (PII): Data that can directly identify a person (e.g.,
name, phone number, email, Aadhaar/PAN number, date of birth)
Personal Information: Includes PII and other data that can be linked to a
person/household (e.g., IP address, geolocation, case numbers)
Sensitive Information: Data that, if leaked, can cause harm (e.g., genetic data,
political/religious beliefs)
Visual Example:
+---------------------+---------------------+---------------------+
| PII | Personal Info | Sensitive Info |
|---------------------|---------------------|---------------------|
| Name, Email, | IP Address, | Genetic Info, |
| Phone, Aadhaar | Geolocation, | Political Beliefs, |
| | Video w/ Face | Religion |
+---------------------+---------------------+---------------------+
A. Data Anonymization
Methods:
Example Table:
| Name | Age | City | → | ID123 | 25 | Delhi |
|----- |-----|---------| | ID124 | 32 | Mumbai |
Generalization: Reduce detail (e.g., replace full address with city, exact date of birth
with age).
Perturbation: Add small, random changes (“noise”) to data so exact values are
hidden but overall trends remain.
Example: Salary values are slightly altered for privacy, but average remains
accurate.
Data Masking: Hide part of sensitive data but keep its structure.
Visual:
Original: 1234-5678-9012
Masked: XXXX-XXXX-9012
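Two of the methods above, masking and perturbation, can be sketched in a few lines of Python. The function names, the salary values, and the noise range are illustrative assumptions. Note that with random noise the average stays close to the original rather than exactly equal to it.

```python
import random


def mask_number(number, visible=4, mask_char="X"):
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in number)
    to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


def perturb(values, max_noise, rng):
    """Add independent uniform noise in [-max_noise, +max_noise] to each value."""
    return [v + rng.uniform(-max_noise, max_noise) for v in values]


print(mask_number("1234-5678-9012"))  # XXXX-XXXX-9012

rng = random.Random(0)  # fixed seed so the example is repeatable
salaries = [42_000, 55_000, 61_000, 48_000, 73_000]  # hypothetical values
noisy = perturb(salaries, max_noise=500, rng=rng)
shift = abs(sum(noisy) / len(noisy) - sum(salaries) / len(salaries))
print(f"Average moved by about {shift:.0f}")
```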
4. Data Security
Definition:
Protecting data from unauthorized access, breaches, and misuse.
Collect only necessary data (minimize risk)
Visual:
[Data Repository]
|
[Access Controls]
|
[Only authorized users]
5. Data Ethics
Definition:
Moral principles guiding how organizations collect, use, and share data.
Key Principles:
6. Legal Regulations
GDPR (European Union): Strict rules on consent and data use
Always check the relevant law for your region and data type.
8. Summary Table
9. Key Takeaways
Data privacy is about protecting individual identities in your data.
Data security ensures only authorized people can access and use data.
Data ethics guides you to use data in a fair, transparent, and responsible way.
Always comply with relevant data protection laws.
Start with your question: What problem are you trying to solve?
Determine required information: What specific details will help answer your question?
Be precise: Move from a vague idea to a detailed list of variables and data sources.
Example:
A retail company wants to run a targeted ad campaign for the festive season to boost sales.
Customer complaints
Visual:
[Business Question] → [List of Needed Data Types]
Visual:
[Customer DB] [Web Analytics] [Survey Tool] [Support Logs] [Social Media]
\ | | | /
\ | | | /
+-------------------------------------------------------------+
| Data Sources for Retail Ad Campaign |
+-------------------------------------------------------------+
3. Deciding on Data Collection Methods
Factors Influencing Methods:
Common Methods:
SQL Queries: For structured data in relational databases (e.g., customer profiles,
purchase history).
APIs: For web-based or platform data (e.g., social media feeds, some web analytics).
Web Scraping: For extracting information from web pages (e.g., competitor prices,
product reviews).
Data Streams: For real-time, continuously updating data (e.g., IoT sensors, live website
activity).
Some data is static (e.g., past purchases), some is dynamic (e.g., live website visits).
Decide how often to collect or refresh each data type (e.g., hourly, daily, weekly).
Sample Size:
Sometimes, you can’t collect all data—choose a representative subset (sample).
Larger samples generally yield more reliable insights, but take more resources to
process.
Steps:
Import unstructured data (e.g., text, images) into data lakes or specialized repositories.
Use ETL tools or data pipelines (Extract, Transform, Load) to automate and manage the
import process.
Tools:
Repository types:
NoSQL DB (for semi-structured)
2. Locate:
Profiles in company DB
3. Plan Collection:
4. Import:
Use ETL pipeline to bring all datasets into a central data warehouse or data lake.
7. Key Takeaways
Identifying data needs is driven by your project’s main question.
Sources and methods vary by data type, format, and update frequency.
Tools and techniques (SQL, APIs, ETL, web scraping) are chosen based on data structure
and source.
[Define Question]
↓
[Identify Data Needed]
↓
[Locate & Gather Data]
↓
[Import to Repository]
↓
[Ready for Cleaning & Analysis]
This systematic approach ensures you start your data analytics project with the right,
well-organized data foundation.
1. Why Data Cleaning Matters
Definition: Data cleaning is the process of correcting or removing inaccurate,
incomplete, or irrelevant data from a dataset to ensure quality analysis.
Importance: Unclean data can lead to misleading results, faulty conclusions, and poor
business decisions. Cleaning is often the most time-consuming yet crucial step in data
analytics.
How:
Compare against rules (e.g., all entries in a "date" column should be valid dates).
B. Cleaning the Data
Common Tasks:
Remove Duplicates: Keep only unique records (e.g., one transaction per transaction
ID).
Format Data Consistently: Ensure all entries of a variable have the same data type
and format (e.g., all dates in DD-MM-YYYY).
Standardize Units: Convert all measurements to the same unit (e.g., all prices in
INR, all weights in kg).
Fix Typos and Inconsistencies: Standardize spelling, remove extra spaces, unify
abbreviations (e.g., "UP" and "Uttar Pradesh" to "Uttar Pradesh").
Handle Missing Values: Decide whether to drop, fill, or impute missing data.
Visual Example:
Before Cleaning:
| Name | State |
|--------|--------------|
| Priya | Uttar Pradesh|
| Rahul | UP |
| Bhavna | Rajastan | ← Typo!
| Amit | Rajasthan |
| Suman | Rajasthan |
| Suman | Rajasthan | ← Duplicate
After Cleaning:
| Name | State |
|--------|--------------|
| Priya | Uttar Pradesh|
| Rahul | Uttar Pradesh|
| Bhavna | Rajasthan |
| Amit | Rajasthan |
| Suman | Rajasthan |
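The before/after tables above can be reproduced with a short Python sketch. The correction mapping is an assumption written by hand for this example, and duplicates are detected on the whole record because the table has no unique ID column; with a real dataset, two different people named Suman from the same state would need an ID to be told apart.

```python
# Hand-written corrections for known abbreviations and typos (assumed mapping).
STATE_FIXES = {"UP": "Uttar Pradesh", "Rajastan": "Rajasthan"}

rows = [
    ("Priya", "Uttar Pradesh"),
    ("Rahul", "UP"),
    ("Bhavna", "Rajastan"),
    ("Amit", "Rajasthan"),
    ("Suman", "Rajasthan"),
    ("Suman", "Rajasthan"),
]

cleaned = []
seen = set()
for name, state in rows:
    state = state.strip()
    state = STATE_FIXES.get(state, state)  # standardize spelling
    record = (name.strip(), state)
    if record not in seen:                 # drop exact duplicates
        seen.add(record)
        cleaned.append(record)

for name, state in cleaned:
    print(f"{name:<8}{state}")
```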
C. Verifying the Data
Goal: Confirm that all issues identified during inspection have been addressed.
How:
Visual Example:
Remove duplicates, format cells, find/replace, trim spaces, spell check, pivot tables.
Google DataPrep:
Automate cleaning, handle large datasets, merge, transform, impute, and validate
data.
R (dplyr, data.table):
Visual:
Step 2: Clean
Step 3: Verify
6. Visual Workflow Summary
[Raw Data]
↓
[Inspect: Spot errors, visualize]
↓
[Clean: Remove duplicates, fix formats, handle missing]
↓
[Verify: Re-check, summarize, document]
↓
[Analysis-Ready Data]
7. Key Takeaways
Data cleaning is essential for reliable analytics.
Common tasks: Remove duplicates, standardize formats/units, fix typos, handle missing
values.
Use tools like Excel, Python (pandas), R, and Google DataPrep for efficient cleaning.
Data Cleaning: Detailed Notes with Examples and Visuals
How: Identify records with the same unique identifier (e.g., transaction ID, customer ID)
and keep only one.
Example Table:
Visual:
How:
Convert column names to a consistent style (e.g., replace spaces with underscores,
use title case).
Python Example:
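A minimal sketch of this step, assuming pandas. The messy column names below are hypothetical:

```python
import pandas as pd

# Hypothetical table with inconsistent column names (stray spaces, mixed case).
df = pd.DataFrame(columns=["transaction id", " Product Name", "amount  paid"])

df.columns = (
    df.columns.str.strip()                           # drop leading/trailing spaces
              .str.replace(r"\s+", "_", regex=True)  # spaces -> underscores
              .str.title()                           # title case each word
)
print(list(df.columns))
```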
Visual:
Why: Missing values are common and, if not handled, can bias results or break analysis.
How:
Fill (Impute): Use a placeholder (e.g., 'NA', 'Not Applicable'), mean/median for
numeric, or a special value.
Examples:
Visual:
[Raw Data: Missing Values] → [Impute or Drop] → [Clean Data]
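The impute-or-drop choice might look like this in pandas. The data is hypothetical, and filling with the median and with 'NA' are just two of the options listed above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Jaipur", None, "Lucknow", None],
})

# Count missing values per column before deciding what to do.
print(df.isna().sum())

# Option 1 (impute): median for the numeric column, a placeholder for text.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna("NA")

# Option 2 (drop): keep only rows with no missing values at all.
dropped = df.dropna()
```

Dropping halves this small table, which shows why deletion should be a deliberate choice rather than a default.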
Why: Errors (like negative ages) and inconsistencies (like mismatched values across
datasets) can mislead analysis.
How:
Use common sense and domain knowledge to spot impossible or unlikely values
(e.g., age = -5 or 150).
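A range check is one simple way to encode this kind of domain knowledge. The limits 0 and 120 below are assumptions chosen for illustration, not fixed rules:

```python
import pandas as pd

# Hypothetical records; -5 and 150 are the impossible ages from the example.
df = pd.DataFrame({"name": ["A", "B", "C", "D"], "age": [34, -5, 150, 62]})

MIN_AGE, MAX_AGE = 0, 120  # assumed plausible range for a human age

# between() is inclusive on both ends; ~ selects the values outside the range.
out_of_range = ~df["age"].between(MIN_AGE, MAX_AGE)
suspect = df[out_of_range]
print(suspect)
```

Flagged rows are better reviewed than silently deleted, since an impossible value often points to a data-entry or unit problem.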
Examples:
Visual:
How:
Continuous Variables:
Standardization (Z-score):
$$z = \frac{x_i - \bar{x}}{s}$$
Min-Max Scaling:
$$x' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
Log Transformation:
$$x' = \log(x_i)$$
Categorical Variables:
Gender_Male Gender_Female
1 0
0 1
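The three continuous transformations and the one-hot encoding shown above can be sketched together. The income values and column names are hypothetical, and `pd.get_dummies` is one common way to build the Gender_Male / Gender_Female columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous and one categorical variable.
df = pd.DataFrame({
    "income": [20000.0, 35000.0, 50000.0, 95000.0],
    "gender": ["Male", "Female", "Female", "Male"],
})

x = df["income"]
df["income_z"] = (x - x.mean()) / x.std()                   # z-score; std() is the sample s
df["income_minmax"] = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]
df["income_log"] = np.log(x)                                # natural log; needs x > 0

# One-hot encoding: one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["gender"], prefix="Gender", dtype=int)
print(encoded[["Gender_Male", "Gender_Female"]])
```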
Visual:
How:
Use unique keys (e.g., customer ID, school code) to join datasets.
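A key-based join might be sketched as follows. The two tables and their column names are hypothetical; `validate` and `indicator` are optional `merge` arguments used here to catch duplicate keys and unmatched rows:

```python
import pandas as pd

# Hypothetical datasets that share the key "customer_id".
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Jaipur", "Lucknow", "Agra"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [250, 100, 75, 60],
})

merged = orders.merge(
    customers,
    on="customer_id",
    how="left",               # keep every order, matched or not
    validate="many_to_one",   # raise an error if customer_id repeats in customers
    indicator=True,           # adds a _merge column describing each row's match
)

# Orders whose customer_id has no match in the customers table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```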
Example Visual:
3. Handling Missing Data: Key Considerations
Understand the reason: Missingness may be due to privacy, non-response,
inapplicability, or technical errors.
Consistent indicators: Use a standard placeholder (e.g., 'NA') for all missing entries.
Be cautious: Deleting records with missing data can bias results, especially if
missingness is not random.
Example:
2. Standardize formats: All dates in DD-MM-YYYY, all state names in full form.
4. Fix errors:
5. Transform variables:
Visual Workflow:
[Raw Data]
↓
[Remove Duplicates]
↓
[Standardize Formats]
↓
[Handle Missing Values]
↓
[Fix Errors]
↓
[Transform/Encode]
↓
[Merge Datasets]
↓
[Analysis-Ready Data]
Unlocks hidden insights: Extracts more information from your data without collecting
new data.
Improves model performance: Well-crafted features can make predictive models more
accurate.
Engineered Features:
Why: Retailers may see more transactions on weekends; knowing the day can reveal
sales patterns.
Visual:
| Date       | Day of Week | Weekend? |
|------------|-------------|----------|
| 2025-02-03 | Monday      | No       |
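Deriving the day and a weekend flag from a date can be sketched with pandas. The two extra dates and the column names are assumptions added so that both outcomes appear:

```python
import pandas as pd

# 2025-02-03 is the date from the visual; the other two are added for contrast.
df = pd.DataFrame({"date": ["2025-02-01", "2025-02-02", "2025-02-03"]})
df["date"] = pd.to_datetime(df["date"])

df["day_of_week"] = df["date"].dt.day_name()      # e.g. "Monday"
df["is_weekend"] = df["date"].dt.dayofweek >= 5   # Monday=0 ... Saturday=5, Sunday=6
print(df)
```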
B. Mathematical Transformations
Why: BMI is more informative for health analysis than height or weight alone.
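BMI is weight in kilograms divided by the square of height in metres. A minimal sketch, assuming hypothetical columns `height_cm` and `weight_kg`:

```python
import pandas as pd

# Hypothetical measurements.
df = pd.DataFrame({"height_cm": [160, 175], "weight_kg": [64.0, 70.0]})

height_m = df["height_cm"] / 100             # convert cm to metres first
df["bmi"] = df["weight_kg"] / height_m ** 2  # BMI = kg / m^2
print(df)
```

Converting units before applying the formula matters: using centimetres directly would shrink every BMI by a factor of 10,000.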
Visual:
C. Aggregating Data
Engineered Feature: Total Amount = Sum of all product amounts in that transaction
Visual:
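| Transaction ID | Amount 1 | Amount 2 | Total Amount |
|----------------|----------|----------|--------------|
| 53006          | 40       | 60       | 100          |
| 53668          | 80       |          | 80           |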
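The same totals can be computed with a group-by, assuming the data is stored with one row per product and hypothetical column names:

```python
import pandas as pd

# One row per product in a transaction, using the amounts from the visual.
items = pd.DataFrame({
    "transaction_id": [53006, 53006, 53668],
    "amount": [40, 60, 80],
})

# Sum the product amounts within each transaction.
totals = (
    items.groupby("transaction_id", as_index=False)["amount"]
         .sum()
         .rename(columns={"amount": "total_amount"})
)
print(totals)
```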
Engineered Feature: Age Group (e.g., 0-18: Child, 19-35: Young Adult, 36-60: Adult, 61+:
Senior)
Visual:
| Age | Age Group   |
|-----|-------------|
| 17  | Child       |
| 28  | Young Adult |
| 62  | Senior      |
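Binning of this kind can be sketched with `pd.cut`. The bin edges below encode the age bands named above for whole-number ages, and the column names are hypothetical:

```python
import pandas as pd

# The three ages from the visual.
df = pd.DataFrame({"age": [17, 28, 62]})

# Right-inclusive edges: [0, 18], (18, 35], (35, 60], (60, inf)
bins = [0, 18, 35, 60, float("inf")]
labels = ["Child", "Young Adult", "Adult", "Senior"]

df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, include_lowest=True)
print(df)
```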
E. Combining Features
Date Level: Total sales per day, number of transactions per day
[Date]
└── [Transaction]
└── [Product 1]
└── [Product 2]
Long Format: Each row is a single observation (e.g., one product in one transaction).
Wide Format: Each row is a higher-level unit (e.g., one transaction), with multiple
columns for each sub-feature (e.g., Amount 1, Amount 2, ...).
Visual Example:
Long Format:
| Transaction ID | Product | Amount |
|----------------|---------|--------|
| 53006          | Banana  | 40     |
| 53006          | Grapes  | 60     |
Wide Format:
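One way to reshape a long table into wide format, sketched with pandas. The product for transaction 53668 and the helper column `item_no` are assumptions made for this example:

```python
import pandas as pd

# Long format: one row per product per transaction.
long_df = pd.DataFrame({
    "transaction_id": [53006, 53006, 53668],
    "product": ["Banana", "Grapes", "Banana"],  # product for 53668 is hypothetical
    "amount": [40, 60, 80],
})

# Number the items within each transaction: 1, 2, ...
long_df["item_no"] = long_df.groupby("transaction_id").cumcount() + 1

# Wide format: one row per transaction, one amount column per item position.
wide_df = (
    long_df.pivot(index="transaction_id", columns="item_no", values="amount")
           .add_prefix("amount_")
           .reset_index()
)
print(wide_df)
```

Transactions with fewer items get missing values in the unused columns, which is the usual cost of the wide layout.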
Raw Feature(s) Engineered Feature Why Useful?
9. Key Takeaways
Feature engineering transforms raw data into more meaningful features.
Well-engineered features can uncover hidden patterns and improve analytics or machine
learning results.
Practice by looking for new features you can create from any dataset you work with.
layer of insight from your dataset.
Summary Statistics:
Frequency Counts: How many times each value or category appears. Example: If
“banana” is purchased 4 times, its frequency is 4.
Range:
Date of purchase
Transaction ID
Product ID
Product name
Amount
Transaction ID Product Amount (₹)
1001 Banana 22
1002 Grapes 56
1004 Banana 22
1005 Potato 56
Product Frequency
Banana 2
Grapes 1
Potato 2
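Frequency counts like these are a one-liner in pandas. The purchase list below is a hypothetical one chosen to match the frequency table:

```python
import pandas as pd

# Hypothetical list of purchased products, one entry per purchase.
products = pd.Series(
    ["Banana", "Grapes", "Potato", "Banana", "Potato"], name="product"
)

# value_counts() returns how many times each distinct value appears.
freq = products.value_counts()
print(freq)
```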
Insights:
Payment Mode Analysis: Combine transaction data with a payment mode dataset
to analyze trends by payment type.
Combining Data Sets:
Joining datasets (e.g., transaction details with payment modes) allows for deeper
insights, such as average transaction value by payment method.
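| Transaction ID | Amount (₹) | Payment Mode |
|----------------|------------|--------------|
| 1001           | 22         | UPI          |
| 1002           | 56         | Cash         |

| Payment Mode | Average Amount (₹) |
|--------------|--------------------|
| UPI          | 31.3               |
| Cash         | 24.3               |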
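The join-then-summarize pattern can be sketched as follows. Only the first two rows come from the notes; the remaining values are hypothetical, so the averages here are not meant to reproduce the figures above:

```python
import pandas as pd

# Transaction amounts (rows 1003 and 1004 are hypothetical).
transactions = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003, 1004],
    "amount": [22, 56, 30, 40],
})
# Payment mode per transaction (rows 1003 and 1004 are hypothetical).
payments = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003, 1004],
    "payment_mode": ["UPI", "Cash", "UPI", "Cash"],
})

# Join on the shared key, then average the amount within each payment mode.
combined = transactions.merge(payments, on="transaction_id", how="inner")
avg_by_mode = combined.groupby("payment_mode")["amount"].mean()
print(avg_by_mode)
```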
Reveals patterns, trends, and outliers that summary statistics may miss.
Labeled Axes: Both axes should be labeled with variable names and units.
Aesthetics: Clean, uncluttered, and uses color purposefully.
| Frequency
| ■■■■■■■■■■■■■
| ■■■■■■■■■■■■■
| ■■■■■■■
|-------------------------
Potato Banana Grapes
Four datasets can have nearly identical summary statistics (mean, variance, correlation)
but look very different when graphed; the classic example is Anscombe's quartet.
Lesson: Always visualize your data; it reveals patterns that numbers alone can’t.
3. Combine Datasets:
5. Visualize:
5. Key Takeaways
Descriptive analytics provides the first, essential look at your data.