
Basics of Data Analytics: Detailed Notes with Examples and Visuals

Introduction to Data Analytics

Data analytics is the systematic process of gathering, cleaning, organizing, analyzing, and interpreting data to uncover patterns and insights that inform real-world decisions.

We live in a "data revolution" era where enormous amounts of data are generated daily
from activities like social media, online transactions, sports, healthcare, and
entertainment.

"Until a few years ago, there were five exabytes of information created from the dawn of
civilization through 2003. Now that much information is created every two days." — Eric
Schmidt, former Google executive chairman

The Data Analytics Pipeline


1. Gathering (Collecting) Data
Definition: The process of collecting data from various sources relevant to the question
you want to answer.

Examples of Data Sources:

Social media clicks and posts

Online purchase transactions

Sports performance statistics

Medical prescriptions

Retail bills

Streaming platform usage (e.g., Netflix)

Video game logs

Visual Example:

[Social Media Clicks] --> [Data Warehouse]

2. Cleaning and Organizing Data
Definition: Removing errors, standardizing formats, and grouping data so it is ready for analysis (e.g., fixing date formats, removing duplicate records).

Visual Example:

[Raw Data] --> [Cleaning Process] --> [Clean, Organized Data]

3. Analyzing Data
Definition: Using statistical and mathematical techniques to identify patterns, trends,
and relationships in the data.

Types of Analysis:

Descriptive analysis (summarizing data)

Correlation analysis (identifying relationships)

Predictive analysis (forecasting future values)

Example:

Descriptive: "Average purchase amount per user is $18."

Correlation: "Users who watch more than 10 videos per week are 30% more likely to
make a purchase."

Predictive: "Forecast sales for next quarter using historical data."

Visual Example:


[Bar Chart] - Sales by Month

| Sales ($) |
| | ■
| | ■ ■
| | ■ ■ ■
|___________|____________
Jan Feb Mar

Tools: Statistical models, data visualization (bar charts, pie charts), sometimes AI for
advanced predictions.

4. Interpreting and Reporting Insights


Definition: Presenting the results of the analysis in a clear, compelling way to support
decision-making.

Methods:

Data visualizations (charts, graphs)

Dashboards (real-time tracking of metrics)

Storytelling with data

Example:

Dashboard showing daily active users, sales, and engagement rates.

Visual Example:


+---------------------+ +-------------------+ +------------------+


| Line Chart: | | Pie Chart: | | Table: |
| User Growth | | Sales by Region | | Top Products |
+---------------------+ +-------------------+ +------------------+

Goal: Make insights understandable to someone who wasn't involved in the analysis and
enable real-world action.

Summary Table: Steps in Data Analytics

| Step | Purpose | Example |
|------|---------|---------|
| Gathering Data | Collect relevant data from various sources | Downloading sales and web traffic logs |
| Cleaning & Organizing | Remove errors, standardize, group data | Fixing date formats, removing duplicates |
| Analyzing Data | Find patterns, trends, relationships | Calculating average sales per region |
| Interpreting Results | Present insights for decision-making | Creating a dashboard for managers |

Key Takeaways
Data analytics transforms raw data into actionable insights through a systematic
process.

Each step—gathering, cleaning, analyzing, and interpreting—is essential for reliable results.

Real-world examples include business sales analysis, sports performance tracking, and
social media engagement optimization.

Next Steps:
In future lessons, you will explore each stage in greater detail and gain hands-on experience
with real datasets and tools.

Who Uses Data Analytics? Sectors, Examples, and Visuals


Data analytics is not limited to a single field—almost every sector today uses data to drive
decisions and improve outcomes. Below are detailed notes with real-world examples and
accompanying visuals to illustrate how different industries leverage data analytics.

1. Business
How Data Analytics is Used:

Analyzing past performance (e.g., profits, sales trends)

Making informed strategic decisions (e.g., launching new products, setting targets)

Evaluating the effectiveness of marketing campaigns

Example:

A company reviews sales data from the last quarter to decide whether to launch a new
ad campaign. They analyze which products sold best and when, helping them choose
the optimal time for the campaign.

Visual: Business Decision Flow

[Sales Data] → [Analysis: Best-Selling Products] → [Insight: Launch Time] → [Decision: Start Campaign]

2. Manufacturing
How Data Analytics is Used:

Managing and optimizing supply chains

Inventory control (avoiding shortages or overstock)

Workforce allocation (assigning workers where needed)

Predictive maintenance (anticipating equipment failures)

Example:

A factory uses analytics to predict when a machine is likely to break down, allowing for
maintenance before a costly failure occurs.

Visual: Manufacturing Analytics

[Sensor Data] → [Predictive Model] → [Alert: Maintenance Needed] → [Prevent Downtime]

3. Retail
How Data Analytics is Used:

Understanding customer buying patterns

Managing stock levels and inventory

Identifying popular products and trends

Personalizing recommendations and marketing

Example:

Amazon uses customer purchase history and browsing patterns to recommend products
and optimize delivery routes.

Visual: Retail Analytics

[Customer Transactions] → [Trend Analysis] → [Stock Optimization & Personalized Offers]

4. Healthcare
How Data Analytics is Used:

Tracking disease prevalence and predicting outbreaks

Improving patient care and operational efficiency

Analyzing prescriptions and medical test data

Developing new drugs and treatments

Example:

During the COVID-19 pandemic, analytics was used to monitor and predict outbreaks,
helping hospitals prepare and allocate resources.

Visual: Healthcare Analytics

[Patient Data] → [Pattern Recognition] → [Outbreak Prediction] → [Resource Allocation]

5. Education
How Data Analytics is Used:

Tracking student progress and identifying learning difficulties

Evaluating curriculum effectiveness

Personalizing learning experiences

Example:

Schools use platforms like Intellischool to visualize student performance, identify those
at risk, and tailor interventions.

Visual: Education Analytics

[Student Scores] → [Progress Visualization] → [Identify At-Risk Students] → [Targeted Support]

6. Banking
How Data Analytics is Used:

Detecting and preventing fraud

Assessing risk and creditworthiness

Personalizing financial products

Predicting customer churn and maximizing customer lifetime value

Example:

Banks use predictive analytics to identify potentially fraudulent transactions in real-time,
reducing losses and increasing security.

Visual: Banking Analytics


[Transaction Data] → [Anomaly Detection Model] → [Fraud Alert] → [Prevent Loss]
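One common baseline for this kind of anomaly detection is a z-score check: flag transactions that are far from an account's usual spending. Below is a toy sketch with assumed amounts, not a production fraud system.

```python
import statistics

# Toy transaction history for one account; the last amount is unusually large.
amounts = [42.0, 55.0, 38.0, 61.0, 47.0, 950.0]

# Baseline built from past behavior (all but the newest transaction)
mean = statistics.mean(amounts[:-1])
stdev = statistics.stdev(amounts[:-1])

def is_suspicious(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) / stdev > threshold

print(is_suspicious(950.0))  # the outlier is flagged
print(is_suspicious(50.0))   # a typical amount is not
```

Real systems combine many such signals (location, merchant, timing) in a trained model, but the anomaly-score idea is the same.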

Summary Table: Sectors and Data Analytics Applications

| Sector | Key Uses of Data Analytics | Example Use Case |
|--------|----------------------------|------------------|
| Business | Strategic planning, marketing, decision-making | Launching new products |
| Manufacturing | Supply chain, inventory, maintenance | Predictive equipment maintenance |
| Retail | Customer insights, inventory, recommendations | Personalized product suggestions |
| Healthcare | Patient care, outbreak prediction, resource allocation | COVID-19 monitoring and response |
| Education | Student progress, curriculum evaluation | Identifying struggling students |
| Banking | Fraud detection, risk, personalization | Real-time fraud alerts |

Key Takeaways
Data analytics is integral across sectors, enabling better, data-driven decisions.

Each industry adapts analytics to its unique challenges—be it predicting machine failures, optimizing marketing, or improving patient care.

The process typically involves collecting data, analyzing it for patterns, and using those
insights for targeted actions.

Visual Summary: Data Analytics Across Sectors

[Business]  [Manufacturing]  [Retail]  [Healthcare]  [Education]  [Banking]
     \            |             |           |             |          /
      +-----------+-------------+-----+-----+-------------+---------+
                                      |
                              [Data Analytics]
                                      |
                            [Informed Decisions]

Data analytics is the backbone of modern decision-making, transforming raw data into
actionable insights across every major sector of the economy.

The Evolution of Data Analytics: Detailed Notes, Examples, and Visuals

What is Data Analytics?


Data analytics is the science and art of using data to make real-world decisions. It involves
collecting, inspecting, cleaning, transforming, and modeling data to discover useful
information, draw conclusions, and support decision-making.

A Brief History of Data Analytics


1. Ancient and Early Human Use
Early Humans: Used pebbles, marks on stones, and bones to track days and seasons—
helping decide when to plant crops or go hunting.

Example Visual:

[Stone with tally marks] → [Track days for planting]

Ancient Civilizations:

Egypt: Census and tax records for administration.

Babylon: Clay tablets for agricultural yields and celestial events.

Greece: Philosophers like Aristotle analyzed social and natural phenomena.

Rome: Data collection for public administration and military logistics.

2. Middle Ages to Renaissance


Fibonacci: Introduced the Fibonacci sequence, used in finance and biology.

John Graunt: Analyzed mortality data in London, foundational for public health statistics.

3. The 17th–18th Centuries: Systematic Approaches


John Napier (1614): Invented logarithms, revolutionizing calculations.

John Graunt (1662): Published statistical analysis of mortality data.

Emergence of Statistical Visualization:

Example Visual (1644):


Michael Florent Van Langren’s line graph showing longitude estimates—a shift from
tables to visual graphs.


[Line graph of longitude estimates by astronomers]

4. The 19th Century: The Golden Age of Statistical Graphics


Florence Nightingale: Used the "Rose Chart" (Coxcomb Chart) to show causes of
mortality in the Crimean War, influencing public health reforms.

John Snow: Mapped cholera outbreaks in London, identifying contaminated water
sources.

Charles Minard: Created a famous chart of Napoleon’s Russian campaign, visualizing multiple data dimensions.

Example Visuals:

Nightingale’s Rose Chart:


[Circular chart with colored segments showing causes of death]

John Snow’s Cholera Map:


[Map with dots showing cholera cases clustered around a water pump]

5. The 20th Century: The Computer Age


1950s–60s: Mainframe computers (e.g., ENIAC) enabled processing of large datasets,
used by governments and banks.

Example: US Census Bureau used computers to process census data much faster
than by hand.

1970s–80s: Rise of business intelligence, relational databases (RDBMS by IBM), and decision support systems.

Example Visual:


[Mainframe Computer] → [Census Data Processing]

Statistical Software: SPSS, SAS, and the spread of spreadsheets (Excel) democratized
data analysis.

Relational Databases & SQL: Efficient storage and retrieval of complex data.

6. The Digital Revolution: 1990s–2000s
Internet Era: Explosion in data collection, storage, and transmission. Data mining and
"big data" concepts emerged.

Example: E-commerce sites collecting user behavior data to optimize sales.

Big Data: Large, varied, and rapidly updating datasets. Tools like Hadoop and Spark
enabled distributed processing.

7. The 21st Century: AI and Real-Time Analytics


AI & Machine Learning: Deep learning, image recognition, natural language processing,
and recommendation systems.

Internet of Things (IoT): Devices generating continuous data streams (e.g., smartwatches, sensors).

Cloud Computing: Scalable, cost-effective data storage and analysis.

Generative AI: Tools like ChatGPT automate and simplify analytics, making advanced
techniques accessible without deep programming knowledge.

Example Visual:

[Smartphone] → [Health Data Stream] → [Cloud AI Analysis] → [Personalized Health Insights]

Key Milestones in Data Visualization


| Era | Example Visualizations | Impact |
|-----|------------------------|--------|
| Ancient | Maps, tally marks | Navigation, agriculture, time tracking |
| 17th Century | Van Langren’s line graph | Statistical comparison |
| 19th Century | Nightingale’s Rose Chart, Snow’s Map | Public health reforms, epidemiology |
| 20th Century | Computer-generated charts, dashboards | Business intelligence, decision support |
| 21st Century | Interactive dashboards, AR/VR | Real-time analytics, immersive exploration |

Summary Visual: The Evolution of Data Analytics


[Ancient Tally Marks]
        ↓
[Maps & Early Charts]
        ↓
[Statistical Graphs]
        ↓
[Mainframe Computers]
        ↓
[Spreadsheets & Databases]
        ↓
[Internet & Big Data]
        ↓
[AI, IoT, Cloud, Generative AI]

Conclusion & Takeaways


Data analytics has evolved from simple manual tracking to advanced AI-driven insights.

Each era brought new tools and techniques, making data analysis faster, broader, and
more accessible.

Today, data analytics is central to decision-making in every sector, powered by cloud computing, IoT, and AI.

The field continues to evolve rapidly—lifelong learning and adaptability are essential for
anyone in data analytics.

Next Steps:
Explore hands-on exercises with spreadsheets, databases, and AI tools to experience the
evolution of data analytics firsthand.

Data Analyst Roles, Responsibilities, Tools, and Skills: Detailed Notes with Examples and Visuals

1. What Does a Data Analyst Do?


A data analyst is responsible for turning raw data into meaningful insights that inform real-
world decisions. Their work spans the entire data analytics pipeline, from collecting data to
presenting actionable findings.

2. Core Responsibilities of a Data Analyst

A. Data Collection and Acquisition

Description: Gathering data from various sources, which may include databases, APIs,
spreadsheets, or web scraping.

Example: A retail analyst collects sales data from the company’s database and customer
reviews from online platforms.

Visual:

[Database]    [API]    [Web]
      \         |        /
       \        |       /
      +------------------+
      |   Data Analyst   |
      +------------------+

B. Data Cleaning and Preparation

Description: Ensuring data is accurate, complete, and formatted for analysis. This
includes removing duplicates, correcting errors, and handling missing values.

Example: Cleaning a sales dataset by removing duplicate transactions and standardizing date formats.

Visual:


Raw Data --> [Remove Duplicates] --> [Fix Errors] --> Clean Data
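A minimal Python sketch of this cleaning step, using toy records and two assumed date formats (not a real company dataset):

```python
from datetime import datetime

# Toy raw records with a duplicate entry and mixed date formats
raw = [
    {"id": 1, "date": "2024-03-05", "amount": 20.0},
    {"id": 1, "date": "2024-03-05", "amount": 20.0},   # duplicate transaction
    {"id": 2, "date": "05/03/2024", "amount": 35.5},   # different date format
]

def clean(records):
    """Remove duplicates and standardize dates to YYYY-MM-DD."""
    seen, out = set(), []
    for rec in records:
        key = (rec["id"], rec["amount"])
        if key in seen:          # skip exact duplicates
            continue
        seen.add(key)
        date = rec["date"]
        # Try the two formats present in this toy dataset
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                date = datetime.strptime(date, fmt).strftime("%Y-%m-%d")
                break
            except ValueError:
                pass
        out.append({**rec, "date": date})
    return out

cleaned = clean(raw)
print(cleaned)  # two records, both with dates in YYYY-MM-DD form
```

In practice a library such as pandas (`drop_duplicates`, `to_datetime`) does this at scale, but the logic is the same.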

C. Data Organization and Integration

Description: Structuring and merging data from different sources into a usable format.

Example: Combining sales and marketing data to analyze the impact of promotions on
sales.

Visual:


[Sales Data] + [Marketing Data] --> [Integrated Dataset]

D. Data Analysis and Exploration

Description: Using statistical techniques to identify trends, patterns, and relationships in the data.

Example: Analyzing customer purchase patterns to identify peak shopping times.

Visual:


[Clean Data] --> [Statistical Analysis] --> [Trends & Patterns]

E. Advanced Analytics and Modeling

Description: Applying advanced methods such as regression, predictive modeling, or machine learning to forecast outcomes or explain relationships.

Example: Using regression analysis to predict next quarter’s sales based on historical
data.

Visual:


[Historical Data] --> [Regression Model] --> [Sales Forecast]
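A least-squares sketch of the regression example, implemented from scratch on toy quarterly figures (illustrative numbers, not real sales data):

```python
# Fit a straight line to quarterly sales, then forecast the next quarter.
quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 125.0, 135.0]

n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(sales) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, sales))
         / sum((x - mean_x) ** 2 for x in quarters))
intercept = mean_y - slope * mean_x

forecast_q5 = intercept + slope * 5
print(slope, forecast_q5)
```

In practice you would reach for scikit-learn's `LinearRegression` or a statistics package, but this is the model they fit underneath.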

F. Data Visualization and Reporting

Description: Presenting findings through charts, graphs, dashboards, and written reports to communicate insights effectively.

Example: Creating a dashboard in Tableau to show monthly sales trends.

Visual:


[Analysis Results] --> [Bar Chart / Line Graph / Dashboard]

G. Documentation and Workflow Management

Description: Keeping detailed records of analysis steps, decisions, and methods for
reproducibility and teamwork.

Example: Documenting how data was cleaned and which variables were used in a
model.

Visual:


[Analysis Steps] --> [Documentation] --> [Team Collaboration]

3. Essential Tools for Data Analysts

| Tool Type | Examples | Purpose |
|-----------|----------|---------|
| Spreadsheets | Excel, Google Sheets | Data entry, basic analysis, visualization |
| Statistical Software | Python (pandas, scikit-learn), R | Advanced analysis, modeling, automation |
| Visualization Tools | Tableau, Power BI, Matplotlib | Creating charts, dashboards, visual storytelling |
| Databases | SQL, MySQL, NoSQL (MongoDB) | Storing, querying, and managing large datasets |
| Big Data Tools | Hadoop, Spark | Processing and analyzing very large datasets |
| AI/AutoML Tools | H2O AI, AutoML | Automated machine learning and modeling |

4. Key Skills for Data Analysts

A. Technical Skills

Statistics & Probability: Understanding distributions, averages, correlations, and hypothesis testing.

Programming: Python, R, SQL for data manipulation and analysis.

Data Visualization: Ability to create clear and compelling visuals.

Database Management: Querying and maintaining data in databases.

Machine Learning (Advanced): Building predictive models.

B. Functional Skills

Analytical Thinking: Interpreting data, making sense of patterns, and drawing logical
conclusions.

Problem Solving: Tackling unique challenges in each analysis project.

Attention to Detail: Spotting errors, inconsistencies, and subtle trends in data.

C. Soft Skills

Curiosity: Asking the right questions and digging deeper into the data.

Tenacity: Persisting through challenges, especially when data is messy.

Creativity: Finding new ways to solve problems and visualize data.

Communication: Explaining technical findings to non-technical stakeholders.

Project Management: Organizing tasks, managing timelines, and collaborating with teams.

5. Types of Data Analytics

| Type | Purpose | Example Use Case |
|------|---------|------------------|
| Descriptive | Summarize past data | "What were last quarter's sales?" |
| Diagnostic | Explain why something happened | "Why did sales drop in March?" |
| Predictive | Forecast future outcomes | "What will sales be next quarter?" |
| Prescriptive | Suggest actions based on analysis | "Which marketing strategy should we use?" |

6. Example Workflow: Retail Sales Analysis


1. Collect Data: Gather sales, inventory, and customer data from databases and online
sources.

2. Clean Data: Remove duplicates, fix errors, and handle missing values.

3. Analyze Data: Use regression analysis to find factors affecting sales.

4. Visualize Results: Create a dashboard showing sales trends by product and region.

5. Report Insights: Present findings to management to inform inventory planning.

Visual: Retail Sales Analytics Pipeline


[Collect Data] → [Clean Data] → [Analyze Data] → [Visualize] → [Report]

7. Summary Table: Data Analyst Roles, Tools, and Skills

| Step | Role/Responsibility | Example Tool | Key Skill |
|------|---------------------|--------------|-----------|
| Data Collection | Gather data | SQL, Web Scraping | Database Management |
| Data Cleaning | Prepare data | Excel, Python | Attention to Detail |
| Data Analysis | Identify patterns | R, Python | Statistics, Analytical |
| Advanced Modeling | Predict outcomes | scikit-learn, H2O | Machine Learning |
| Visualization/Reporting | Present findings | Tableau, Power BI | Communication, Visualization |
| Documentation/Workflow | Track process | Workflow tools | Project Management |

8. Real-World Examples
Retail: Predicting which products will be popular during the holiday season using past
sales data.

Healthcare: Identifying which treatments are most effective by analyzing patient outcomes.

Finance: Detecting fraudulent transactions by analyzing spending patterns.

9. Visual Recap: The Data Analyst’s Toolkit

[Statistics] [Programming] [Visualization] [Databases] [Machine Learning]
      |            |              |             |              |
      +------------+------+-------+-------------+--------------+
                          |
                   [Data Analyst]

10. Conclusion

A data analyst is a detective for data, equipped with technical, analytical, and communication
skills. They play a crucial role in transforming raw data into actionable insights that drive
decisions in every industry—from retail and healthcare to finance and beyond.

Tip for Self-Learning:


Practice each step of the data analytics pipeline using real datasets and experiment with
different tools to build both technical and soft skills.

Data Analytics: Detailed Notes with Examples and Visuals

1. What is Data Analytics?


Data analytics is the process of inspecting, cleaning, transforming, and modeling data with
the goal of discovering useful information, drawing conclusions, and supporting decision-
making. It is widely used across industries to inform strategy, improve efficiency, and gain a
competitive edge.

2. The Data Analytics Process

A. Collecting Data

Description: Gather raw data from various sources (databases, surveys, sensors, web
logs).

Example: Netflix collects data on what users watch, when, on which device, and even
when they pause or stop a show.

Visual:


[User Activity] → [Netflix Database]

B. Cleaning Data

Description: Remove duplicates, fix errors, and handle missing values to ensure
accuracy.

Example: Removing duplicate sales entries or correcting inconsistent date formats in a sales dataset.

Visual:


Raw Data → [Remove Duplicates] → [Fix Errors] → Clean Data

C. Analyzing Data

Description: Use statistical and computational techniques to identify trends, patterns, and relationships.

Example: A retail company analyzes quarterly sales data to identify peak buying times
and popular products.

Visual:


[Clean Data] → [Statistical Analysis] → [Trends & Patterns]

D. Visualizing Data

Description: Present findings using charts, graphs, and dashboards to make insights
accessible and actionable.

Example: A health agency uses a map to show regions with high vaccination rates.

Visual:


[Analysis Results] → [Bar Chart / Map / Dashboard]

3. Types of Data Analytics

| Type | Purpose | Example Use Case |
|------|---------|------------------|
| Descriptive | Summarize past data | "What were last quarter's sales?" |
| Diagnostic | Explain why something happened | "Why did sales drop in March?" |
| Predictive | Forecast future outcomes | "What will sales be next quarter?" |
| Prescriptive | Suggest actions based on analysis | "Which marketing strategy should we use?" |

4. Common Data Analytics Techniques


Regression Analysis: Examines relationships between variables (e.g., how advertising
spend affects sales).

Factor Analysis: Reduces many variables into a few underlying factors (e.g., combining
multiple satisfaction measures into one score).

Cohort Analysis: Groups data into cohorts (e.g., customers who joined in the same
month) to study behaviors over time.

Cluster Analysis: Segments data into groups with similar characteristics (e.g., customer
segmentation for marketing).

Time-Series Analysis: Analyzes data points collected over time to identify trends and
forecast future values (e.g., predicting product demand).

Monte Carlo Simulations: Models the probability of different outcomes, often used for
risk assessment.
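The last technique can be sketched in a few lines: a Monte Carlo run that estimates the probability of a monthly loss. The revenue and cost ranges below are pure assumptions for illustration.

```python
import random

random.seed(0)  # reproducible toy run

def simulate_profit():
    """One simulated month: profit = uncertain revenue minus uncertain cost."""
    revenue = random.uniform(80.0, 120.0)
    cost = random.uniform(70.0, 110.0)
    return revenue - cost

trials = 10_000
losses = sum(1 for _ in range(trials) if simulate_profit() < 0)
loss_probability = losses / trials
print(round(loss_probability, 2))  # roughly 0.28 under these assumed ranges
```

Replacing the uniform draws with distributions fitted to historical data turns this toy into a usable risk-assessment tool.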

5. Data Visualization: Making Sense of Data


Definition:
Data visualization is the representation of data using visual tools like charts, graphs, and
maps. It helps people quickly identify patterns, trends, and outliers in large datasets.

Examples of Visualizations:

Bar Chart: Comparing sales across regions.

Line Chart: Showing sales trends over time.

Pie Chart: Displaying market share by product.

Map: Highlighting regional differences in vaccination rates.

Dashboard: Combining multiple visualizations for real-time insights.

Visual Example:


[Bar Chart: Sales by Quarter]

| Sales ($) |
| | ■
| | ■ ■
| | ■ ■
|___________|___________
Q1 Q2 Q3

6. Real-World Examples
Netflix: Uses data analytics to personalize viewing recommendations, driving over 75%
of viewer activity and increasing user engagement.

Retail: Analyzes sales data to optimize inventory, identify popular products, and plan
marketing campaigns.

Healthcare: Visualizes disease prevalence on maps to inform public health decisions.

Finance: Uses time-series and Monte Carlo simulations to forecast trends and manage
risk.

7. Summary Table: Data Analytics Workflow

| Step | Description | Example | Visual Tool |
|------|-------------|---------|-------------|
| Collect | Gather data from sources | Netflix user activity | Database |
| Clean | Remove errors and inconsistencies | Remove duplicate sales entries | Spreadsheet |
| Analyze | Identify patterns and relationships | Find peak shopping times | Statistical Model |
| Visualize | Present findings visually | Map of vaccinated regions | Charts, Maps |
| Report/Act | Share insights for decision-making | Recommend new marketing strategy | Dashboard |

8. Key Takeaways
Data analytics transforms raw data into actionable insights through a systematic
process.

Visualization is essential for communicating complex findings in an accessible way.

Techniques like regression, clustering, and time-series analysis are widely used to extract
value from data.

Real-world applications span entertainment, retail, healthcare, and finance.

Tip for Self-Learning:


Practice each step using real or sample datasets. Try creating your own charts using Excel,
Python (Matplotlib/Seaborn), or Tableau to reinforce your understanding.

Types of Data Analytics: Detailed Notes with Examples and Visuals

Overview

Data analytics is a systematic approach to extracting insights from data. There are several
types of data analytics, each answering different questions and using different methods.
Robust analytics projects often combine multiple types to solve real-world problems and
support decision-making.

1. Exploratory Data Analysis (EDA)


Purpose:
To explore and understand the basic characteristics, structure, and patterns in a dataset
before formal analysis.

Key Steps:

Gather data from multiple sources and formats.

Sort, review, and categorize data.

Standardize formats (e.g., unify date formats).

Perform basic cleaning (remove errors, handle missing values).

Example:
Suppose you have sales data from different regions, each using different date formats. EDA
involves converting all dates to a standard format, checking for missing entries, and getting
a sense of the data’s distribution.

Visual:


[Raw Data] → [Standardize Dates] → [Remove Errors] → [Clean, Organized Data]
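A first-look EDA pass like the one above can be as simple as counting rows, missing values, and the value range. The sales column here is a toy example:

```python
# Toy sales column with missing entries (None stands in for blanks)
sales = [200.0, None, 350.0, 275.0, None, 410.0]

n_missing = sum(1 for v in sales if v is None)
present = [v for v in sales if v is not None]

# A compact profile of the column before any formal analysis
summary = {
    "rows": len(sales),
    "missing": n_missing,
    "min": min(present),
    "max": max(present),
}
print(summary)
```

Profiles like this tell you what cleaning is needed (here, two missing values) before descriptive or predictive work begins.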

2. Descriptive Analytics
Purpose:
To answer “What happened?” by summarizing historical data and identifying basic trends
and patterns.

Typical Methods:

Calculating measures like mean, median, mode, max, and min.

Creating simple visualizations (bar charts, line graphs, pie charts).

Producing summary reports and dashboards.

Example:
A grocery store owner wants to know the highest bill generated on a given day. Descriptive
analytics finds the maximum transaction value from the day’s data.

Visual Example:
Bar Chart: Sales by Day


| Sales ($) |
| | ■
| | ■ ■
| | ■■ ■■
|___________|____________
Mon Tue Wed
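The grocery-store example above reduces to a single aggregate. A sketch with assumed bill amounts:

```python
# Descriptive analytics: summarize one day's bills (toy amounts)
bills = [250.0, 410.0, 125.0, 980.0, 305.0]

highest_bill = max(bills)                 # "What was the highest bill today?"
average_bill = sum(bills) / len(bills)    # a second descriptive summary

print(highest_bill, average_bill)
```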

3. Diagnostic Analytics
Purpose:
To answer “Why did it happen?” by digging deeper into the data to uncover causes and
relationships.

Typical Methods:

Hypothesis testing

Correlation analysis

Regression analysis

Example:
A retail owner notices low sales on Mondays. Diagnostic analytics tests if this is due to most
customers shopping on weekends, perhaps using regression or correlation between days of
the week and sales.

Visual Example:
Scatter Plot: Sales vs. Day of Week


[Scatter plot showing lower sales points on Mondays]
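A small sketch of the Monday diagnosis: group toy daily sales by weekday and find the weakest day (the figures are invented for illustration):

```python
import statistics

# Toy daily sales keyed by weekday; Mondays look low
sales_by_day = {
    "Mon": [120.0, 110.0, 130.0],
    "Sat": [300.0, 320.0, 310.0],
    "Sun": [290.0, 305.0, 295.0],
}

# Average sales per weekday, then pick the weakest
averages = {day: statistics.mean(vals) for day, vals in sales_by_day.items()}
weakest_day = min(averages, key=averages.get)
print(weakest_day, averages[weakest_day])
```

A real diagnostic step would follow up with a correlation or regression test to confirm the weekday effect is not noise.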

4. Predictive Analytics
Purpose:
To answer “What could happen?” by using historical data to forecast future outcomes.

Typical Methods:

Time series forecasting

Machine learning models

Regression analysis

Example:
Netflix uses predictive analytics to recommend shows to users based on their past viewing
behavior.

Visual Example:
Line Chart: Forecasted Sales


| Sales ($) |
| | /\
| | / \
| | / \
|___________|____________
Jan Feb Mar Apr (future)
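A deliberately simple forecasting sketch: a moving-average forecast over toy monthly sales. Real predictive analytics uses richer models (and Netflix's recommender is far more involved), but the past-predicts-future idea is the same.

```python
# Toy monthly sales history
monthly_sales = [100.0, 104.0, 110.0, 118.0, 121.0, 130.0]

def moving_average_forecast(series, k=3):
    """Forecast the next value as the mean of the last k observations."""
    window = series[-k:]
    return sum(window) / k

forecast = moving_average_forecast(monthly_sales)
print(forecast)  # (118 + 121 + 130) / 3
```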

5. Prescriptive Analytics

Purpose:
To answer “What should we do?” by recommending actions and weighing the pros and cons
of different options.

Typical Methods:

Optimization algorithms

Simulation models

Cost-benefit analysis

Example:
A retailer wants to choose between different ad campaigns. Prescriptive analytics estimates
the likely sales and costs of each, helping decide which campaign to run.

Visual Example:
Stacked Area Chart: Impact of Different Campaigns Over Time


[Stacked area chart showing cumulative sales from different campaigns]
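A cost-benefit sketch of the campaign decision: compare assumed expected sales lifts against assumed costs and pick the best net return. All names and numbers are hypothetical.

```python
# Prescriptive sketch: choose the campaign with the best expected net return
campaigns = {
    "social_ads": {"expected_sales": 12000.0, "cost": 3000.0},
    "email":      {"expected_sales": 7000.0,  "cost": 500.0},
    "tv_spot":    {"expected_sales": 20000.0, "cost": 15000.0},
}

def net_return(c):
    return c["expected_sales"] - c["cost"]

best = max(campaigns, key=lambda name: net_return(campaigns[name]))
print(best, net_return(campaigns[best]))
```

Production prescriptive systems add constraints (budget caps, channel limits) and uncertainty, which is where optimization and simulation methods come in.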

Common Data Visualization Techniques

| Chart Type | Purpose & Example Use Case |
|------------|----------------------------|
| Bar Chart | Compare sales across regions or categories. |

Summary Table: Types of Data Analytics

| Type | Key Question | Example Scenario | Typical Visuals |
|------|--------------|------------------|-----------------|
| Exploratory | What’s in the data? | Initial review of sales data | Histograms, Boxplots |
| Descriptive | What happened? | Find max daily sales | Bar, Line, Pie Charts |
| Diagnostic | Why did it happen? | Low sales on Mondays | Scatter, Heatmap |
| Predictive | What could happen? | Forecast next month’s sales | Line, Area Charts |
| Prescriptive | What should we do? | Choose best ad campaign | Stacked Area, Treemap |

Key Takeaways
Exploratory analysis is the foundation—always start here to understand your data.

Descriptive analytics summarizes what has happened; diagnostic analytics explains why.

Predictive analytics uses past data to forecast the future; prescriptive analytics
recommends actions.

Data visualization is essential for communicating findings—choose the right chart for
your question and data type.

Real-world analytics projects often combine several types for robust, actionable insights.

Tip for Self-Learning:


Practice each analytics type with real or sample datasets. Use spreadsheet tools or Python/R
to create visualizations and answer each type of analytics question for your chosen data.

Data Analytics Process
The Data Analytics Process: Detailed Notes with Examples and Visuals

1. Define the Problem and Desired Outcome


Purpose:
Start by clearly identifying the question you want to answer and the real-world decision you
want to influence.

Example:
A social media content creator wants to increase the number of views on their reels.

Current state: Data on past reels’ performance.

Desired outcome: More views or clicks.

Visual:

[Where am I now?]  →  [Where do I want to be?]
 (Current data)        (Goal/Metric)

2. Set a Clear, Measurable Metric


Purpose:
Decide what you will measure and how you will measure it. This ensures your project has a
concrete target.

Example:
“I want to increase the number of views on my reels by 5% in the next month.”

Visual:

[Metric: Number of Views]
[Target: +5% in 1 month]

3. Gather and Integrate Data


Purpose:
Collect all relevant data from various sources. Data may come in different formats and need
to be brought together.

Example:
A content creator gathers engagement data (likes, views, comments) from Instagram,
YouTube, and Facebook.

Visual:

[Instagram]   [YouTube]   [Facebook]
      \           |           /
       \          |          /
    [Combine into Unified Dataset]

4. Clean and Prepare Data


Purpose:
Fix errors, remove duplicates, and standardize formats to ensure accurate analysis.

Example:

Convert all date formats to DD-MM-YYYY.

Remove duplicate entries for the same reel.

Fill in missing values or decide how to handle them.

Visual:

[Raw Data] → [Remove Errors] → [Standardize Formats] → [Clean Data]
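A minimal Python sketch of these three rules — standardize dates to DD-MM-YYYY, drop duplicates, and handle missing values (here, an assumed policy of treating missing views as zero):

```python
from datetime import datetime

# Raw records with mixed date formats, a duplicate, and a missing value (illustrative)
raw = [
    {"reel_id": 1, "posted": "2025/04/23", "views": 500},
    {"reel_id": 1, "posted": "2025/04/23", "views": 500},   # duplicate entry
    {"reel_id": 2, "posted": "23-04-2025", "views": None},  # missing views
]

def to_ddmmyyyy(value):
    # Try each known input format and standardize to DD-MM-YYYY
    for fmt in ("%Y/%m/%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

clean, seen = [], set()
for row in raw:
    key = (row["reel_id"], row["posted"])
    if key in seen:
        continue                      # remove duplicate entries
    seen.add(key)
    row["posted"] = to_ddmmyyyy(row["posted"])
    row["views"] = row["views"] or 0  # missing-value policy: treat as zero
    clean.append(row)
```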

5. Analyze the Data


Purpose:
Use statistical techniques to identify patterns, trends, and relationships.

Example:

Calculate average daily views.

Identify which days of the week get the most views.

Look for correlations (e.g., do reels with hashtags get more views?).

Visual:

Bar Chart: Average Views by Day


| Views |
| | ■
| | ■ ■
| | ■■ ■■
|_______|____________
Mon Tue Wed

Line Chart: Views Over Time


| Views |
| /\ /
| / \/
|_/______ Time
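The "average views by day" calculation behind such a chart needs only the standard library; the (day, views) pairs below are made up:

```python
from collections import defaultdict

# (day_of_week, views) pairs for past reels — illustrative numbers
records = [("Mon", 100), ("Mon", 140), ("Sat", 320), ("Sat", 300), ("Sun", 280)]

totals, counts = defaultdict(int), defaultdict(int)
for day, views in records:
    totals[day] += views
    counts[day] += 1

# Average views per day, and the day with the highest average
avg_views = {day: totals[day] / counts[day] for day in totals}
best_day = max(avg_views, key=avg_views.get)
```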

6. Interpret the Results

Purpose:
Make sense of the findings. Link the results back to your original question and goal.

Example:

Highest views occur on weekends.

Reels posted with trending hashtags get 20% more views.

Visual:


[Analysis] → [Insight: Post on weekends for more views]

7. Present and Communicate Insights


Purpose:
Share your findings in a clear, impactful way—often using visualizations or dashboards.

Example:

Create a dashboard showing daily views, best-performing reels, and engagement trends.

Recommend posting new reels on weekends and using trending hashtags.

Visual:

Dashboard Example:


+-------------------------------+
| [Line Chart: Views Over Time]|
| [Bar Chart: Views by Day] |
| [Table: Top Performing Reels]|
+-------------------------------+

8. Take Action and Measure Impact

Purpose:
Implement recommendations, then track if your actions lead to the desired outcome.

Example:

Start posting reels on weekends.

After a month, check if views increased by 5%.

Visual:


[Action Taken] → [Measure New Data] → [Compare to Target]
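Comparing the new numbers against the target is a one-line calculation; the monthly totals below are invented for illustration:

```python
# Monthly view totals before and after acting on the insights (illustrative)
views_before = 20_000
views_after = 21_300
target_lift = 0.05  # the +5% goal set at the start

lift = (views_after - views_before) / views_before
goal_met = lift >= target_lift  # True if the measured lift meets the target
```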

Summary Table: Data Analytics Workflow


Step                  Purpose                               Example                          Visual Tool

Define Problem/Goal   Set clear question & outcome          Increase reel views              Goal Diagram
Set Metric            Choose what/how to measure            +5% views in 1 month             KPI Target
Gather Data           Collect from all sources              Instagram, YouTube, Facebook     Data Flow Chart
Clean Data            Fix errors, standardize               Remove duplicates, fix dates     Data Pipeline
Analyze Data          Find patterns, trends                 Views by day, hashtag effect     Charts/Graphs
Interpret Results     Link findings to question             Weekends best for posting        Insights Box
Present Insights      Communicate with visuals/dashboards   Dashboard, report                Dashboard
Take Action/Measure   Implement & track impact              Post on weekends, check results  Before/After

Real-World Example: Netflix Recommendations
Data Collected: What users watch, when, device used, pauses, ratings, searches.

Analytics Used: Patterns in viewing behavior to personalize recommendations.

Impact: Over 75% of viewer activity is driven by these data-driven suggestions, boosting
user engagement and business success.

Visual:

[User Data] → [Pattern Analysis] → [Personalized Recommendations] → [Increased Viewing]

The Role of Data Visualization


Purpose: Make insights accessible and actionable.

Tools: Tableau, Power BI, Excel, Python (Matplotlib/Seaborn).

Visuals Used: Bar charts, line graphs, scatter plots, dashboards.

Key Principle: Good visuals tell a clear story and highlight what matters most.

Key Takeaways
The data analytics process is stepwise: define the problem, set measurable goals, gather
and clean data, analyze, interpret, present, and act.

Each step builds on the previous, ensuring reliable and actionable insights.

Visualization is crucial for communicating findings and driving real-world decisions.

Real-world examples like Netflix and retail analytics show the power of this process in
action.

Tip for Self-Learning:
Practice each step with a small project: pick a question, collect some data (even from your
own social media or daily activities), and walk through the process, using charts and
dashboards to communicate your findings.

Types of Data in Data Analytics: Detailed Notes with Examples and Visuals

1. Data Classification by Structure


Data can be classified based on its structure into three main types:

A. Structured Data

Definition: Data organized in a rigid, predefined format—typically rows and columns.

Examples:

Spreadsheets (Excel, Google Sheets)

Relational databases (SQL)

Survey forms

GPS datasets (latitude/longitude in tables)

Features:

Easy to filter, sort, and analyze using statistical tools.

Numeric and some text fields.

Visual:


+--------+-------+-------+
| Name | Age | City |
+--------+-------+-------+
| Alice | 23 | Pune |
| Bob | 35 | Delhi |
+--------+-------+-------+

B. Semi-Structured Data

Definition: Data that has some organizational properties (like tags or hierarchies) but
does not fit rigid tables.

Examples:

Emails (structured headers + unstructured body text)

JSON, XML files

Logs with consistent fields but variable content

Product catalogs with varying attributes

Features:

Mix of structured and unstructured elements.

Often used for data collected from multiple sources.

Visual:


{
"id": "P001",
"name": "Gaming Monitor",
"specifications": {
"screen_size": "27 inch",
"refresh_rate": "165Hz"
}
}
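Python's built-in json module can load such a record; the tags (keys) give structured access even though there is no rigid table schema:

```python
import json

# The semi-structured product record above, as it might arrive from an export
raw = """{
  "id": "P001",
  "name": "Gaming Monitor",
  "specifications": {"screen_size": "27 inch", "refresh_rate": "165Hz"}
}"""

product = json.loads(raw)
# Nested keys can be traversed directly, without a fixed table layout
refresh_rate = product["specifications"]["refresh_rate"]
```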

C. Unstructured Data

Definition: Data without a predefined structure; cannot be neatly arranged in tables.

Examples:

Images, videos, audio files

Social media posts, tweets, Facebook feeds

Web pages, blog posts

Features:

Requires advanced processing (NLP for text, image processing for visuals).

Cannot be directly analyzed with standard statistical tools.

Visual:


[Image File: X-ray]


[Video Clip: Security footage]
[Text: "Great product! Will buy again."]

2. Data Classification by Content

A. Numeric (Quantitative) Data

Definition: Data that can be measured or counted.

Examples:

Age of users

Prices of goods

Sales/revenue data

Stock prices

Temperature, rainfall

Features:

Suitable for statistical analysis (mean, median, standard deviation).

Often found in structured formats.

Visual:


+-------+--------+
| Item | Price |
+-------+--------+
| Pen | 10 |
| Book | 150 |
+-------+--------+

B. Text (Qualitative) Data

Definition: Data in the form of words, sentences, or paragraphs.

Examples:

Customer reviews

Social media posts

Emails, memos

Blog posts

Features:

Requires Natural Language Processing (NLP) for analysis.

Used for sentiment analysis, topic modeling, etc.

Visual:


"The delivery was fast and the product quality is excellent!"

C. Visual Data

Definition: Data in the form of images, videos, or other visual formats.

Examples:

Product photos

X-ray images, CT scans

Videos from surveillance or user-generated content

Features:

Requires image/video processing tools.

Used in quality control, diagnostics, self-driving cars, etc.

Visual:

[Image: Product photo]
[Video: Assembly line footage]

3. Data Types in Statistics

Data Type    Description                                Example Values        Suitable Visualizations

Nominal      Categories with no order                   Red, Blue, Green      Bar chart, Pie chart
Ordinal      Categories with a natural order            Low, Medium, High     Bar chart, Ordered bar chart
Discrete     Countable numbers, no fractions            1, 2, 3, 4 students   Bar chart, Histogram
Continuous   Measurable quantities, can take any value  1.5m, 2.3kg, 37.5°C   Histogram, Boxplot, Line chart
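A short Python sketch of which summary statistic suits each type, using the standard statistics module (the values are illustrative):

```python
import statistics

# One small column per statistical data type (illustrative values)
nominal = ["Red", "Blue", "Red", "Green"]       # categories, no order
ordinal = ["Low", "High", "Medium", "High"]     # ordered categories
continuous = [1.5, 2.3, 1.9]                    # measurable quantities

most_common = statistics.mode(nominal)          # nominal: summarize by frequency
rank = ["Low", "Medium", "High"].index          # ordinal: sort by rank, take the median position
median_level = sorted(ordinal, key=rank)[len(ordinal) // 2]
mean_value = statistics.mean(continuous)        # continuous: the mean is meaningful
```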

4. Common Data Visualizations


Boxplots: Show distribution, median, quartiles, and outliers for numeric data.
Example: Comparing exam scores across classes.

Scatterplots: Show relationships between two variables (e.g., height vs. weight).

Line Graphs: Show trends over time (e.g., monthly sales).

Area Charts: Show cumulative totals over time (e.g., sales volume by region).

Bar Charts: Compare quantities across categories (e.g., sales by product).

Pie Charts: Show proportions of a whole (e.g., market share by brand).

Visual Example: Line Graph for Sales Over Time

| Sales |
|   /\   /
|  /  \ /
|_/____\/___ Time

5. Real-World Scenarios
Structured Data:
Retail sales spreadsheet – Easily filter to find top-selling products.

Semi-Structured Data:
Email dataset – Analyze sender/receiver patterns, but email body needs text processing.

Unstructured Data:
Social media images – Use image recognition to identify brand logos in posts.

6. Key Takeaways
Data can be structured, semi-structured, or unstructured based on organization.

Content-wise, data can be numeric, text, or visual.

The structure and content of your data determine which analysis and visualization
methods you can use.

Understanding your data type is the first step in any analytics project.

Tip for Self-Learning:


Practice classifying datasets you encounter (spreadsheets, emails, images) by their structure
and content. Try visualizing simple numeric datasets using bar or line charts in Excel or
Google Sheets to reinforce your understanding.
Types of Data in Data Analytics: Detailed Notes with
Examples and Visuals

1. What is Data?
Data is any information—facts, statistics, or observations—that can help us make decisions
or uncover insights. In data analytics, understanding the type, structure, and format of your
data is a crucial first step before any analysis.

2. Main Types of Data (Statistical Classification)


Data can be categorized into four main statistical types:

A. Nominal Data (Categorical, No Order)

Definition: Data that represents categories or groups without any inherent order or
ranking.

Examples:

Gender: Male, Female, Other

Nationality: Indian, American, British

Location: Urban, Rural, City, Town, State

How to Analyze:

Frequency tables

Bar charts

Visual Example:

Bar Chart: Number of Customers by Nationality

India | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
USA   | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
UK    | ■■■■■■■■■■■■■■■■■■■
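The frequency table behind a bar chart like this can be built with collections.Counter (the data is illustrative):

```python
from collections import Counter

# Nominal data: nationality of customers (illustrative)
nationalities = ["Indian", "American", "Indian", "British", "Indian", "American"]

# A frequency table is the standard summary for nominal data
freq = Counter(nationalities)
largest_group = freq.most_common(1)[0][0]
```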

B. Ordinal Data (Categorical, Ordered)

Definition: Data that represents categories with a clear order or ranking, but the
differences between the ranks are not necessarily equal or quantifiable.

Examples:

Customer satisfaction: Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied

Education level: High School, Bachelor’s, Master’s, PhD

How to Analyze:

Median value (middle rank)

Bar or pie charts

Visual Example:


Pie Chart: Customer Satisfaction Ratings


[Very Satisfied] [Satisfied] [Neutral] [Unsatisfied] [Very Unsatisfied]

C. Discrete Data (Quantitative, Countable)

Definition: Numeric data that can only take specific, separate values (usually whole
numbers).

Examples:

Number of employees: 5, 10, 15

Number of cars in a parking lot: 12, 20, 30

How to Analyze:

Mode (most frequent value)

Histogram or bar chart

Visual Example:

Histogram: Number of Customers Per Day

 5 customers | ■■■■■■
10 customers | ■■■■■■■■■■■
15 customers | ■■■■■■■

D. Continuous Data (Quantitative, Measurable)

Definition: Numeric data that can take any value within a range, including fractions and
decimals.

Examples:

Height: 1.75 m

Temperature: 22.5°C

Time: 3.6 seconds

How to Analyze:

Mean (average), standard deviation

Histograms, scatter plots, line charts

Visual Example:


Line Chart: Temperature Over a Week


| Temp (°C)
| /\
| / \
|___/____\____ Days
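For continuous data, the mean and standard deviation are the natural summaries. A sketch with made-up temperatures:

```python
import statistics

# A week of temperatures in °C — continuous data can take any value in a range
temps = [21.5, 22.0, 22.5, 23.5, 24.0, 22.5, 21.0]

mean_temp = statistics.mean(temps)
stdev_temp = statistics.stdev(temps)  # sample standard deviation
```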

3. Quantitative vs. Qualitative Data


Quantitative Data: Numeric, measurable (includes both discrete and continuous data)

Examples: Age, salary, number of purchases

Qualitative Data: Non-numeric, descriptive (often text or categories)

Examples: Feedback comments, colors, product types

Visual Table:

Type          Numeric?   Example

Quantitative  Yes        25 years, 5 units
Qualitative   No         "Blue", "Satisfied"

4. Data Structure Types

A. Structured Data

Definition: Organized in rows and columns (tables, spreadsheets).

Examples: Excel files, SQL databases.

Visual Example:


+--------+-------+--------+
| Name | Age | City |
+--------+-------+--------+
| Alice | 23 | Pune |
| Bob | 35 | Delhi |
+--------+-------+--------+

B. Semi-Structured Data

Definition: Has some structure but not rigid tables (uses tags or keys).

Examples: JSON, XML, emails.

Visual Example:


{
"name": "Alice",
"age": 23,
"city": "Pune"
}

C. Unstructured Data

Definition: No predefined structure.

Examples: Images, videos, free-form text, PDFs.

Visual Example:


[Image: Product Photo]


[Text: "Great product!"]
[PDF: Invoice]

5. Common Data File Formats


Delimited Text Files: CSV (comma-separated), TSV (tab-separated)

Spreadsheets: Excel (.xlsx), Google Sheets

Markup Languages: XML

PDFs: Portable Document Format, often for reports/invoices

JSON: JavaScript Object Notation, widely used for web data

Visual Example: CSV File


Name,Age,City
Alice,23,Pune
Bob,35,Delhi
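The same CSV content can be parsed with Python's built-in csv module; note that every field arrives as text, so numeric columns must be converted:

```python
import csv
import io

# The CSV content above, parsed with the standard library
csv_text = "Name,Age,City\nAlice,23,Pune\nBob,35,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores everything as text; convert numeric fields explicitly
ages = [int(row["Age"]) for row in rows]
```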

6. How Type and Format Affect Analysis


Nominal/Ordinal: Use frequency tables, bar/pie charts

Discrete/Continuous: Use histograms, line charts, calculate averages or modes

Structured Data: Easily analyzed with SQL, Excel

Unstructured Data: Needs text/image processing tools

7. Real-World Example: Netflix Data Analytics
Netflix collects a variety of data types:

Structured: Viewing logs (user, show, time watched)

Semi-Structured: Device info (JSON format)

Unstructured: User reviews (text), pause/resume behavior

Netflix uses this data to:

Build user profiles (nominal/ordinal data)

Track watch time (continuous data)

Recommend shows (cluster analysis, time-series analysis)

Visual: Netflix Data Flow

[User Activity] → [Data Collection] → [Cleaning & Structuring] → [Analysis & Recommendations]

8. Summary Table: Data Types and Examples

Data Type     Example               Visual Tool

Nominal       Gender, Nationality   Bar Chart
Ordinal       Satisfaction Rating   Pie/Bar Chart
Discrete      Number of Purchases   Histogram/Bar Chart
Continuous    Temperature, Height   Line/Scatter Plot
Qualitative   Comments, Colors      Word Cloud
Quantitative  Age, Salary           Statistical Summary

9. Key Takeaways
Understanding data types (nominal, ordinal, discrete, continuous) is essential for correct
analysis and visualization.

Data can also be classified by structure (structured, semi-structured, unstructured) and by content (quantitative, qualitative).

The type and format of data determine which tools and statistical methods you should
use.

Real-world analytics projects, like those at Netflix, combine multiple data types and
formats to generate insights and drive decisions.

Tip for Self-Learning:


Practice by classifying data you encounter—try making bar charts for nominal data,
histograms for discrete data, and line charts for continuous data using spreadsheet tools or
Python libraries.

Detailed Notes: Common Sources and Types of Data in Data Analytics

1. Introduction: The Explosion of Data Sources


The digital era has revolutionized the variety and volume of data sources available for
analysis.

Data is no longer just numbers in spreadsheets; it now includes text, images, sensor
readings, and more, reflecting human behavior, choices, and experiences.

2. Common Sources of Data

A. Relational Databases

Definition: Organized collections of structured data stored in tables (rows and columns)
with defined relationships.

Examples:

Customer transactions

Business activities

Human resource records

Tools: SQL, Oracle, MySQL, IBM DB2

Use Case:

Linking customer purchase data with product details and payment methods to
analyze sales or plan promotions.

Visual:

+-------------+        +-------------+
| Customer ID | -----> |  Purchases  |
+-------------+        +-------------+

[Raw Data] --(Processing)--> [Insights]

E. Data Streams and Feeds

Definition: Continuous, real-time flows of data from various sources.

Examples:

IoT devices (smart sensors, GPS)

Social media feeds (live posts, trending hashtags)

Financial tickers (stock prices)

Retail transaction streams

Surveillance video feeds

Tools: Apache Kafka, Spark, Storm

Use Case:

Real-time monitoring of machine health or social media trends.

Visual:


[Sensor/Feed] → [Data Stream] → [Real-Time Analytics]

F. Satellite Imagery

Definition: Images and data captured from satellites for various analyses.

Examples:

Weather forecasting

Tracking deforestation or urban growth

Nighttime imagery for economic activity

Use Case:

Using satellite data to predict rainfall for agricultural planning.

Visual:


[Satellite] → [Image Data] → [Processed Insights]

3. Types of Data Sources by Ownership

A. First-Party Data

Definition: Data collected directly by an organization from its own customers or operations.

Examples:

Website analytics

Transaction records

Customer feedback surveys

B. Third-Party Data

Definition: Data collected by external organizations or agencies.

Examples:

Market research reports

Government census data

Economic indicators

C. Intentionally Collected Data

Definition: Data gathered specifically for a project, often via surveys, interviews, or
observations.

Tools: SurveyMonkey, Google Forms

4. Population vs. Sample Data


Population Data: The entire set of data you wish to study (e.g., all customers).

Sample Data: A representative subset of the population, used when collecting all data is
impractical.

Importance: Sampling saves time and resources but must be done carefully to avoid
bias and ensure representativeness.

Visual:


[Population: All Customers]


↓ (Sampling)
[Sample: 500 Customers]
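Drawing a simple random sample is a one-liner with the standard library; the population here is just a list of made-up customer IDs:

```python
import random

random.seed(7)  # fixed seed so the example is reproducible

population = list(range(1, 10_001))      # e.g. IDs for all 10,000 customers
sample = random.sample(population, 500)  # random sample of 500, no repeats
```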

5. Data Repositories: Organizing and Storing Data


Data Repository: A centralized storage system for organizing, categorizing, and
preserving data for analysis.

Types:

Databases (structured)

Data warehouses

Data lakes (for unstructured/semi-structured data)

Importance: Efficient data storage and retrieval streamline the analytics process and
support troubleshooting.

Visual:


[Multiple Sources] → [Data Repository] → [Analysis & Reporting]

6. Databases: Structured vs. Non-Relational

A. Relational Databases

Structure: Tables with rows and columns, linked by common identifiers.

Advantages:

Minimize redundancy

Enforce data consistency

Efficient querying (SQL)

Limitations:

Not suitable for unstructured or semi-structured data

Migration requires matching schemas

B. Non-Relational Databases

Structure: Schema-less, can store semi-structured or unstructured data (e.g., NoSQL).

Advantages:

Handle diverse, large, and rapidly changing data

Built for speed, flexibility, and scale

7. Real-World Example: Retail Store Relational Database


Tables:

1. Purchases: Customer ID, Product ID, Purchase Date

2. Products: Product ID, Name, Price

3. Customers: Customer ID, Name, Payment Method

How It Works:

Link tables via Product ID and Customer ID to analyze sales, customer behavior, and
product performance.

Visual:

Purchases Table        Products Table         Customers Table
+------+------+        +------+--------+      +------+------+
|CustID|ProdID|        |ProdID| Name   |      |CustID| Name |
+------+------+        +------+--------+      +------+------+
| 101  | 453  |        | 453  | Banana |      | 101  | Amit |
+------+------+        +------+--------+      +------+------+
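The linking can be tried out with Python's built-in sqlite3 module. The schema below mirrors the three tables above, but the exact table and column names are illustrative, not a prescribed design:

```python
import sqlite3

# In-memory database with the three retail tables (illustrative rows)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (cust_id INTEGER, name TEXT);
CREATE TABLE products  (prod_id INTEGER, name TEXT, price REAL);
CREATE TABLE purchases (cust_id INTEGER, prod_id INTEGER);
INSERT INTO customers VALUES (101, 'Amit');
INSERT INTO products  VALUES (453, 'Banana', 10.0);
INSERT INTO purchases VALUES (101, 453);
""")

# Link the tables via their shared IDs to see who bought what, at what price
row = con.execute("""
SELECT c.name, p.name, p.price
FROM purchases pu
JOIN customers c ON c.cust_id = pu.cust_id
JOIN products  p ON p.prod_id = pu.prod_id
""").fetchone()
```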

8. Key Takeaways
Data sources are more diverse than ever: databases, files, APIs, web scraping, streams,
satellite imagery, and more.

Organizing data in repositories (databases, warehouses, lakes) is crucial for efficient analytics.

Relational databases are ideal for structured data; non-relational databases handle
variety and volume.

Sampling is often necessary for practical analysis—ensure samples are representative.

The right choice of data source and storage depends on your data’s structure, intended
use, and analysis needs.

Self-Learning Tip:
Practice identifying and categorizing data sources in your daily life (e.g., your bank’s app,

online shopping, or social media), and sketch simple diagrams to visualize how data flows
from source to analysis.

Data Repositories in Data Analytics: Detailed Notes with Examples and Visuals

1. Introduction: Why Data Repositories Matter


Data repositories are centralized storage solutions that allow you to collect, organize,
and manage data from various sources and formats.

Choosing the right repository is crucial for efficient data analysis, reporting, and
decision-making.

2. Types of Data Repositories

A. Relational Databases

Definition: Store data in structured tables with rows and columns; relationships between
tables are defined using keys.

Examples: MySQL, PostgreSQL, Oracle, SQL Server.

Best for: Structured data with clear relationships (e.g., customer orders, inventory).

Visual:

+-----------+        +-----------+
| Customers |        |  Orders   |
+-----------+        +-----------+
| CustID    |<------>| CustID    |
| Name      |        | OrderID   |
+-----------+        +-----------+

B. Non-Relational (NoSQL) Databases

Definition: Designed for flexibility, scale, and speed; handle semi-structured or
unstructured data.

Types:

Key-Value Store: Data stored as key-value pairs (e.g., Redis, Memcached).

Example: User session data where each user ID (key) maps to session info
(value).

Visual:


Key: User123 → Value: {session_start: "12:00", cart: ["item1", "item2"]}

Document Store: Each record is a document (e.g., MongoDB).

Example: E-commerce order stored as a JSON document.

Visual:


{
"order_id": 101,
"items": ["Pen", "Notebook"],
"total": 150
}

Column Store: Data stored in columns rather than rows (e.g., Cassandra).

Example: Time-series sensor data, where each column family holds readings for
a device.

Visual:

DeviceID | 2025-04-23 | 2025-04-24 | ...
-----------------------------------------
D001     | 23°C       | 24°C       | ...

Graph Store: Data as nodes and edges to represent relationships (e.g., Neo4j).

Example: Social network where nodes are users and edges are friendships.

Visual:


[Alice]---(friend)---[Bob]
| |
(likes) (follows)
| |
[Post1] [Charlie]

Best for: High-volume, high-velocity, and varied data formats (e.g., social media, IoT,
product catalogs).

C. Data Warehouses

Definition: Large, centralized repositories that consolidate cleaned and structured data
from multiple sources for analysis and reporting.

Features:

Use ETL (Extract, Transform, Load) to process and store data.

Store both recent and historical data.

Examples: Amazon Redshift, Google BigQuery, Snowflake.

Best for: Company-wide analytics, business intelligence, and historical trend analysis.

Visual:

[Sales DB]   [Marketing DB]   [Web Logs]
      \            |             /
       \           |            /
    [Data Warehouse: Clean, Structured Data]

D. Data Marts

Definition: Subsections of a data warehouse focused on a specific business area or user group.

Features:

Contains only relevant data for a particular department or project.

Offers isolated security and performance.

Example: A sales data mart for the sales team, containing only sales-related data.

Visual:


[Data Warehouse]
|
[Sales Data Mart]
[HR Data Mart]
[Finance Data Mart]

E. Data Lakes

Definition: Storage repositories that can hold vast amounts of raw data in its original
format—structured, semi-structured, or unstructured.

Features:

Schema-on-read: Data is structured as needed during analysis.

Can store logs, images, videos, sensor data, etc.

Each data element has a unique identifier and a tag.

Examples: Amazon S3, Azure Data Lake, Hadoop.

Best for: Big data, machine learning, and scenarios where you want to retain all data for
future use.

Visual:

[Raw Data: CSV, JSON, Images, Videos]
                 ↓
[Data Lake: All Formats, Tagged]

3. Choosing the Right Repository

Repository Type      Structure    Best For                                  Example Use Case

Relational Database  Structured   Transactional data, clear relationships   Customer orders, inventory management
NoSQL Database       Flexible     Big data, varied formats, fast access     Social media, IoT, product catalogs
Data Warehouse       Structured   Analytics, reporting, historical trends   Company-wide BI, sales trend analysis
Data Mart            Structured   Department-specific analytics             Sales team reports, marketing analysis
Data Lake            All formats  Big data, raw/unprocessed data            Machine learning, storing all raw logs

4. Visual Summary


+-------------------+ +------------------+ +-----------------+


| Relational DB | | Data Warehouse | | Data Lake |
| (Tables) | | (Cleaned, | | (All Formats, |
| | | Structured Data) | | Raw Data) |
+-------------------+ +------------------+ +-----------------+
\ | /
\ | /
+----------------------+------------------------+
|
[Analytics & Reporting]

5. Key Takeaways
Relational databases are best for structured, linked data.

NoSQL databases offer flexibility for varied, high-volume data.

Data warehouses centralize cleaned data for analysis; data marts focus on specific
needs.

Data lakes store all types of data in their native formats, ideal for big data and advanced
analytics.

The choice of repository depends on your data’s structure, volume, and intended use.

Tip for Self-Learning:


Practice identifying which repository would best fit different data scenarios (e.g., storing
social media posts, historical sales data, or raw sensor feeds). Sketch diagrams to visualize
how data flows from sources into repositories and then into analytics tools.

Data Privacy, Security, and Ethics in Data Analytics: Detailed Notes with Examples and Visuals

1. Introduction
As data analytics becomes central to business and research, handling data responsibly is
critical. This involves not just technical skills, but also understanding data privacy, data
security, and data ethics. These principles ensure individuals’ rights are protected and
organizations use data in trustworthy, lawful, and ethical ways.

2. Data Privacy
Definition:
Data privacy refers to the principles and regulations governing how personal data is
collected, used, stored, shared, and handled.

Why It Matters:

Prevents identity theft and misuse of personal information

Builds consumer trust and brand reputation

Ensures compliance with laws (e.g., GDPR, CCPA, DPDP)

Promotes ethical use of data

When Is It a Concern?

When data can be used to uniquely identify an individual (not just aggregate statistics)

Types of Data with Privacy Concerns:

Personally Identifiable Information (PII): Data that can directly identify a person (e.g.,
name, phone number, email, Aadhaar/PAN number, date of birth)

Personal Information: Includes PII and other data that can be linked to a
person/household (e.g., IP address, geolocation, case numbers)

Sensitive Information: Data that, if leaked, can cause harm (e.g., genetic data,
political/religious beliefs)

Visual Example:


+---------------------+---------------------+---------------------+
| PII | Personal Info | Sensitive Info |
|---------------------|---------------------|---------------------|
| Name, Email, | IP Address, | Genetic Info, |
| Phone, Aadhaar | Geolocation, | Political Beliefs, |
| | Video w/ Face | Religion |
+---------------------+---------------------+---------------------+

3. Techniques for Protecting Data Privacy

A. Data Anonymization

Definition: Modifying data to make it impossible or very difficult to identify individuals.

Methods:

Removing PII: Replace names with unique IDs.

Example Table:

Before:                        After (names replaced with IDs):
| Name | Age | City   |        | ID    | Age | City   |
|------|-----|--------|        |-------|-----|--------|
| ...  | 25  | Delhi  |   →    | ID123 | 25  | Delhi  |
| ...  | 32  | Mumbai |        | ID124 | 32  | Mumbai |

Generalization: Reduce detail (e.g., replace full address with city, exact date of birth
with age).

Example: “123 Main St, Mumbai” → “Mumbai”

Perturbation: Add small, random changes (“noise”) to data so exact values are
hidden but overall trends remain.

Example: Salary values are slightly altered for privacy, but average remains
accurate.

Data Masking: Hide part of sensitive data but keep its structure.

Example: Show only last 2 digits of Aadhaar: “XXXX-XXXX-12”

B. Data Masking vs. Anonymization

Anonymization: Irreversible; makes re-identification impossible.

Masking: Reversible; hides data but can be restored if needed.

Visual:


Original: 1234-5678-9012
Masked: XXXX-XXXX-9012
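A masking helper along these lines — a sketch, not a production routine — hides every digit except the last few while preserving separators:

```python
def mask_id(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with 'X', keeping separators."""
    chars = list(value)
    # Indices of all digits except the last `visible` ones
    for i in [i for i, c in enumerate(chars) if c.isdigit()][:-visible]:
        chars[i] = "X"
    return "".join(chars)

masked = mask_id("1234-5678-9012")  # → "XXXX-XXXX-9012"
```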

C. Tools for Anonymization

ARX: Open-source tool for large datasets

IBM Guardium: Protects sensitive data across environments

Google TensorFlow Privacy: Privacy for machine learning models

4. Data Security
Definition:
Protecting data from unauthorized access, breaches, and misuse.

How to Ensure Data Security:

Collect only necessary data (minimize risk)

Restrict access: Only authorized team members can view/use data

Incident response plan: Have protocols for breaches or hacks

Encryption: Store and transmit data securely

Visual:


[Data Repository]
|
[Access Controls]
|
[Only authorized users]

5. Data Ethics
Definition:
Moral principles guiding how organizations collect, use, and share data.

Key Principles:

Transparency: Clearly communicate data practices to users

Accountability: Take responsibility for data handling and breaches

Individual Agency: Allow people to access, correct, or delete their data

Privacy: Protect personal data from unauthorized exposure

Visual:

            [Ethical Data Use]
           /        |         \
Transparency  Accountability  Privacy

6. Legal Regulations
GDPR (Europe): Strict rules on consent and data use

CCPA (California): Consumer rights over personal data

DPDP (India): Digital Personal Data Protection Act

Always check the relevant law for your region and data type.

7. Real-World Example: Netflix


Netflix collects vast user data (what you watch, when, device, searches).

Privacy: They anonymize and aggregate data for recommendations.

Security: Only authorized staff can access raw user data.

Ethics: Users can control their profiles and viewing history.

8. Summary Table

Aspect         What It Means                    Example/Technique

Data Privacy   Protecting personal info         Anonymization, Masking
Data Security  Preventing unauthorized access   Access control, Encryption
Data Ethics    Using data responsibly           Transparency, Consent

9. Key Takeaways
Data privacy is about protecting individual identities in your data.

Data security ensures only authorized people can access and use data.

Data ethics guides you to use data in a fair, transparent, and responsible way.

Use anonymization and masking to protect privacy.

Always comply with relevant data protection laws.

Tip for Self-Learning:


Whenever you work with real datasets, practice anonymizing sensitive fields, set up access
controls, and reflect on how you would explain your data practices to the people whose data
you hold. Sketch diagrams to visualize data flows and privacy protections.

Detailed Notes: Identifying, Gathering, and Importing Data in Data Analytics

1. The First Step: Identifying the Data You Need


Purpose:
Before any analysis, you must clearly define what data you need—this is guided by the main
question or problem statement of your project.

How to Identify Data Needs:

Start with your question: What problem are you trying to solve?

Determine required information: What specific details will help answer your question?

Be precise: Move from a vague idea to a detailed list of variables and data sources.

Example:
A retail company wants to run a targeted ad campaign for the festive season to boost sales.

Potential data needed:

Customer profiles (age, gender, location, purchase history)

Website visits and product views

Customer satisfaction survey responses

Customer complaints

Social media mentions and sentiment

Visual:

[Business Question] → [List of Needed Data Types]

2. Locating and Planning Data Collection


Key Questions:

Where does each type of data reside?

Who collects and maintains it?

How often is it updated (timeline)?

How much data is needed (sample size)?

Example Table: Data Types and Sources

Data Needed             Likely Source           Update Frequency

Customer profiles       Internal database       On purchase/signup
Website visits/views    Web analytics tools     Real-time/Hourly
Survey responses        Collected via survey    Periodic/On demand
Complaints              Customer support logs   Ongoing
Social media mentions   Social platforms/APIs   Real-time

Visual:


[Customer DB] [Web Analytics] [Survey Tool] [Support Logs] [Social Media]
\ | | | /
\ | | | /
+-------------------------------------------------------------+
| Data Sources for Retail Ad Campaign |
+-------------------------------------------------------------+

3. Deciding on Data Collection Methods
Factors Influencing Methods:

Data source (internal, external, third-party)

Data type (structured, semi-structured, unstructured)

File format (CSV, JSON, XML, SQL, etc.)

Volume and frequency of data generation

Common Methods:

SQL Queries: For structured data in relational databases (e.g., customer profiles,
purchase history).

APIs: For web-based or platform data (e.g., social media feeds, some web analytics).

Web Scraping: For extracting information from web pages (e.g., competitor prices,
product reviews).

Surveys: For collecting new, specific information (e.g., customer satisfaction).

Data Streams: For real-time, continuously updating data (e.g., IoT sensors, live website
activity).

Visual: Data Collection Methods


[Relational DB] --(SQL Query)--> [Extracted Table]


[Social Media] --(API)---------> [JSON Data]
[Web Page] --(Scraping)----> [Raw HTML/Text]
[Customers] --(Survey)------> [Survey Responses]
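The SQL-query method above can be sketched in Python. Here an in-memory SQLite table stands in for the company's customer database; the table name, columns, and spend threshold are all invented for illustration:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite table stands in for the company's customer database;
# the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, state TEXT, total_spend REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Priya", "Uttar Pradesh", 1200.0),
        (2, "Amit", "Rajasthan", 800.0),
        (3, "Suman", "Rajasthan", 1500.0),
    ],
)
conn.commit()

# Extract only the profiles relevant to the campaign with a SQL query.
df = pd.read_sql_query(
    "SELECT name, state, total_spend FROM customers WHERE total_spend > 1000",
    conn,
)
```

The same `read_sql_query` call works unchanged against a real production database connection; only the connection object differs.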

4. Planning the Timeline and Sample Size


Timeline:

Some data is static (e.g., past purchases), some is dynamic (e.g., live website visits).

Decide how often to collect or refresh each data type (e.g., hourly, daily, weekly).

Sample Size:

Sometimes, you can’t collect all data—choose a representative subset (sample).

Larger samples generally yield more reliable insights, but take more resources to
process.

Visual:


[All Data] --(Sampling)--> [Sample Data for Analysis]
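Sampling a subset is one line with pandas; the transaction "population" below is simulated purely to demonstrate the idea:

```python
import pandas as pd

# Simulated "population" of 1,000 transactions (the amounts are arbitrary).
all_data = pd.DataFrame({
    "transaction_id": range(1000),
    "amount": [(i * 37) % 500 + 20 for i in range(1000)],
})

# Draw a representative random sample; random_state makes the draw reproducible.
sample = all_data.sample(n=100, random_state=42)
```

Fixing `random_state` means anyone re-running the analysis draws the same 100 rows, which keeps results reproducible.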

5. Importing Data into a Common Repository


Purpose:
Combine all identified and gathered data into a single platform for analysis.

Steps:

Import structured data into relational databases (tables with rows/columns).

Import semi-structured data (e.g., JSON, XML) into NoSQL databases.

Import unstructured data (e.g., text, images) into data lakes or specialized repositories.

Use ETL tools or data pipelines (Extract, Transform, Load) to automate and manage the
import process.

Tools:

Python, R (for scripting and automation)

ETL platforms (e.g., Talend, Apache NiFi)

Database management systems (SQL, MongoDB, Hadoop)

Visual: Data Import Workflow


[Various Data Sources] → [ETL/Data Pipeline] → [Central Repository]

Repository types:

Relational DB (for structured)

NoSQL DB (for semi-structured)

Data Lake (for unstructured/mixed)
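A minimal ETL sketch, assuming an in-memory CSV as the source, an invented fixed exchange rate of 83 INR per USD for the transform step, and SQLite as the central repository:

```python
import io
import sqlite3
import pandas as pd

# Extract: an in-memory CSV stands in for a file exported from a source system.
raw_csv = "id,amount,currency\n1,500,INR\n2,10,USD\n3,300,INR\n"
df = pd.read_csv(io.StringIO(raw_csv))

# Transform: convert all amounts to INR (assumed fixed rate of 83 INR per USD).
rate = {"INR": 1.0, "USD": 83.0}
df["amount_inr"] = df["amount"] * df["currency"].map(rate)

# Load: write the transformed table into the central repository (SQLite here).
conn = sqlite3.connect(":memory:")
df[["id", "amount_inr"]].to_sql("transactions", conn, index=False)
loaded = pd.read_sql_query("SELECT COUNT(*) AS n FROM transactions", conn)
```

ETL platforms like Talend or Apache NiFi automate and schedule the same three-stage pattern at scale.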

6. Example: Retail Ad Campaign Data Workflow


1. Identify: Need customer profiles, website activity, survey results, complaints, and social
mentions.

2. Locate:

Profiles in company DB

Web activity via Google Analytics

Surveys via SurveyMonkey

Complaints in support logs

Social mentions via Twitter API

3. Plan Collection:

Profiles: SQL query, one-time extract

Web activity: API, daily

Surveys: Collect over 2 weeks

Complaints: Ongoing, batch weekly

Social: API, real-time or daily

4. Import:

Use ETL pipeline to bring all datasets into a central data warehouse or data lake.

5. Ready for Analysis:

Clean, merge, and analyze data to create targeted ad campaign.

Visual: End-to-End Data Preparation


[Identify] → [Gather] → [Import] → [Analyze]

7. Key Takeaways
Identifying data needs is driven by your project’s main question.

Sources and methods vary by data type, format, and update frequency.

Importing data into a central repository is essential for efficient analysis.

Tools and techniques (SQL, APIs, ETL, web scraping) are chosen based on data structure
and source.

Tip for Self-Learning:


Practice by outlining a mini-project:

1. State your question.

2. List the data you need and where to find it.

3. Sketch how you’d collect and import each dataset.

4. Draw a simple diagram of your data flow from source to repository.

Visual Summary: Data Preparation Pipeline


[Define Question]

[Identify Data Needed]

[Locate & Gather Data]

[Import to Repository]

[Ready for Cleaning & Analysis]

This systematic approach ensures you start your data analytics project with the right, well-
organized data foundation.

Data Cleaning in Data Analytics: Detailed Notes with Examples and Visuals

1. Why Data Cleaning Matters
Definition: Data cleaning is the process of correcting or removing inaccurate,
incomplete, or irrelevant data from a dataset to ensure quality analysis.

Importance: Unclean data can lead to misleading results, faulty conclusions, and poor
business decisions. Cleaning is often the most time-consuming yet crucial step in data
analytics.

2. The Data Cleaning Workflow


A. Inspecting the Data

Goal: Identify errors, inconsistencies, and formatting issues.

How:

Review variable types (e.g., dates, numbers, categories).

Check for impossible or illogical values (e.g., future transaction dates).

Compare against rules (e.g., all entries in a "date" column should be valid dates).

Use data profiling and basic visualizations (boxplots, histograms, scatterplots) to spot outliers and anomalies.

Visual Example:


[Raw Data Table]


| ID | Date | Amount |
|----|-------------|--------|
| 1 | 2025-02-01 | 500 |
| 2 | 2025-01-29 | 300 |
| 3 | 123456 | 400 | ← Not a date!
| 4 | 2025-02-02 | 600 |

Visual: Use a histogram to spot outliers in "Amount".
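One way to inspect the table above with pandas: coercing invalid entries to missing values (NaT) makes them easy to count and review. The column names mirror the example table:

```python
import pandas as pd

# The raw table from the example, including the entry that is not a date.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Date": ["2025-02-01", "2025-01-29", "123456", "2025-02-02"],
    "Amount": [500, 300, 400, 600],
})

# errors="coerce" turns invalid entries into NaT so they can be counted
# and reviewed instead of crashing the conversion.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")
bad_rows = df[df["Date"].isna()]
```

`bad_rows` now contains exactly the records that failed the "must be a valid date" rule, ready for manual review or correction.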

B. Cleaning the Data

Common Tasks:

Remove Duplicates: Keep only unique records (e.g., one transaction per transaction
ID).

Format Data Consistently: Ensure all entries of a variable have the same data type
and format (e.g., all dates in DD-MM-YYYY).

Standardize Units: Convert all measurements to the same unit (e.g., all prices in
INR, all weights in kg).

Fix Typos and Inconsistencies: Standardize spelling, remove extra spaces, unify
abbreviations (e.g., "UP" and "Uttar Pradesh" to "Uttar Pradesh").

Handle Missing Values: Decide whether to drop, fill, or impute missing data.

Drop rows with missing critical values.

Impute (fill) missing values using mean, median, or a placeholder.

Visual Example:


Before Cleaning:
| Name | State |
|--------|--------------|
| Priya | Uttar Pradesh|
| Rahul | UP |
| Bhavna | Rajastan | ← Typo!
| Amit | Rajasthan |
| Suman | Rajasthan |
| Suman | Rajasthan | ← Duplicate

After Cleaning:
| Name | State |
|--------|--------------|
| Priya | Uttar Pradesh|
| Rahul | Uttar Pradesh|
| Bhavna | Rajasthan |
| Amit | Rajasthan |
| Suman | Rajasthan |
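The before/after cleaning above can be reproduced with pandas; the fix-up dictionary is a hand-built mapping for this example, not a general solution:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Priya", "Rahul", "Bhavna", "Amit", "Suman", "Suman"],
    "State": ["Uttar Pradesh", "UP", "Rajastan", "Rajasthan",
              "Rajasthan", "Rajasthan"],
})

# Unify abbreviations and fix known typos, then drop exact duplicate rows.
fixes = {"UP": "Uttar Pradesh", "Rajastan": "Rajasthan"}
df["State"] = df["State"].str.strip().replace(fixes)
df = df.drop_duplicates().reset_index(drop=True)
```

In practice the mapping is usually built by listing `df["State"].unique()` first and deciding a canonical form for each variant.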

C. Verifying the Data

Goal: Confirm that all issues identified during inspection have been addressed.

How:

Re-inspect the cleaned data.

Check summary statistics and visualizations to ensure consistency.

Compare cleaned data to original to verify changes.

Document all changes for reproducibility and transparency.

Visual Example:


[Cleaned Data Table] → [Summary Statistics] → [Validation Checklist]
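Verification can be automated with simple assertion checks; this sketch uses a tiny invented table and a few illustrative rules:

```python
import pandas as pd

# A tiny cleaned table; the rules below are illustrative, not exhaustive.
df = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "amount": [500, 300, 600],
})

# Validation checks: unique IDs, no missing amounts, no non-positive amounts.
assert df["transaction_id"].is_unique
assert df["amount"].notna().all()
assert (df["amount"] > 0).all()
```

If any check fails, the script stops immediately, which is exactly the behavior you want before passing data downstream.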

3. Common Data Cleaning Issues & Solutions

Issue Example Solution

Duplicate records Two identical transactions Remove duplicates

Inconsistent formats "2025-02-01", "01/02/2025", "123456" Standardize to one date format

Mixed units Prices in INR and USD Convert all to INR

Typos/abbreviations "UP", "Uttar Pradesh", "UttarPradesh" Standardize spelling

Missing values Blank age field Drop or impute value

Extra spaces/characters " Rajasthan ", "Rajasthan" Trim spaces

4. Tools for Data Cleaning


Excel/Spreadsheets:

Remove duplicates, format cells, find/replace, trim spaces, spell check, pivot tables.

Google DataPrep:

Cloud-based, handles structured and unstructured data, auto-detects anomalies.

Python (pandas, numpy):

Automate cleaning, handle large datasets, merge, transform, impute, and validate
data.

R (dplyr, data.table):

Data wrangling, merging, and cleaning with straightforward syntax.

Visual:


[Excel Icon] [Python Logo] [R Logo] [Google DataPrep]

5. Practical Example: Cleaning a Transaction Dataset


Step 1: Inspect

Find duplicate transaction IDs.

Check "purchase date" for non-date entries.

Scan "price" for mixed currencies.

Step 2: Clean

Remove duplicate rows.

Convert all "purchase date" entries to DD-MM-YYYY.

Convert all prices to INR.

Standardize state names: "UP" → "Uttar Pradesh".

Fix typos: "Rajastan" → "Rajasthan".

Step 3: Verify

Count unique transaction IDs.

Check for missing or out-of-range values.

Document all changes.

6. Visual Workflow Summary

[Raw Data]

[Inspect: Spot errors, visualize]

[Clean: Remove duplicates, fix formats, handle missing]

[Verify: Re-check, summarize, document]

[Analysis-Ready Data]

7. Key Takeaways
Data cleaning is essential for reliable analytics.

Steps: Inspect → Clean → Verify.

Common tasks: Remove duplicates, standardize formats/units, fix typos, handle missing
values.

Use tools like Excel, Python (pandas), R, and Google DataPrep for efficient cleaning.

Always document your cleaning process for transparency and reproducibility.

Tip for Self-Learning:


Practice by taking a messy dataset (e.g., a downloaded CSV with typos, missing values, and
mixed formats) and walk through these steps using Excel or Python. Visualize before and
after cleaning to see the impact!

Data Cleaning: Detailed Notes with Examples and Visuals

1. What is Data Cleaning?


Data cleaning is the process of identifying and correcting errors, inconsistencies, and
missing values in your dataset to ensure it is accurate, consistent, and ready for analysis.
Clean data is essential for meaningful analytics and reliable results.

2. Common Steps in Data Cleaning

A. Remove Duplicate Observations

Why: Duplicates can skew your analysis.

How: Identify records with the same unique identifier (e.g., transaction ID, customer ID)
and keep only one.

Example Table:

Transaction ID Customer Amount

1001 Amit 500

1002 Priya 300

1001 Amit 500

Visual:


[Raw Data] → [Remove Duplicates] → [Clean Data]

B. Standardize and Format Data

Why: Inconsistent formats make analysis difficult.

How:

Ensure all dates are in the same format (e.g., DD-MM-YYYY).

Standardize text entries (e.g., 'M', 'Male', 'MALE' → 'Male').

Convert column names to a consistent style (e.g., replace spaces with underscores,
use title case).

Python Example:

python

# Standardize column names: trim whitespace, replace spaces, use title case
df.columns = [col.strip().replace(" ", "_").title() for col in df.columns]

Visual:


[Date: 12/01/2025, 2025-01-12, 12-01-2025] → [Standardized: 12-01-2025]

C. Handle Missing Data

Why: Missing values are common and, if not handled, can bias results or break analysis.

How:

1. Identify missing data:

Blanks, 'NA', '.', '-', or impossible values (e.g., age = -1 or 999).

2. Decide how to handle:

Fill (Impute): Use a placeholder (e.g., 'NA', 'Not Applicable'), mean/median for
numeric, or a special value.

Delete: Remove records if crucial information is missing.

Examples:

Fill: If 'location' is missing, fill with 'Not Applicable'.

Delete: If 'product ID' is missing in a sales dataset, drop the record.

Visual:

[Raw Data: Missing Values] → [Impute or Drop] → [Clean Data]
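A sketch of both strategies, drop versus impute, using pandas on a small invented table ('product_id' is treated as the critical field):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product_id": ["P1", None, "P3", "P4"],
    "age": [25, 34, np.nan, 41],
    "location": ["Delhi", "Mumbai", None, "Jaipur"],
})

# Drop rows missing the critical field, impute numeric gaps with the median,
# and fill non-critical text fields with a placeholder.
df = df.dropna(subset=["product_id"])
df["age"] = df["age"].fillna(df["age"].median())
df["location"] = df["location"].fillna("Not Applicable")
```

The median is often preferred over the mean for imputation because it is less sensitive to outliers.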

D. Fix Errors and Inconsistencies

Why: Errors (like negative ages) and inconsistencies (like mismatched values across
datasets) can mislead analysis.

How:

Use common sense and domain knowledge to spot impossible or unlikely values
(e.g., age = -5 or 150).

Cross-check with other sources if possible.

Replace errors with correct values or mark as missing if unsure.

Examples:

Age = -3 → Mark as missing or correct if possible.

If 'Devesh' has different ages in two datasets, verify and fix.

Visual:


[Age: -5] → [Error Detected] → [Mark as Missing or Correct]

E. Transform and Encode Data

Why: Data often needs to be transformed for analysis or modeling.

How:

Continuous Variables:

Standardization (Z-score): z = (xᵢ − x̄) / s

Min-Max Scaling: x′ = (xᵢ − min(x)) / (max(x) − min(x))

Log Transformation: x′ = log(xᵢ)

Categorical Variables:

One-hot encoding: Convert categories into binary columns.

Label encoding: Assign integers to categories.

Ordinal encoding: Map ordered categories to ranked integers.

Example Table (One-hot encoding):

Gender Gender_Male Gender_Female

Male 1 0

Female 0 1

Visual:


[Color: Red, Blue, Green] → [Red=1, Blue=2, Green=3] (Label Encoding)
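A short pandas sketch of these transformations; the amounts are invented, and `pd.get_dummies` is one common way to one-hot encode. Note that pandas' `std()` is the sample standard deviation:

```python
import pandas as pd

df = pd.DataFrame({"amount": [100.0, 200.0, 700.0],
                   "gender": ["Male", "Female", "Male"]})

# Z-score standardization: z = (x - mean) / s (pandas std() is the sample std).
df["amount_z"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Min-max scaling to the [0, 1] range.
rng = df["amount"].max() - df["amount"].min()
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / rng

# One-hot encoding: the gender column becomes binary indicator columns.
df = pd.get_dummies(df, columns=["gender"])
```

After `get_dummies`, the original `gender` column is replaced by `gender_Female` and `gender_Male` indicator columns.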

F. Combine Datasets (Merging/Joining)

Why: Real-world analysis often requires data from multiple sources.

How:

Use unique keys (e.g., customer ID, school code) to join datasets.

Types of joins: one-to-one, one-to-many, many-to-one.

Example Visual:


[Table 1: Customer Info] + [Table 2: Purchases] → [Merged Data on Customer ID]
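The join above, sketched with pandas; the tables and key values are invented. An inner join keeps only customers that appear in both tables:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Priya", "Amit", "Suman"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3],
                          "amount": [500, 300, 700]})

# One-to-many join on the shared key; "inner" keeps only customers
# that have at least one purchase.
merged = pd.merge(customers, purchases, on="customer_id", how="inner")
```

Switching `how="inner"` to `"left"` would keep Amit (customer 2) with a missing purchase amount instead of dropping him.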

3. Handling Missing Data: Key Considerations
Understand the reason: Missingness may be due to privacy, non-response,
inapplicability, or technical errors.

Consistent indicators: Use a standard placeholder (e.g., 'NA') for all missing entries.

Be cautious: Deleting records with missing data can bias results, especially if
missingness is not random.

Example:

Missing 'years of schooling' for infants → fill with 'Not Applicable'.

Missing 'product ID' in a sales record → drop the record.

4. Data Cleaning in Practice: Example Workflow


Step-by-Step Example: Cleaning a Retail Transaction Dataset

1. Remove duplicates: Only one record per transaction ID.

2. Standardize formats: All dates in DD-MM-YYYY, all state names in full form.

3. Handle missing values:

If 'location' missing, fill with 'NA'.

If 'product ID' missing, drop the row.

4. Fix errors:

Negative or absurd ages marked as missing.

Check for mismatches in customer info across datasets.

5. Transform variables:

Standardize purchase amounts.

Encode payment mode (e.g., 'Cash'=1, 'Card'=2).

6. Merge with product info:

Join with product table using 'product ID'.

Visual Workflow:

[Raw Data]

[Remove Duplicates]

[Standardize Formats]

[Handle Missing Values]

[Fix Errors]

[Transform/Encode]

[Merge Datasets]

[Analysis-Ready Data]

5. Visual Techniques for Data Cleaning


Boxplots: Detect outliers in numeric data.

Histograms: Spot skewness or gaps in distributions.

Feature Engineering in Data Analytics: Detailed Notes with Examples and Visuals

1. What is Feature Engineering?


Feature engineering is the process of creating new variables (features) from existing raw
data to enhance the quality and effectiveness of your data analysis or predictive models. It
helps uncover hidden patterns and relationships that may not be obvious in the original
dataset.

2. Why is Feature Engineering Important?

Unlocks hidden insights: Extracts more information from your data without collecting
new data.

Improves model performance: Well-crafted features can make predictive models more
accurate.

Enables domain-driven analysis: Incorporates expert knowledge to create features that


matter for your specific problem.

3. Feature Engineering: Key Approaches and Examples

A. Extracting Information from Existing Features

Example 1: Date to Day of Week and Weekend/Weekday

Raw Feature: Purchase Date (e.g., "2025-02-01")

Engineered Features:

Day of Week: Monday, Tuesday, ..., Sunday

Is Weekend: Yes/No (or 1/0)

Why: Retailers may see more transactions on weekends; knowing the day can reveal
sales patterns.

Visual:

Purchase Date Day of Week Is Weekend

2025-02-01 Saturday Yes

2025-02-03 Monday No

B. Mathematical Transformations

Example 2: Calculating BMI from Height and Weight

Raw Features: Height (in meters), Weight (in kg)

Engineered Feature: BMI = Weight / (Height²)

Why: BMI is more informative for health analysis than height or weight alone.

Visual:

Name Height (m) Weight (kg) BMI

Priya 1.60 60 23.4

Amit 1.75 75 24.5
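The BMI column can be computed directly from the two raw columns; the values below reproduce the example table:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Priya", "Amit"],
                   "height_m": [1.60, 1.75],
                   "weight_kg": [60, 75]})

# BMI = weight / height squared, rounded to one decimal place.
df["bmi"] = (df["weight_kg"] / df["height_m"] ** 2).round(1)
```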

C. Aggregating Data

Example 3: Total Transaction Amount in Wide Format

Raw Data (Wide Format): Amount spent on each product in a transaction

Engineered Feature: Total Amount = Sum of all product amounts in that transaction

Why: Useful for analyzing transaction value or customer spending.

Visual:

Transaction ID Amount 1 Amount 2 Amount 3 Total Amount

53006 40 60 - 100

53668 80 - - 80
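The wide-format totals above can be computed by summing across the amount columns; missing amounts (blanks in the table) are represented as NaN and skipped by default:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    "transaction_id": [53006, 53668],
    "amount_1": [40, 80],
    "amount_2": [60, np.nan],
    "amount_3": [np.nan, np.nan],
})

# Sum across the amount columns; NaN (blank) entries are skipped by default.
amount_cols = ["amount_1", "amount_2", "amount_3"]
wide["total_amount"] = wide[amount_cols].sum(axis=1)
```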

D. Categorizing or Binning Data

Example 4: Creating Age Groups

Raw Feature: Age (continuous variable)

Engineered Feature: Age Group (e.g., 0-18: Child, 19-35: Young Adult, 36-60: Adult, 60+:
Senior)

Why: Age groups may show different behaviors or preferences.

Visual:

Age Age Group

17 Child

28 Young Adult

62 Senior
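Binning like the table above is what `pd.cut` does; the bin edges and labels below encode the example age groups (each bin's upper edge is inclusive by default):

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 28, 62]})

# Bin the continuous variable into labelled groups; each bin's upper edge
# is inclusive by default, so 18 still falls into "Child".
bins = [0, 18, 35, 60, 120]
labels = ["Child", "Young Adult", "Adult", "Senior"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)
```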

E. Combining Features

Example 5: Creating a “High Value Customer” Flag

Raw Features: Total Spend, Number of Purchases

Engineered Feature: High Value Customer (Yes/No, based on thresholds)

Why: Helps target marketing or loyalty programs.

4. Hierarchical Data and Feature Engineering


Hierarchical data has multiple levels (e.g., dates > transactions > products). Feature
engineering can be performed at any level:

Transaction Level: Total amount, number of products per transaction

Date Level: Total sales per day, number of transactions per day

Visual: Hierarchical Structure


[Date]
└── [Transaction]
└── [Product 1]
└── [Product 2]

5. Long vs. Wide Data Format

Long Format: Each row is a single observation (e.g., one product in one transaction).

Wide Format: Each row is a higher-level unit (e.g., one transaction), with multiple
columns for each sub-feature (e.g., Amount 1, Amount 2, ...).

Visual Example:

Long Format:

Transaction ID Product Amount

53006 Banana 40

53006 Grapes 60

Wide Format:

Transaction ID Product 1 Amount 1 Product 2 Amount 2

53006 Banana 40 Grapes 60
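The two formats can be converted into each other with `pivot` and `melt`; this sketch uses the example transaction:

```python
import pandas as pd

long_df = pd.DataFrame({"transaction_id": [53006, 53006],
                        "product": ["Banana", "Grapes"],
                        "amount": [40, 60]})

# Long -> wide: one row per transaction, one column per product.
wide_df = long_df.pivot(index="transaction_id", columns="product",
                        values="amount")

# Wide -> long again with melt; reset_index turns the index back into a column.
back = wide_df.reset_index().melt(id_vars="transaction_id",
                                  value_name="amount").dropna()
```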

6. Feature Engineering and Domain Knowledge


Domain knowledge helps identify which new features may be most useful.

For example, a retailer knows that “weekend” transactions might be higher, so engineering a “Is Weekend” feature makes sense.

7. Practical Tips for Feature Engineering


Always document new features and how you created them.

Use visualizations to check if engineered features reveal new patterns.

Test if new features actually improve analysis or model performance.

8. Summary Table: Feature Engineering Examples

Raw Feature(s) Engineered Feature Why Useful?

Purchase Date Day of Week, Is Weekend Reveal sales patterns

Height, Weight BMI Health analysis

Product Amounts Total Transaction Amount Customer value analysis

Age Age Group Segment customers

Spend, Purchases High Value Flag Targeted marketing

9. Key Takeaways
Feature engineering transforms raw data into more meaningful features.

It leverages domain expertise and creativity.

Well-engineered features can uncover hidden patterns and improve analytics or machine
learning results.

Practice by looking for new features you can create from any dataset you work with.

Tip for Self-Learning:


Take a sample dataset (e.g., transactions with dates and amounts). Try to engineer at least
three new features (like “day of week,” “total amount,” or “is weekend”) and visualize their
impact on your analysis!

Descriptive Analytics and Data Visualization: Detailed Notes with Examples and Visuals

1. Understanding Descriptive Analytics


Descriptive analytics is the process of summarizing and interpreting historical data to answer
the question: “What happened?” It is the foundation of data analysis, providing the first
layer of insight from your dataset.

A. Key Techniques in Descriptive Analytics

Summary Statistics:

Mean (Average): The typical value in your dataset.


Example: If your transaction amounts are ₹22, ₹56, and ₹140, the mean is (22 + 56 + 140)/3 = ₹72.67.

Median: The middle value when data is sorted.

Mode: The most frequently occurring value.

Frequency Counts: How many times each value or category appears. Example: If
“banana” is purchased 4 times, its frequency is 4.

Range:

Difference between the largest and smallest values.


Example: If transaction values range from ₹22 to ₹140, the range is 140 − 22 = ₹118.

Sorting and Filtering:

Arranging data in ascending or descending order helps identify minimum and maximum values quickly.

B. Example: Retail Transactions Dataset

Suppose you have a dataset with these columns:

Date of purchase

Transaction ID

Product ID

Product name

Amount

Sample Data Table:

Transaction ID Product Amount (₹)

1001 Banana 22

1002 Grapes 56

1003 Potato 140

1004 Banana 22

1005 Potato 56

Frequency Table Example:

Product Frequency

Banana 2

Grapes 1

Potato 2

Insights:

Potatoes and bananas are the most frequently purchased items.

The smallest transaction value is ₹22 (appears twice).

The largest transaction value is ₹140 (appears once).

The average transaction value is (22 + 56 + 140 + 22 + 56)/5 = ₹59.2.
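The summary statistics above can be reproduced with pandas using the five example transactions:

```python
import pandas as pd

amounts = pd.Series([22, 56, 140, 22, 56])
products = pd.Series(["Banana", "Grapes", "Potato", "Banana", "Potato"])

mean = amounts.mean()                        # average transaction value: 59.2
value_range = amounts.max() - amounts.min()  # range: 140 - 22 = 118
freq = products.value_counts()               # frequency of each product
```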

2. Feature Engineering and Data Combination


Feature Engineering:
Creating new variables from existing data to make analysis easier or more insightful.
Example:

Total Transaction Amount: Sum all product amounts in a transaction.

Payment Mode Analysis: Combine transaction data with a payment mode dataset
to analyze trends by payment type.

Combining Data Sets:
Joining datasets (e.g., transaction details with payment modes) allows for deeper
insights, such as average transaction value by payment method.

Combined Data Table Example:

Transaction ID Total Amount (₹) Payment Mode

1001 22 UPI

1002 56 Cash

1003 140 Credit Card

Average by Payment Mode (illustrative values computed over the full dataset, not just the three rows shown above):

Payment Mode Average Value (₹)

Credit Card 113.3

UPI 31.3

Cash 24.3

Insight: Higher-value transactions are more likely to be paid by credit card.
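Grouped averages like these come from a groupby; here the calculation runs over just the three example transactions, so the numbers reflect only those rows:

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "total_amount": [22, 56, 140],
    "payment_mode": ["UPI", "Cash", "Credit Card"],
})

# Average transaction value per payment mode.
avg_by_mode = df.groupby("payment_mode")["total_amount"].mean()
```

With more transactions per mode, the same one-liner produces the full averages table.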

3. Data Visualization: Turning Numbers into Insights

A. Why Visualize Data?

Makes complex data easier to understand.

Reveals patterns, trends, and outliers that summary statistics may miss.

Communicates findings clearly to others.

B. Elements of a Good Data Visualization

Informative Title: Clearly states what the chart shows.

Labeled Axes: Both axes should be labeled with variable names and units.

Legend: Explains colors, shapes, or patterns used in the chart.

Aesthetics: Clean, uncluttered, and uses color purposefully.

Example Visual: Frequency of Product Purchases (Bar Chart)


| Frequency
| ■■■■■■■■■■■■■
| ■■■■■■■■■■■■■
| ■■■■■■■
|-------------------------
Potato Banana Grapes

Example Visual: Transaction Amounts by Payment Mode (Bar Chart)


| Avg Value (₹)


| ■■■■■■■■■■■■■■■■■■■■■■■■
| ■■■■■■■■■■■
| ■■■■■■
|-------------------------------
Credit UPI Cash
Card
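A minimal matplotlib version of the first bar chart, showing the elements of a good visualization (informative title, labeled axes); the Agg backend renders without a display:

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without a display
import matplotlib.pyplot as plt

products = ["Potato", "Banana", "Grapes"]
frequency = [2, 2, 1]

fig, ax = plt.subplots()
ax.bar(products, frequency)
ax.set_title("Frequency of Product Purchases")  # informative title
ax.set_xlabel("Product")                        # labeled x-axis
ax.set_ylabel("Number of Purchases")            # labeled y-axis

buf = io.BytesIO()
fig.savefig(buf, format="png")  # write the chart to an in-memory buffer
```

Replacing the buffer with a filename (e.g. `fig.savefig("chart.png")`) saves the chart to disk for a report.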

C. The Power of Visualization: Anscombe’s Quartet

Four datasets can have identical summary statistics (mean, median, etc.) but look very
different when graphed.

Lesson: Always visualize your data; it reveals patterns that numbers alone can’t.

4. Practical Example: Step-by-Step Descriptive Analytics


1. Summarize Data:

Calculate mean, median, mode, range.

Count frequency of each product.

2. Sort and Filter:

Find the smallest and largest transaction values.

3. Combine Datasets:

Join transaction data with payment mode data.

4. Group and Aggregate:

Calculate average transaction value by payment mode.

5. Visualize:

Create bar charts or histograms to display frequencies and averages.

5. Key Takeaways
Descriptive analytics provides the first, essential look at your data.

Use summary statistics and frequencies to understand what happened.

Combine and engineer features for deeper insights.

Visualization is crucial: use clear, well-labeled charts to reveal and communicate patterns.

Always balance clarity (science) and aesthetics (art) in your visualizations.

Tip for Self-Learning:


Practice by taking a small dataset, calculating summary statistics, creating a frequency table,
and drawing a simple bar chart by hand or in Excel. Then, combine with another dataset (like
payment mode) and see what new insights you can uncover!
