💡 What is Data Science? – Explained Simply
Data Science is a process that uses a range of techniques to understand data and extract useful meaning from it.
It helps us find patterns, relationships, and hidden information in both small and large sets
of data.
These techniques draw on mathematics, statistics, and computer science to support better decisions.
The word “science” means it is based on proper steps and past experiences (not just
guesses).
So, Data Science is the smart use of data to solve real-world problems.
⭐ Features of Data Science
1. Finding Meaningful Patterns in Data
It helps us find new, useful, and understandable patterns.
These patterns can be used to make decisions and should work for both:
o The current data
o New or future data
These patterns are found through several steps and are often previously unknown.
Once we get a useful pattern, it helps users take better actions or improve results.
➤ Tools & Techniques Used:
| Tool | What It Does | Example |
| --- | --- | --- |
| Data Mining | Finds patterns using machine learning & statistics | A bank finds fraud patterns from past transactions |
| Statistical Analysis | Finds relationships and trends in data | A company studies a sales increase due to offers |
| Machine Learning | Learns from data to predict or classify | Netflix recommends shows you may like |
| Data Visualization | Shows patterns using graphs/charts | Sales chart over time |
| Feature Engineering | Chooses the best input data to improve models | Selecting age, income, location to predict spending |
| Pattern Recognition | Detects regular and unusual trends | Recognizing handwriting or voice commands |
✅ Real-Life Examples:
Grouping online customers by their buying habits.
Detecting fake transactions using spending patterns.
Predicting machine breakdowns in factories.
Finding which items are bought together (like bread & butter).
2. Building Models to Represent the System
A model explains how different things in the system are related, and it helps us understand how changes in one factor affect others.
Example:
A loan model may show how your income and credit score affect loan approval and interest rates.
Once the model is ready, it can:
Predict outcomes for new cases
Explain how inputs (like age, salary) affect the result (like loan approval)
3. Mix of Statistics, Machine Learning & Computing
Data Science brings together:
o Statistics – to analyze data and find conclusions
o Machine Learning – to build models that learn and predict
o Computing – to handle and process big amounts of data quickly
Example:
Amazon uses all three to:
Store customer data (computing)
Analyze their shopping behavior (statistics)
Suggest products (machine learning)
4. Learning Algorithms (Machine Learning)
These are smart programs that learn from past data and improve over time.
| Type | Description | Examples |
| --- | --- | --- |
| Supervised Learning | Learns from labeled data (with correct answers) | Predicting house prices using area, location, etc. |
| Unsupervised Learning | Learns from data without labels; finds hidden patterns | Grouping customers by shopping behavior |
| Reinforcement Learning | Learns by trying things and getting rewards/punishments | A robot learning to walk or play a game |
Benefits of Learning Algorithms:
Work well with big, complex data
Provide accurate predictions
Get better as they learn more
Used in many fields like health, finance, media, etc.
5. Other Related Fields of Data Science
a. Descriptive Statistics
Summarizes and describes data using averages, percentages, etc.
Helps to understand basic structure of data.
Example:
Finding the average marks of students in a class.
b. Exploratory Visualization
Uses graphs and charts to explore data and find trends.
Helps spot patterns that aren’t clear in numbers.
Example:
A line graph showing monthly sales increase.
c. Dimensional Slicing (OLAP – Online Analytical Processing)
Cuts and filters data based on dimensions like region, time, or category.
Helps in deep analysis by comparing slices of data.
Example:
Sales data sliced by month and region to find the best-performing area.
d. Hypothesis Testing
A method to test ideas or assumptions about data.
Helps to know if a claim about data is true or false.
Steps:
1. Make a statement (like: "more ads increase sales")
2. Pick a test method
3. Check with data
4. Accept or reject the idea based on results
Example:
Testing if a new teaching method improves student scores.
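A minimal Python sketch of these four steps, using a standard two-sample t-test from SciPy. The two score lists are made-up values for illustration:

```python
# Hedged sketch: hypothetical exam scores for two groups of students.
from scipy import stats

old_method = [62, 65, 70, 58, 66, 71, 64]  # taught the old way (invented data)
new_method = [72, 75, 69, 80, 74, 78, 71]  # taught the new way (invented data)

# Steps 1-3: the claim is "the new method improves scores";
# ttest_ind checks whether the two group means differ significantly.
t_stat, p_value = stats.ttest_ind(new_method, old_method)

# Step 4: accept or reject the idea based on the result.
if p_value < 0.05:
    print(f"Reject the 'no difference' idea (p = {p_value:.4f})")
else:
    print(f"Not enough evidence of a difference (p = {p_value:.4f})")
```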
e. Data Engineering
Deals with storing, preparing, and moving data to make it ready for analysis.
It uses big data tools like:
o Hadoop – stores large data
o Spark – processes data fast
o Kafka – handles real-time data transfer
Example:
Setting up systems to collect website visitor data and prepare it for analysis.
f. Business Intelligence (BI)
Uses tools to understand past and current data.
Helps companies take smart decisions using dashboards and reports.
Combining BI with Data Science helps in understanding:
o What happened (past)
o What can happen (future)
Example:
A manager views a sales dashboard to plan next month’s marketing.
DATA SCIENCE – CLASSIFICATION
Data science problems are mainly divided into two types: supervised learning and
unsupervised learning.
Supervised learning means the machine is trained using data that already has answers
(labels).
Each input has a known output.
Unsupervised learning means the machine is given data without any answers.
It tries to find hidden patterns or groups on its own.
Supervised Learning
Just like a teacher helps a student, here we guide the machine using labelled data.
The machine learns from example data and then makes predictions.
Example:
Suppose we have a basket of fruits.
The machine looks at a fruit’s shape, colour, and texture.
It compares it with fruits it already learned about.
If the new fruit looks like an apple, it says, “This is an apple.”
Uses of Supervised Learning:
1. Image classification: Tells if a photo is of a dog, cat, car, etc.
2. Medical diagnosis: Helps doctors detect diseases using reports or scans.
3. Fraud detection: Catches fake transactions in banks.
4. NLP (Natural Language Processing): Used in translation, summarizing text, and
understanding emotions in sentences.
Unsupervised Learning
Here, there is no teacher or label to guide the machine.
The machine just looks at the raw data and tries to find patterns, similarities, or groups.
Example:
If we give animal data without names, the machine might group them based on their
features like size, diet, or behaviour — even if it doesn’t know the names.
Types of Data Science Problems:
1. Classification
2. Regression
3. Clustering
4. Association Analysis
5. Anomaly Detection
6. Recommendation Engines
7. Feature Selection
8. Time Series Forecasting
9. Deep Learning
10. Text Mining
1. Classification
Predicts categories like Yes/No, Spam/Not Spam, etc.
The output is a label, not a number.
Example:
A bank checks a person’s age, income, credit score to decide:
o Output: Yes (Give Loan) or No (Don’t give)
Used in:
Email spam filters
Disease detection (COVID or not)
Predicting student results (Pass/Fail)
2. Regression
Predicts a number (not category).
Example:
A company wants to predict house prices.
o Input: Area, number of rooms, location
o Output: ₹30,00,000
Used in:
Predicting sales
Temperature forecasting
Calculating interest rates
3. Clustering
Groups similar things without using any labels (unsupervised learning).
Example:
An online shop groups customers by buying habits:
o Group 1: Buys baby items
o Group 2: Buys electronics
o Group 3: Buys groceries
Used in:
Market segmentation
Targeting customers
Grouping similar news/articles
Note:
Since the machine finds the groups by itself, we need to understand and name the groups later.
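A minimal sketch of this idea using scikit-learn's KMeans. The customer spending table is invented for illustration, and (as noted above) we still have to name the groups ourselves:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [baby items, electronics, groceries] spend (invented).
customers = np.array([
    [900,  50, 100], [850,  30, 120],   # look like baby-item buyers
    [ 20, 950,  80], [ 10, 880,  60],   # look like electronics buyers
    [ 30,  40, 700], [ 50,  20, 650],   # look like grocery buyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # a cluster number per customer; naming the groups is up to us
```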
4. Association Analysis (Market Basket Analysis)
Finds which items are often bought together.
Example:
People who buy bread also buy butter.
→ Store puts them together or gives a combo offer.
Used in:
Grocery store combos
Amazon "Frequently bought together"
Suggesting matching products
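A minimal sketch of the core idea behind market basket analysis: counting which item pairs occur together. The baskets are invented; real systems score pairs with measures like support and confidence on far larger data:

```python
from collections import Counter
from itertools import combinations

baskets = [  # invented shopping baskets
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # every item pair in the basket
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # ('bread', 'butter') comes out on top
```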
5. Anomaly Detection (Outlier Detection)
Finds data that looks unusual or suspicious.
Example:
A person in India suddenly uses their credit card in France.
→ May be fraud → System blocks the card.
Used in:
Catching fraud
Detecting network security threats
Health alerts (like abnormal heartbeat)
6. Recommendation Engines
Suggests things based on what you or others like.
Example:
Netflix says: "Because you watched Money Heist..."
Amazon: "Customers who bought this also bought..."
Used in:
Online shopping (Amazon)
Music and movies (Spotify, Netflix)
Food apps (Zomato, Swiggy)
7. Feature Selection
Picks only the important inputs to use in a model.
This makes the model better and faster.
Example:
To predict marks:
o Useful: Hours studied, attendance
o Not useful: Favourite colour
→ "Favourite colour" is removed
Used in:
Making accurate models
Saving time and memory
Ignoring useless data
8. Time Series Forecasting
Predicts future values using past time-based data.
Example:
Shopkeeper checks last 12 months’ sales to guess next month’s sales.
Used in:
Stock market predictions
Weather reports
Planning business needs
9. Deep Learning
A smart method where machines learn from big, complex data like images, videos, and
speech.
It uses neural networks.
Examples:
Phone unlocks using face
Voice assistants understand speech (Google Assistant)
ChatGPT understands and replies like a human
Used in:
Face recognition
Alexa, Siri, Google Assistant
Self-driving cars
10. Text Mining (Text Analytics / NLP)
Helps machines understand and use text like reviews, emails, or chats.
Example:
A company reads 1000 customer reviews to see:
o Are people happy or angry?
o Which product has problems?
Used in:
Understanding customer emotions
Detecting spam messages
Chatbots and virtual assistants
1. Generic Data Science Process
Data science is a step-by-step method to find useful patterns and relationships in data. It’s done
through a process that keeps repeating until the best results are found.
Steps in the Process:
1. Understand the Problem – Know what you're trying to solve.
2. Prepare the Data – Collect and clean the data.
3. Build the Model – Create a model using the data.
4. Test the Model – Apply it to see how it performs on data that resembles the real world.
5. Deploy the Model – Use it in real applications and keep improving it.
CRISP-DM Framework:
CRISP-DM (Cross Industry Standard Process for Data Mining) is the most commonly used
structure for data science.
It was made by companies that worked on data mining.
Other frameworks include:
o SEMMA: Sample, Explore, Modify, Model, Assess (from SAS)
o DMAIC: Define, Measure, Analyze, Improve, Control (used in Six Sigma)
o KDD Process: Selection, Preprocessing, Transformation, Data Mining, Interpretation,
and Evaluation
1. Prior Knowledge
What it Means:
It is the knowledge we already have before starting a data science project.
This includes:
o Business knowledge (e.g., how loans work)
o Data understanding (e.g., what credit scores mean)
Why It’s Important:
Helps ask the right questions.
Helps collect the right data.
1.1 Objective – What is the Goal?
Every project starts with a question.
Example:
A loan company wants to know:
“Can we predict interest rates for a new customer based on their credit score?”
So, the goal is:
“Predict interest rate using credit score.”
A clear objective helps in:
Choosing the right data
Picking the correct model
1.2 Subject Area – Know the Business
Knowing the business context is essential.
Example:
A person with a high credit score → Low interest rate
A person with a low credit score → High interest rate
So, we know:
Credit score affects interest rate
We need to collect both values for past customers
1.3 Understand the Dataset
You must understand:
How the data was collected
If data is complete or missing
If the amount and quality of data are good
Terms:
Dataset: A table (rows = data points, columns = attributes)
Data Point: A single row (e.g., person with credit score 500 and rate 7.31%)
Attribute: A column (e.g., Credit Score)
Label: What we want to predict (e.g., Interest Rate)
Identifier: Just used to identify, not for prediction (e.g., Borrower ID)
Example Table:
| Borrower ID | Credit Score | Interest Rate |
| --- | --- | --- |
| 01 | 500 | 7.31% |
| 02 | 600 | 6.70% |
| 11 (New) | 625 | ?? |
We use past data to predict the interest rate for a new borrower with a credit score of 625.
1.4 Correlation vs. Causation
Correlation: Two things move together.
o Example: Higher credit score → Lower interest rate.
Causation: One thing directly causes the other.
You must be careful not to mix them.
Wrong Example:
Predicting credit score based on interest rate — doesn’t make business sense.
So, always frame the question correctly using domain knowledge.
2. Data Preparation
Before using data, we must clean and organize it. This step is time-consuming but important.
2.1 Data Exploration
Helps understand what the data looks like.
Tools used:
Descriptive statistics: mean, median, min, max
Graphs: scatter plots, histograms
Example:
As credit score increases, interest rate decreases
2.2 Data Quality
Ensure the data is accurate.
Problems:
Duplicate data
Wrong values (e.g., score = 900, but max is 850)
Typing mistakes
Solutions:
Remove duplicates
Fix or delete wrong values
Replace missing values
2.3 Missing Values
Sometimes, some data is missing.
Solutions:
Fill missing values with average/min/max
Remove rows with missing data (if not many)
Use models that can handle missing data
Example:
If credit score is missing:
Fill with average score
Or remove that row
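A minimal pandas sketch of both fixes, on a made-up credit-score table:

```python
import pandas as pd

df = pd.DataFrame({"credit_score": [500, 600, None, 720],   # one score missing
                   "interest_rate": [7.31, 6.70, 6.95, 5.90]})

filled = df.fillna({"credit_score": df["credit_score"].mean()})  # fill with average
dropped = df.dropna()                                            # or remove that row
print(filled, dropped, sep="\n\n")
```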
2.4 Data Types and Conversion
Different data types:
Numeric: Credit score (600)
Categorical: Credit rating (Good, Poor)
Conversion:
“Good” → 600
“Excellent” → 800
Some models only work with numbers, so categories must be converted to numbers (encoding); the reverse process of turning numbers into categories is called binning.
2.5 Transformation
Some models need data on the same scale.
Example:
Credit score is 600
Income is ₹1,00,000
Use normalization to scale both between 0 and 1 so one doesn't dominate the other.
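A minimal sketch of min-max normalization, with made-up scores and incomes:

```python
import numpy as np

scores = np.array([500.0, 600.0, 700.0, 850.0])                 # invented credit scores
incomes = np.array([40_000.0, 100_000.0, 250_000.0, 60_000.0])  # invented incomes

def min_max(x):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    return (x - x.min()) / (x.max() - x.min())

print(min_max(scores))   # both attributes now live on the same 0-1 scale
print(min_max(incomes))
```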
2.6 Outliers
Outliers = very unusual data points.
Example:
Income = ₹1 crore → May be real or error
Height = 1.73 cm → Mistake
Action:
Fix or remove outliers
Outlier detection is also useful in fraud detection.
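A minimal sketch of one common outlier rule, the 1.5 × IQR fence (the same quartile idea used for box plots later in these notes), on made-up incomes:

```python
import numpy as np

incomes = np.array([30_000, 35_000, 32_000, 28_000, 31_000, 10_000_000])

q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
is_outlier = (incomes < q1 - 1.5 * iqr) | (incomes > q3 + 1.5 * iqr)
print(incomes[is_outlier])  # flags the 1-crore income for review
```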
2.7 Feature Selection
Pick the most useful attributes.
Example:
"Income" and "Tax Paid" may give similar info → Keep one
Reduces model complexity and increases speed.
2.8 Data Sampling
Use a small part of the data that represents the whole dataset. Helps speed up testing and training.
3. Model
A model is a formula that shows how things relate.
Example:
Higher credit score → Lower interest rate
Steps:
1. Training Data – Used to build the model
2. Build Model – Create the formula (e.g., y = a * x + b)
3. Test Data – Check how well the model works
4. Evaluation – Compare actual vs predicted results
5. Final Model – Used in real-world application
3.1 Training and Testing Datasets
Split data into:
Training Data: To build the model
Test Data: To check accuracy
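A minimal sketch of the whole split-build-test loop for the credit-score example, fitting a straight line y = a * x + b. All numbers are invented:

```python
import numpy as np

scores = np.array([500, 520, 560, 600, 640, 680, 720, 760, 800, 840])
rates  = np.array([7.3, 7.1, 6.9, 6.7, 6.4, 6.1, 5.8, 5.5, 5.2, 4.9])

train, test = slice(0, 8), slice(8, None)          # simple 80/20 split
a, b = np.polyfit(scores[train], rates[train], 1)  # build: fit y = a * x + b

predicted = a * scores[test] + b                   # test on unseen data
error = np.abs(predicted - rates[test]).mean()     # evaluate: actual vs. predicted
print(f"rate ≈ {a:.4f} * score + {b:.2f}, mean error = {error:.3f}")
```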
3.2 Learning Algorithms
Choose the right algorithm for the problem.
Examples:
Classification: Pass/Fail
o Use Decision Trees, k-NN, Neural Networks
Regression: Predicting numbers (e.g., interest rate)
Clustering: Grouping customers
Association: Market basket analysis
3.3 Model Evaluation
Check how good your model is.
If it only remembers training data, it's overfitting
It should work well on new, unseen data
Use test data to evaluate.
3.4 Ensemble Modeling
Combine multiple models to get better results.
Example:
Use 3 models (Tree, Logistic, k-NN)
If 2 say “Pass” and 1 says “Fail”, the final result is “Pass”
Types:
Bagging: Builds on random subsets (e.g., Random Forest)
Boosting: Fixes errors in steps (e.g., XGBoost)
Stacking: Combines models using another model
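A minimal sketch of the majority vote described in the example; the three predictions stand in for the Tree, Logistic, and k-NN outputs:

```python
from collections import Counter

predictions = ["Pass", "Pass", "Fail"]             # one vote per model
final = Counter(predictions).most_common(1)[0][0]  # most frequent answer wins
print(final)  # -> Pass
```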
4. Application (Deployment)
Deploy the model to use it in real-world systems.
4.1 Production Readiness
Example:
A bank uses the model to approve loans instantly
Model must be:
o Fast
o Always ready
Or in batch systems (e.g., customer grouping), speed is not critical.
4.2 Technical Integration
Models built in tools like Python or R must be integrated into apps or software.
4.3 Response Time
Some models are fast, some are slow:
| Model | Build Time | Prediction Speed |
| --- | --- | --- |
| k-NN | Fast | Slow |
| Tree | Slow | Fast |
4.4 Model Refresh
The model must be updated regularly because the world changes.
Set a schedule: monthly, quarterly
Retrain if error is high
4.5 Assimilation
Sometimes, we don’t use code—we just use insights.
Example:
If model shows:
People who buy bread also buy butter
Stores use that info to place them together — no code needed.
5. KNOWLEDGE
Data science is not just about analyzing numbers — it’s about finding valuable insights that
can help solve real-world problems.
Think of it as a smart toolbox that helps you:
o Ask better questions
o Find patterns that are not obvious
o Make smarter decisions
Example: Supermarket Sales
👉 Basic Insight (Normal Data Analysis):
Let’s say a supermarket checks its sales data.
They see that milk and bread sell more on weekends.
This is obvious — and can be found using tools like:
o Excel charts
o Simple reports
👉 Deeper Knowledge (Using Data Science):
To find more useful or hidden patterns, we need advanced tools and techniques, such as:
o Data science algorithms
o Machine learning
o Advanced statistical methods
Key Idea: From Prior Knowledge to Posterior Knowledge
The data science process begins with prior knowledge (what we already know or assume).
It ends with posterior knowledge (new insights we gain after analyzing data).
Example:
Prior: “Maybe people buy butter with bread.”
Posterior: “70% of bread buyers also buy butter on Saturdays between 5–7 PM.”
Prior Knowledge
What we believe before analyzing data
Example:
People who buy milk may also buy butter
This is a guess based on experience or common sense.
Posterior Knowledge
What we find after analyzing data
Example:
70% of milk buyers also buy butter (based on data)
This is a proven insight.
Data Exploration (Exploratory Data Analysis - EDA)
🔹 What is Data?
The word "data" comes from the Latin word dare, which means "something given" — like an
observation or a fact.
Interestingly, the Sanskrit word "dAta" also means “given”.
🔹 What is Data Exploration?
Data Exploration (EDA) is the first step when you start working with a dataset.
It helps you understand the data before using advanced tools like machine learning or
statistics.
🔍 What do we do during Data Exploration?
1. Understand the structure of the data
(What kind of data is there? How is it arranged?)
2. Summarize main features
(What are the key values and stats?)
3. Find patterns, relationships, and outliers
(Is there something unusual or interesting?)
🔹 Why is it important?
It helps you prepare data for further analysis.
You can find issues or insights early that will help in later steps.
🔹 Two Main Types of Data Exploration
1. Descriptive Statistics
→ Using numbers to describe data.
Common terms:
o Mean – average value
o Standard Deviation – how spread out values are
o Correlation – how two values are related
2. Data Visualization
→ Using charts and graphs to "see" the data.
o Like bar charts, pie charts, scatter plots, etc.
o Helps spot trends, patterns, or issues.
Both methods are used together in data science.
🎯 Objectives of Data Exploration
Data exploration isn’t just done at the beginning — it is used throughout the data science process.
✅ 1. Data Understanding
Get a quick summary of:
What columns/attributes are there?
What are the usual (typical) values?
Are there any outliers (extremely high or low values)?
How are attributes related to each other?
📌 Example:
For a house price dataset:
Attribute: Price
Typical value: ₹50 lakhs
Outlier: ₹5 crores — this is way higher than others
Relationship: Bigger houses may have more bedrooms
👉 This step helps you know what kind of data you’re working with before applying algorithms.
✅ 2. Data Preparation
Before using machine learning:
You need to clean and organize your data.
Data exploration helps find:
o Missing values
o Outliers
o Strong correlations (some algorithms don’t work well if attributes are too closely
related)
📌 Example:
In a customer dataset:
Missing value in Age → should we fill it with average?
Outlier: One customer spent ₹10 lakhs while others spent ₹1,000–₹10,000
Correlation: "Height" and "Weight" may be closely related, so using both might be
unnecessary.
👉 You then fix, remove, or adjust such data.
✅ 3. Supporting Data Science Tasks
Sometimes, simple charts can give big insights.
Scatter plots or bar charts may help you:
o Find clusters (groups)
o Visually create rules for classification or prediction
📌 Even without using complex models, you can discover useful patterns.
✅ 4. Interpreting Results
After building your model, data exploration helps you:
Understand the model’s output
Visualize predictions
See error rates or boundaries
📌 Examples:
Histogram: Shows how predicted prices are spread
Error Rate Plot: Shows how far predictions are from actual values
Box Plot: Helps you see which groups were misclassified more often
👉 Without such visuals, model results can be hard to understand.
📊 DATASETS
🔹 What is a Dataset?
A dataset is a collection of related data that is arranged in a structured format, like in a
table (with rows and columns).
It is used in data science for:
o ✅ Analysis (to understand patterns),
o ✅ Modeling (to build predictive models),
o ✅ Training machine learning algorithms,
o ✅ Drawing insights (to make decisions).
🔹 Example: Iris Dataset
The Iris dataset is a famous dataset used in data science and machine learning.
Introduced by Ronald Fisher in 1936 in a scientific paper.
It contains measurements of different types of iris flowers.
📂 Types of Data in a Dataset
Every column in a dataset is also called an attribute or feature, and each has a data type.
🧠 Why is data type important?
It helps you decide:
✅ What kind of operations you can do (math or not)
✅ What kind of visualizations are suitable (chart, graph, etc.)
✅ What algorithms can be used
✅ Whether you need to convert it to a different type
🔸 1. Numeric or Continuous Data
Used to represent measurable quantities (you can do math with them).
a) Continuous Data
Can take infinite values between any two numbers
Often includes decimals
📌 Examples:
Temperature: 32.5°C, 101.45°F
Height: 170.2 cm
b) Integer Data
Only whole numbers (no decimals)
Often used for counting things
📌 Examples:
Number of children: 2, 3
Number of orders: 15
Days below 0°C: 10
🔸 2. Categorical or Nominal Data
Used for labels or names. You can’t do math with them.
a) Nominal Data
No order or ranking — just names or categories
📌 Examples:
Eye colour: Blue, Green, Brown
Gender: Male, Female
b) Ordinal Data (Ordered Nominal)
Has a specific order, but the difference between values is not exact
📌 Examples:
Customer reviews: Poor < Average < Good < Excellent
Education level: High School < UG < PG
🌟 Descriptive Statistics
🔹 What is Descriptive Statistics?
It is the process of summarizing and describing the most important features of a dataset.
Just like you summarize a story into key points, in data science we summarize large data to
make it easier to understand.
✅ Real-life Examples:
Average income in a city
Median house price in an area
Range of marks in a class
Average credit scores of people
🔸 Types of Descriptive Statistics
They are mainly of two types based on how many variables you're analyzing:
1. Univariate Exploration
"Uni" means one → It means studying one variable at a time.
Example: In the Iris dataset, you can check just Sepal Length alone.
Ways to describe a single variable:
🔹 1. Central Tendency
These help to know the center or typical value in your data.
| Term | Meaning | Example |
| --- | --- | --- |
| Mean (Average) | Add all values and divide by the number of values | 60, 70, 80, 90, 100 → (60+70+80+90+100)/5 = 80 |
| Median | Middle value when data is arranged in order | 60, 70, 80, 90, 100 → Median = 80; 60, 70, 80, 90 → Median = (70+80)/2 = 75 |
| Mode | Most frequent value | 80, 80, 70, 90, 100 → Mode = 80 |
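These numbers can be reproduced with Python's built-in statistics module:

```python
import statistics

print(statistics.mean([60, 70, 80, 90, 100]))    # 80
print(statistics.median([60, 70, 80, 90, 100]))  # 80 (odd count: middle value)
print(statistics.median([60, 70, 80, 90]))       # 75 (even count: average of middle two)
print(statistics.mode([80, 80, 70, 90, 100]))    # 80 (most frequent value)
```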
🔹 2. Dispersion (Spread of Data)
These tell us how spread out or different the values are.
| Term | Meaning | Example |
| --- | --- | --- |
| Range | Highest - Lowest | Temp: 110°F and 30°F → Range = 80°F |
| Variance | Average of squared differences from the mean | Tells how far values are from the average |
| Standard Deviation (SD) | Square root of variance | Higher SD = data is more spread out |
| Max/Min | Highest and lowest values | Useful to know the extremes |
| Quartiles / Interquartile Range | Divide data into parts to study spread | Helps spot outliers |
🔹 3. Count / Null Count
Count = How many values are present
Null count = How many missing (empty) values are there
🔹 4. Shape of Distribution
Tells how data looks visually:
| Term | Meaning |
| --- | --- |
| Symmetry | Data is balanced on both sides of the middle |
| Skewness | Data leans left or right (due to outliers) |
| Kurtosis | How sharp or flat the peak of the graph is |
🔹 5. Five-Number Summary
It’s a simple way to describe how the data is spread.
The 5 numbers are:
1. Minimum – Smallest value
2. Q1 (First Quartile) – 25% point
3. Q2 (Median) – Middle value (50%)
4. Q3 (Third Quartile) – 75% point
5. Maximum – Largest value
Example:
Dataset: 4, 7, 9, 10, 15, 18, 21, 25, 29, 35
Minimum = 4
Q1 = 9
Median (Q2) = (15+18)/2 = 16.5
Q3 = 25
Maximum = 35
Why it helps:
Quickly shows how data is spread
Detects outliers
Used in box plots
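A minimal sketch computing the five-number summary of the dataset above, using the same median-of-each-half method as the worked example:

```python
import statistics

data = [4, 7, 9, 10, 15, 18, 21, 25, 29, 35]
half = len(data) // 2
lower, upper = data[:half], data[-half:]  # lower and upper halves

print(min(data))                 # Minimum = 4
print(statistics.median(lower))  # Q1 = 9
print(statistics.median(data))   # Q2 = 16.5
print(statistics.median(upper))  # Q3 = 25
print(max(data))                 # Maximum = 35
```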
2. Multivariate Exploration
Here, we analyze more than one variable at the same time.
Why it’s useful:
Understand the relationship between different features
Find patterns
Useful for predictive modeling like classification or regression
🔹 Central Data Point (Mean Point)
We can find an average data point by calculating the mean of each attribute.
Example:
o Mean = {5.006, 3.418, 1.464, 0.244}
o This "average flower" may not exist, but gives a typical idea.
🔹 Correlation
Correlation tells how two variables change together.
| Example | Correlation Type |
| --- | --- |
| Temperature ↑ → Ice cream sales ↑ | Positive correlation |
| Temperature ↑ → Jacket sales ↓ | Negative correlation |
⚠️ But remember:
Correlation ≠ Causation
Example:
Ice cream sales and shark attacks both rise in summer.
They’re correlated, but one doesn't cause the other.
The real reason is: summer season.
🔹 Pearson Correlation Coefficient (r)
| Value | Meaning |
| --- | --- |
| +1 | Perfect positive relationship |
| 0 | No linear relationship |
| –1 | Perfect negative relationship |
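A minimal NumPy sketch of computing r; the temperature and ice-cream figures are invented to show a strong positive value:

```python
import numpy as np

temperature = np.array([20, 24, 28, 31, 35])           # invented °C readings
ice_cream_sales = np.array([110, 140, 190, 230, 260])  # invented daily sales

r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(round(r, 3))  # close to +1: strong positive relationship
```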
🔹 Visualization
Graphs should be the first step in checking relationships between variables.
Helps spot:
o Outliers
o Non-linear patterns
Common tool: Scatter plots
📊 DATA VISUALIZATION (In Simple Words)
✅ What is Data Visualization?
Data visualization means showing data as pictures like charts, graphs, or maps.
It is used to explore and understand big sets of data easily.
It helps to see patterns, trends, or unusual data that are hard to notice in just numbers.
🎯 Why Do We Use Data Visualization?
1. To Understand Large Data
Big data tables are hard to read.
A simple graph can summarize thousands of values.
Example: A line chart showing sales every month for 5 years can easily show if sales are
increasing or decreasing.
2. To Find Relationships
By using X and Y axes (like in graphs), we can check if two things are related.
Example: A scatter plot of study time vs. exam scores can show if studying more gives better
results.
3. Because the Human Brain Likes Visuals
Our brain is good at spotting visual patterns.
So, using charts helps us quickly notice clusters, trends, or unusual points.
1️⃣ Univariate Visualization
(Studying one column or attribute at a time)
These techniques help us understand how the values of one variable are spread out or grouped.
📦 Histogram
A histogram is a type of bar chart that shows how many times a value or a range of values
occurs.
X-axis: Value range (called bins)
Y-axis: Frequency (how often it appears)
Helps us see the shape of data (spread, peaks, etc.)
Example:
Bin 4.0–4.5 cm: How many flowers have petal length in this range?
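A minimal matplotlib sketch of such a histogram; the petal lengths here are made-up stand-ins, not the real Iris measurements:

```python
import matplotlib.pyplot as plt

petal_lengths = [1.2, 1.4, 1.5, 3.9, 4.2, 4.4, 4.7, 5.1, 5.6, 6.0]  # invented

plt.hist(petal_lengths, bins=[1, 2, 3, 4, 5, 6, 7])  # X-axis: value ranges (bins)
plt.xlabel("Petal length (cm)")
plt.ylabel("Frequency")  # Y-axis: how many values fall in each bin
plt.show()
```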
✅ Stratified Histogram (Iris Dataset Example)
Shows petal length for 3 different species of Iris flower:
Iris setosa (Blue)
o Petal length: ~1.0–1.9 cm
o Shortest petals
o Clearly separate from other types
Iris versicolor (Green)
o Petal length: ~3.0–5.0 cm
o Medium petals
o Overlaps with Iris virginica
Iris virginica (Red)
o Petal length: ~4.5–7.0 cm
o Longest petals
🔢 Quartiles
Quartiles divide the data into 4 equal parts.
It tells us where the middle and spread of the data are.
There are 3 main quartiles:
Q1 (Lower Quartile) – Middle of the lower half of the data (the 25% point)
Q2 (Median) – Middle of the full dataset (50%)
Q3 (Upper Quartile) – Middle of the upper half of the data (the 75% point)
Each quartile = 25% of the total data.
✍️ How to Find Quartiles:
Data Set (Math scores of 19 students in order):
59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
Step 1: Find Q2 (Median)
o 19 values → 10th value is the median
o Q2 = 75
Step 2: Find Q1 (Lower Quartile)
o First 9 numbers: 59 to 75
o Middle value (5th): Q1 = 68
Step 3: Find Q3 (Upper Quartile)
o Last 9 numbers: 76 to 98
o Middle value (5th): Q3 = 84
| Quartile | Value |
| --- | --- |
| Q1 | 68 |
| Q2 | 75 |
| Q3 | 84 |
📏 Interquartile Range (IQR)
IQR = Q3 - Q1
Shows the spread of the middle 50% of the data
Example:
IQR = 84 - 68 = 16
📊 Box Plots (Box and Whisker Plots)
A box plot is a chart used in data analysis to show how values are spread out (distribution)
and if the data is skewed (tilted more to one side).
It displays 5 key summary points of data:
✅ Minimum
✅ First quartile (Q1)
✅ Median
✅ Third quartile (Q3)
✅ Maximum
🔹 Explanation of Each Part:
Minimum Score:
The smallest value in the dataset (ignoring extreme outliers). It's shown at the end of the left
whisker.
Example: If scores are 10, 20, 25, 30, 100, the minimum (excluding outlier 100) is 10.
Lower Quartile (Q1):
25% of the values lie below this point.
Example: In 100 test scores sorted from low to high, the 25th score is Q1.
Median (Q2):
Middle value. 50% of data is above and 50% is below this point.
Example: If scores are 10, 20, 30, 40, 50, the median is 30.
Upper Quartile (Q3):
75% of the values lie below this point.
Example: In 100 test scores, the 75th score is Q3.
Maximum Score:
Highest value (excluding extreme outliers), shown at the end of the right whisker.
Example: In 10, 20, 25, 30, 40, 100 → max is 40 (100 is an outlier).
Whiskers:
Lines that go from Q1 to the minimum, and from Q3 to the maximum. They show the spread
of the lower 25% and upper 25% values.
Interquartile Range (IQR):
The range between Q1 and Q3. It covers the middle 50% of the data.
Formula: IQR = Q3 - Q1
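A minimal matplotlib sketch drawing a box plot of the 19 math scores from the quartile example earlier, so the box should span Q1 = 68 to Q3 = 84 with the median line at 75:

```python
import matplotlib.pyplot as plt

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

# Box = Q1..Q3, inner line = median; whiskers reach the most extreme
# points within 1.5 × IQR of the box (here, 59 and 98).
plt.boxplot(scores, vert=False)
plt.xlabel("Math score")
plt.show()
```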
📈 Distribution Chart
Used for continuous numeric attributes like petal length in flowers.
Instead of showing each value, we use a normal distribution curve (also called bell curve).
🔹 Bell Curve Basics:
Looks like a bell (symmetrical around the mean).
Most data points are close to the mean (average).
Fewer data points appear far away (extreme values).
🔹 Formula:
Normal distribution is calculated using a formula with:
o μ (mu) = mean
o σ (sigma) = standard deviation
o x = value being evaluated
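Written out with these symbols, the normal-density (bell curve) formula is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$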
🔹 Example:
Petal lengths of three species (I. setosa, I. versicolor, I. virginica) are plotted using three bell
curves.
This helps to compare how different the petal lengths are across species.
🔺 Multivariate Visualization
Multivariate means showing 3 or more variables together to see how they relate.
🔸 Types:
1. Univariate – 1 variable
Example: Just “Study Hours”
2. Bivariate – 2 variables
Example: “Study Hours” vs. “Exam Score”
3. Multivariate – 3+ variables
Example: “Study Hours”, “Attendance”, “Sleep Time”, “Exam Score” together
🔹 Scatterplot
A scatterplot is a graph that shows the relationship between two continuous variables using dots.
Each dot = 1 data point.
🔸 Example:
Want to check if Study Hours affect Exam Score
o X-axis: Study Hours
o Y-axis: Exam Score
What we can observe:
1. Correlation (Relationship)
o Positive correlation: More study → higher marks
o Negative correlation: More study → lower marks
o No pattern: No relation
2. Clusters
o Dots form groups → may show different types of students (e.g., regular vs. irregular)
3. Outliers
o A dot far from the rest.
Example: A student studies 5 hrs/day but scores only 40%. Something may be wrong.
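A minimal matplotlib sketch of this exact example, with made-up numbers and one deliberate outlier:

```python
import matplotlib.pyplot as plt

study_hours = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5]    # invented
exam_scores = [45, 52, 58, 63, 70, 78, 85, 40]  # last point: 5 hrs but only 40%

plt.scatter(study_hours, exam_scores)  # each dot = one student
plt.xlabel("Study hours per day")
plt.ylabel("Exam score (%)")
plt.show()
```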
🔹 Scatter Multiple
An upgraded scatterplot that shows more than two dimensions.
🔸 Example:
X-axis: Study Hours
Y-axis: Two variables —
o Exam Score (green dots)
o Attendance (blue dots)
🟢 Useful when comparing multiple features against a single feature like "Study Hours".
🔹 Scatter Matrix
A scatter matrix shows scatterplots for all possible pairs of variables in a dataset.
🔸 Example:
Iris dataset with 4 attributes → 16 mini scatter plots.
Diagonal shows a feature compared with itself.
Color of dots shows species of flower (e.g., Setosa, Versicolor, Virginica).
✅ Good for comparing multiple variables and spotting patterns quickly.
🔹 Bubble Chart
A bubble chart is like a scatterplot but adds one more variable by using bubble size.
🔸 Example (Iris Dataset):
X-axis: Petal Length
Y-axis: Petal Width
Bubble size: Sepal Width
Color: Flower species
🟢 Shows 3 variables at once (X, Y, and size).
🔹 Density Chart
Like a scatterplot but adds background color to show another dimension.
Can show up to 4 dimensions.
🔸 Example:
X-axis: Petal Length
Y-axis: Sepal Length
Color of background: Sepal Width
Color of data point: Flower Species
🟢 Useful for visualizing complex data with multiple features.