💡 What is Data Science? – Explained Simply

 Data Science is a process that uses different techniques to understand and find useful
meaning from data.

 It helps us find patterns, relationships, and hidden information in both small and large sets
of data.

 These techniques include mathematics, statistics, and computer science to make better
decisions.

 The word “science” means it is based on proper steps and past experiences (not just
guesses).

 So, Data Science is the smart use of data to solve real-world problems.

⭐ Features of Data Science

1. Finding Meaningful Patterns in Data

 It helps us find new, useful, and understandable patterns.

 These patterns can be used to make decisions and should work for both:

o The current data

o New or future data

 These patterns are found through several steps and are often new or unknown before.

 Once we get a useful pattern, it helps users take better actions or improve results.

➤ Tools & Techniques Used:

Tool | What It Does | Example
Data Mining | Finds patterns using machine learning & statistics | A bank finds fraud patterns from past transactions
Statistical Analysis | Finds relationships and trends in data | A company studies a sales increase due to offers
Machine Learning Models | Learn from data to predict or classify | Netflix recommends shows you may like
Data Visualization | Shows patterns using graphs/charts | Sales chart over time
Feature Engineering | Chooses the best input data to improve models | Selecting age, income, location to predict spending
Pattern Recognition | Detects regular and unusual trends | Recognizing handwriting or voice commands

✅ Real-Life Examples:

 Grouping online customers by their buying habits.


 Detecting fake transactions using spending patterns.

 Predicting machine breakdowns in factories.

 Finding which items are bought together (like bread & butter).

2. Building Models to Represent the System

 Models explain how different things in the system are related.

 It helps in understanding how changes in one factor affect others.

Example:
A loan model may show how your income and credit score affect loan approval and interest rates.

Once the model is ready, it can:

 Predict outcomes for new cases

 Explain how inputs (like age, salary) affect the result (like loan approval)

3. Mix of Statistics, Machine Learning & Computing

 Data Science brings together:

o Statistics – to analyze data and find conclusions

o Machine Learning – to build models that learn and predict

o Computing – to handle and process big amounts of data quickly

Example:
Amazon uses all three to:

 Store customer data (computing)

 Analyze their shopping behavior (statistics)

 Suggest products (machine learning)

4. Learning Algorithms (Machine Learning)

These are smart programs that learn from past data and improve over time.

Type | Description | Examples
Supervised Learning | Learns from labeled data (with correct answers) | Predicting house prices using area, location, etc.
Unsupervised Learning | Learns from data without labels; finds hidden patterns | Grouping customers by shopping behavior
Reinforcement Learning | Learns by trying things and getting rewards/punishments | A robot learning to walk or play a game

Benefits of Learning Algorithms:

 Work well with big, complex data

 Provide accurate predictions

 Get better as they learn more

 Used in many fields like health, finance, media, etc.

5. Other Related Fields of Data Science

a. Descriptive Statistics

 Summarizes and describes data using averages, percentages, etc.

 Helps to understand basic structure of data.

Example:
Finding the average marks of students in a class.

b. Exploratory Visualization

 Uses graphs and charts to explore data and find trends.

 Helps spot patterns that aren’t clear in numbers.

Example:
A line graph showing monthly sales increase.

c. Dimensional Slicing (OLAP – Online Analytical Processing)

 Cuts and filters data based on dimensions like region, time, or category.

 Helps in deep analysis by comparing slices of data.

Example:
Sales data sliced by month and region to find the best-performing area.

d. Hypothesis Testing

 A method to test ideas or assumptions about data.

 Helps to know if a claim about data is true or false.

Steps:

1. Make a statement (like: "more ads increase sales")


2. Pick a test method

3. Check with data

4. Accept or reject the idea based on results

Example:
Testing if a new teaching method improves student scores.
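
As a rough illustration (not part of the original notes), here is a minimal sketch of the steps above for the teaching-method example, using made-up score lists and SciPy's independent-samples t-test, which is just one of several possible test methods:

```python
# Hypothesis test sketch: does the new teaching method improve scores?
# The score lists below are made-up illustrative data.
from scipy import stats

old_method = [62, 65, 70, 68, 64, 66, 71, 63]   # scores with the old method
new_method = [70, 74, 69, 78, 72, 75, 77, 73]   # scores with the new method

# Step 2 & 3: pick a test (independent-samples t-test) and check with the data
t_stat, p_value = stats.ttest_ind(new_method, old_method)

# Step 4: accept or reject the idea based on the result
if p_value < 0.05:
    print("Reject the 'no difference' assumption: the new method likely helps.")
else:
    print("Not enough evidence that the new method improves scores.")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```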

e. Data Engineering

 Deals with storing, preparing, and moving data to make it ready for analysis.

 It uses big data tools like:

o Hadoop – stores large data

o Spark – processes data fast

o Kafka – handles real-time data transfer

Example:
Setting up systems to collect website visitor data and prepare it for analysis.

f. Business Intelligence (BI)

 Uses tools to understand past and current data.

 Helps companies take smart decisions using dashboards and reports.

 Combining BI with Data Science helps in understanding:

o What happened (past)

o What can happen (future)

Example:
A manager views a sales dashboard to plan next month’s marketing.

DATA SCIENCE – CLASSIFICATION

 Data science problems are mainly divided into two types: supervised learning and
unsupervised learning.

 Supervised learning means the machine is trained using data that already has answers
(labels).
Each input has a known output.

 Unsupervised learning means the machine is given data without any answers.
It tries to find hidden patterns or groups on its own.

Supervised Learning

 Just like a teacher helps a student, here we guide the machine using labelled data.

 The machine learns from example data and then makes predictions.

Example:

 Suppose we have a basket of fruits.

 The machine looks at a fruit’s shape, colour, and texture.

 It compares it with fruits it already learned about.

 If the new fruit looks like an apple, it says, “This is an apple.”

Uses of Supervised Learning:

1. Image classification: Tells if a photo is of a dog, cat, car, etc.

2. Medical diagnosis: Helps doctors detect diseases using reports or scans.

3. Fraud detection: Catches fake transactions in banks.

4. NLP (Natural Language Processing): Used in translation, summarizing text, and


understanding emotions in sentences.

Unsupervised Learning

 Here, there is no teacher or label to guide the machine.

 The machine just looks at the raw data and tries to find patterns, similarities, or groups.

Example:

 If we give animal data without names, the machine might group them based on their
features like size, diet, or behaviour — even if it doesn’t know the names.

Types of Data Science Problems:

1. Classification

2. Regression

3. Clustering

4. Association Analysis

5. Anomaly Detection

6. Recommendation Engines

7. Feature Selection

8. Time Series Forecasting

9. Deep Learning
10. Text Mining

1. Classification

 Predicts categories like Yes/No, Spam/Not Spam, etc.

 The output is a label, not a number.

Example:

 A bank checks a person’s age, income, credit score to decide:

o Output: Yes (Give Loan) or No (Don’t give)

Used in:

 Email spam filters

 Disease detection (COVID or not)

 Predicting student results (Pass/Fail)
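
A minimal sketch of the loan example, using made-up applicant data and scikit-learn's DecisionTreeClassifier (one of many possible classification algorithms):

```python
# Classification sketch: predict "give loan?" (Yes/No) from made-up applicant data.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, income, credit_score]; labels are the known answers.
X = [[25, 30000, 600], [40, 80000, 750], [35, 50000, 680],
     [22, 20000, 550], [50, 90000, 800], [30, 40000, 620]]
y = ["No", "Yes", "Yes", "No", "Yes", "No"]

model = DecisionTreeClassifier().fit(X, y)    # learn from labeled examples
print(model.predict([[28, 45000, 700]]))      # predicted label for a new applicant
```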

2. Regression

 Predicts a number (not category).

Example:

 A company wants to predict house prices.

o Input: Area, number of rooms, location

o Output: ₹30,00,000

Used in:

 Predicting sales

 Temperature forecasting

 Calculating interest rates
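
A minimal sketch of the house-price example, with made-up figures and scikit-learn's LinearRegression (one possible regression algorithm):

```python
# Regression sketch: predict a house price (a number) from made-up data.
from sklearn.linear_model import LinearRegression

# Each row: [area in sq. ft., number of rooms]; prices are in rupees.
X = [[800, 2], [1000, 2], [1200, 3], [1500, 3], [1800, 4]]
y = [2_000_000, 2_500_000, 3_000_000, 3_600_000, 4_200_000]

model = LinearRegression().fit(X, y)
print(model.predict([[1300, 3]]))   # predicted price for a new house
```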

3. Clustering

 Groups similar things without using any labels (unsupervised learning).

Example:

 An online shop groups customers by buying habits:

o Group 1: Buys baby items

o Group 2: Buys electronics

o Group 3: Buys groceries


Used in:

 Market segmentation

 Targeting customers

 Grouping similar news/articles

Note:
Since the machine finds the groups by itself, we need to understand and name the groups later.
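
A minimal sketch of grouping customers by spending, using made-up numbers and k-means (scikit-learn's KMeans), one common clustering algorithm:

```python
# Clustering sketch: group customers by buying habits (no labels given).
from sklearn.cluster import KMeans

# Each row: made-up yearly spend on [baby items, electronics, groceries].
X = [[900, 50, 300], [850, 70, 250], [20, 1200, 100],
     [10, 1500, 150], [30, 40, 900], [25, 60, 950]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster number for each customer, e.g. [0 0 1 1 2 2]
```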

4. Association Analysis (Market Basket Analysis)

 Finds which items are often bought together.

Example:

 People who buy bread also buy butter.


→ Store puts them together or gives a combo offer.

Used in:

 Grocery store combos

 Amazon "Frequently bought together"

 Suggesting matching products
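
A rough sketch of the idea behind market basket analysis: counting which item pairs appear together, using made-up baskets. Real projects usually use dedicated algorithms such as Apriori or FP-Growth.

```python
# Association sketch: count how often item pairs appear together in baskets.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Most frequently co-purchased pairs (e.g. bread & butter)
print(pair_counts.most_common(3))
```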

5. Anomaly Detection (Outlier Detection)

 Finds data that looks unusual or suspicious.

Example:

 A person in India suddenly uses their credit card in France.


→ May be fraud → System blocks the card.

Used in:

 Catching fraud

 Detecting network security threats

 Health alerts (like abnormal heartbeat)
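
A minimal sketch of one simple approach: flagging values that sit far from the average (a z-score check), using made-up transaction amounts. Real systems use more sophisticated models.

```python
# Anomaly detection sketch: flag unusually large card transactions.
import statistics

amounts = [500, 750, 620, 480, 900, 700, 650, 25000]   # one suspicious value

mean = statistics.mean(amounts)
sd = statistics.stdev(amounts)

for amount in amounts:
    z = (amount - mean) / sd          # how many standard deviations from the mean
    if abs(z) > 2:                    # a common (but adjustable) cut-off
        print(f"Possible anomaly: {amount} (z = {z:.1f})")
```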

6. Recommendation Engines

 Suggests things based on what you or others like.

Example:

 Netflix says: "Because you watched Money Heist..."

 Amazon: "Customers who bought this also bought..."


Used in:

 Online shopping (Amazon)

 Music and movies (Spotify, Netflix)

 Food apps (Zomato, Swiggy)

7. Feature Selection

 Picks only the important inputs to use in a model.


This makes the model better and faster.

Example:

 To predict marks:

o Useful: Hours studied, attendance

o Not useful: Favourite colour


→ "Favourite colour" is removed

Used in:

 Making accurate models

 Saving time and memory

 Ignoring useless data

8. Time Series Forecasting

 Predicts future values using past time-based data.

Example:

 Shopkeeper checks last 12 months’ sales to guess next month’s sales.

Used in:

 Stock market predictions

 Weather reports

 Planning business needs
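
A minimal sketch of the shopkeeper example, assuming made-up monthly figures and the simplest possible forecast (a 3-month moving average); real projects often use methods such as exponential smoothing or ARIMA.

```python
# Time-series forecasting sketch: average the last 3 months to guess next month.
monthly_sales = [120, 130, 125, 140, 150, 145, 160, 170, 165, 180, 190, 185]

window = monthly_sales[-3:]                 # last 3 months
forecast = sum(window) / len(window)
print(f"Forecast for next month: {forecast:.0f} units")
```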

9. Deep Learning

 A smart method where machines learn from big, complex data like images, videos, and
speech.
It uses neural networks.

Examples:
 Phone unlocks using face

 Voice assistants understand speech (Google Assistant)

 ChatGPT understands and replies like a human

Used in:

 Face recognition

 Alexa, Siri, Google Assistant

 Self-driving cars

10. Text Mining (Text Analytics / NLP)

 Helps machines understand and use text like reviews, emails, or chats.

Example:

 A company reads 1000 customer reviews to see:

o Are people happy or angry?

o Which product has problems?

Used in:

 Understanding customer emotions

 Detecting spam messages

 Chatbots and virtual assistants
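
A very rough sketch of the review example, using a hand-made word list over made-up reviews; real text-mining systems use NLP libraries and trained models rather than fixed word lists.

```python
# Text-mining sketch: a tiny happy/angry word count over made-up reviews.
reviews = [
    "Great product, very happy with the quality",
    "Terrible battery, very disappointed",
    "Good value and fast delivery",
]

positive_words = {"great", "happy", "good", "fast"}
negative_words = {"terrible", "disappointed", "slow", "bad"}

for review in reviews:
    words = set(review.lower().split())
    score = len(words & positive_words) - len(words & negative_words)
    mood = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(f"{mood}: {review}")
```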


1. Generic Data Science Process

Data science is a step-by-step method to find useful patterns and relationships in data. It’s done
through a process that keeps repeating until the best results are found.

Steps in the Process:

1. Understand the Problem – Know what you're trying to solve.

2. Prepare the Data – Collect and clean the data.

3. Build the Model – Create a model using the data.

4. Test the Model – Apply it to see how it performs in real-world-like data.

5. Deploy the Model – Use it in real applications and keep improving it.

CRISP-DM Framework:
 CRISP-DM (Cross Industry Standard Process for Data Mining) is the most commonly used
structure for data science.

 It was made by companies that worked on data mining.

 Other frameworks include:

o SEMMA: Sample, Explore, Modify, Model, Assess (from SAS)

o DMAIC: Define, Measure, Analyze, Improve, Control (used in Six Sigma)

o KDD Process: Selection, Preprocessing, Transformation, Data Mining, Interpretation, and Evaluation

1. Prior Knowledge

What it Means:

 It is the knowledge we already have before starting a data science project.

 This includes:

o Business knowledge (e.g., how loans work)

o Data understanding (e.g., what credit scores mean)

Why It’s Important:

 Helps ask the right questions.

 Helps collect the right data.

1.1 Objective – What is the Goal?

Every project starts with a question.

Example:
A loan company wants to know:

“Can we predict interest rates for a new customer based on their credit score?”

So, the goal is:

“Predict interest rate using credit score.”

A clear objective helps in:

 Choosing the right data

 Picking the correct model

1.2 Subject Area – Know the Business

Knowing the business context is essential.


Example:

 A person with a high credit score → Low interest rate

 A person with a low credit score → High interest rate

So, we know:

 Credit score affects interest rate

 We need to collect both values for past customers

1.3 Understand the Dataset

You must understand:

 How the data was collected

 If data is complete or missing

 If the amount and quality of data are good

Terms:

 Dataset: A table (rows = data points, columns = attributes)

 Data Point: A single row (e.g., person with credit score 500 and rate 7.31%)

 Attribute: A column (e.g., Credit Score)

 Label: What we want to predict (e.g., Interest Rate)

 Identifier: Just used to identify, not for prediction (e.g., Borrower ID)

Example Table:

Borrower ID | Credit Score | Interest Rate
01 | 500 | 7.31%
02 | 600 | 6.70%
11 (New) | 625 | ??

We use past data to predict the interest rate for a new borrower with a credit score of 625.

1.4 Correlation vs. Causation

 Correlation: Two things move together.

o Example: Higher credit score → Lower interest rate.

 Causation: One thing directly causes the other.

You must be careful not to mix them.

Wrong Example:
 Predicting credit score based on interest rate — doesn’t make business sense.

So, always frame the question correctly using domain knowledge.

2. Data Preparation

Before using data, we must clean and organize it. This step is time-consuming but important.

2.1 Data Exploration

Helps understand what the data looks like.

Tools used:

 Descriptive statistics: mean, median, min, max

 Graphs: scatter plots, histograms

Example:

As credit score increases, interest rate decreases

2.2 Data Quality

Ensure the data is accurate.

Problems:

 Duplicate data

 Wrong values (e.g., score = 900, but max is 850)

 Typing mistakes

Solutions:

 Remove duplicates

 Fix or delete wrong values

 Replace missing values

2.3 Missing Values

Sometimes, some data is missing.

Solutions:

 Fill missing values with average/min/max

 Remove rows with missing data (if not many)

 Use models that can handle missing data


Example:
If credit score is missing:

 Fill with average score

 Or remove that row

2.4 Data Types and Conversion

Different data types:

 Numeric: Credit score (600)

 Categorical: Credit rating (Good, Poor)

Conversion:

 “Good” → 600

 “Excellent” → 800

Some models only work with numbers, so categories may need to be converted to numeric codes; the reverse step, grouping numeric values into categories, is called binning.
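
A minimal sketch of both directions; the rating-to-score mapping and the bin boundaries below are made-up assumptions:

```python
# Conversion sketch: categories -> numbers, and numbers -> categories (binning).
rating_to_score = {"Poor": 400, "Average": 550, "Good": 600, "Excellent": 800}

ratings = ["Good", "Excellent", "Poor"]
scores = [rating_to_score[r] for r in ratings]
print(scores)                      # [600, 800, 400]

def to_rating(score):              # binning: numeric score -> category
    if score >= 750:
        return "Excellent"
    if score >= 650:
        return "Good"
    if score >= 500:
        return "Average"
    return "Poor"

print([to_rating(s) for s in [625, 810, 480]])   # ['Average', 'Excellent', 'Poor']
```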

2.5 Transformation

Some models need data on the same scale.

Example:

 Credit score is 600

 Income is ₹1,00,000

Use normalization to scale both between 0 and 1 so one doesn't dominate the other.
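
A minimal sketch of min-max normalization; the value ranges assumed below (300–850 for credit score, 0–10 lakh for income) are only illustrative:

```python
# Normalization sketch: min-max scaling puts both attributes on a 0-1 scale
# so that the larger-valued one (income) does not dominate.
def min_max(value, lo, hi):
    return (value - lo) / (hi - lo)

credit_score = 600        # assumed range 300-850
income = 100_000          # assumed range 0-1,000,000 (rupees)

print(min_max(credit_score, 300, 850))     # about 0.55
print(min_max(income, 0, 1_000_000))       # 0.1
```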

2.6 Outliers

Outliers = very unusual data points.

Example:

 Income = ₹1 crore → May be real or error

 Height = 1.73 cm → Mistake

Action:

 Fix or remove outliers

Outlier detection is also useful in fraud detection.

2.7 Feature Selection


Pick the most useful attributes.

Example:

 "Income" and "Tax Paid" may give similar info → Keep one

Reduces model complexity and increases speed.

2.8 Data Sampling

Use a small part of the data that represents the whole dataset. Helps speed up testing and training.

3. Model

A model is a formula that shows how things relate.

Example:

Higher credit score → Lower interest rate

Steps:

1. Training Data – Used to build the model

2. Build Model – Create the formula (e.g., y = a * x + b)

3. Test Data – Check how well the model works

4. Evaluation – Compare actual vs predicted results

5. Final Model – Used in real-world application

3.1 Training and Testing Datasets

Split data into:

 Training Data: To build the model

 Test Data: To check accuracy
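
A minimal sketch, with made-up credit-score/interest-rate pairs, of splitting the data and scoring the model on the held-out test part (using scikit-learn's train_test_split):

```python
# Train/test split sketch: hold back part of the data to check the model honestly.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = [[500], [550], [600], [650], [700], [750], [800]]      # credit scores
y = [7.3, 7.0, 6.7, 6.2, 5.8, 5.4, 5.0]                    # interest rates (%)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)           # build on training data
print("R^2 on unseen test data:", model.score(X_test, y_test))
```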

3.2 Learning Algorithms

Choose the right algorithm for the problem.

Examples:

 Classification: Pass/Fail

o Use Decision Trees, k-NN, Neural Networks

 Regression: Predicting numbers (e.g., interest rate)

 Clustering: Grouping customers


 Association: Market basket analysis

3.3 Model Evaluation

Check how good your model is.

 If it only remembers training data, it's overfitting

 It should work well on new, unseen data

Use test data to evaluate.

3.4 Ensemble Modeling

Combine multiple models to get better results.

Example:

 Use 3 models (Tree, Logistic, k-NN)

 If 2 say “Pass” and 1 says “Fail”, the final result is “Pass”

Types:

 Bagging: Builds on random subsets (e.g., Random Forest)

 Boosting: Fixes errors in steps (e.g., XGBoost)

 Stacking: Combines models using another model
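
A minimal sketch of majority-vote ensembling for the Pass/Fail example, with made-up study data and scikit-learn's VotingClassifier combining the three model types mentioned above:

```python
# Ensemble sketch: three different models vote, the majority decides.
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[2, 60], [8, 90], [1, 40], [7, 85], [3, 55], [9, 95]]   # [study hours, attendance %]
y = ["Fail", "Pass", "Fail", "Pass", "Fail", "Pass"]

ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier()),
    ("logistic", LogisticRegression()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
ensemble.fit(X, y)
print(ensemble.predict([[6, 80]]))   # majority vote of the three models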

4. Application (Deployment)

Deploy the model to use it in real-world systems.

4.1 Production Readiness

Example:

 A bank uses the model to approve loans instantly

 Model must be:

o Fast

o Always ready

Or in batch systems (e.g., customer grouping), speed is not critical.

4.2 Technical Integration

 Models made in tools like Python or R must be added to apps or software


4.3 Response Time

Some models are fast, some are slow:

Model | Build Time | Prediction Speed
k-NN | Fast | Slow
Tree | Slow | Fast

4.4 Model Refresh

The model must be updated regularly because the world changes.

 Set a schedule: monthly, quarterly

 Retrain if error is high

4.5 Assimilation

Sometimes, we don’t use code—we just use insights.

Example:
If model shows:

People who buy bread also buy butter

Stores use that info to place them together — no code needed.

5. KNOWLEDGE

 Data science is not just about analyzing numbers — it’s about finding valuable insights that
can help solve real-world problems.

 Think of it as a smart toolbox that helps you:

o Ask better questions

o Find patterns that are not obvious

o Make smarter decisions

Example: Supermarket Sales

👉 Basic Insight (Normal Data Analysis):

 Let’s say a supermarket checks its sales data.

 They see that milk and bread sell more on weekends.


 This is obvious — and can be found using tools like:

o Excel charts

o Simple reports

👉 Deeper Knowledge (Using Data Science):

 To find more useful or hidden patterns, we need advanced tools and techniques, such as:

o Data science algorithms

o Machine learning

o Advanced statistical methods

Key Idea: From Prior Knowledge to Posterior Knowledge

 The data science process begins with prior knowledge (what we already know or assume).

 It ends with posterior knowledge (new insights we gain after analyzing data).

Example:

 Prior: “Maybe people buy butter with bread.”

 Posterior: “70% of bread buyers also buy butter on Saturdays between 5–7 PM.”

Prior Knowledge

What we believe before analyzing data

Example:

People who buy milk may also buy butter

This is a guess based on experience or common sense.

Posterior Knowledge

What we find after analyzing data

Example:

70% of milk buyers also buy butter (based on data)

This is a proven insight.

Data Exploration (Exploratory Data Analysis - EDA)

🔹 What is Data?
 The word "data" comes from the Latin word dare, which means "something given" — like an
observation or a fact.

 Interestingly, the Sanskrit word "dAta" also means “given”.

🔹 What is Data Exploration?

 Data Exploration (EDA) is the first step when you start working with a dataset.

 It helps you understand the data before using advanced tools like machine learning or
statistics.

🔍 What do we do during Data Exploration?

1. Understand the structure of the data


(What kind of data is there? How is it arranged?)

2. Summarize main features


(What are the key values and stats?)

3. Find patterns, relationships, and outliers


(Is there something unusual or interesting?)

🔹 Why is it important?

 It helps you prepare data for further analysis.

 You can find issues or insights early that will help in later steps.

🔹 Two Main Types of Data Exploration

1. Descriptive Statistics
→ Using numbers to describe data.
Common terms:

o Mean – average value

o Standard Deviation – how spread out values are

o Correlation – how two values are related

2. Data Visualization
→ Using charts and graphs to "see" the data.

o Like bar charts, pie charts, scatter plots, etc.

o Helps spot trends, patterns, or issues.

Both methods are used together in data science.


🎯 Objectives of Data Exploration

Data exploration isn’t just done at the beginning — it is used throughout the data science process.

✅ 1. Data Understanding

Get a quick summary of:

 What columns/attributes are there?

 What are the usual (typical) values?

 Are there any outliers (extremely high or low values)?

 How are attributes related to each other?

📌 Example:
For a house price dataset:

 Attribute: Price

 Typical value: ₹50 lakhs

 Outlier: ₹5 crores — this is way higher than others

 Relationship: Bigger houses may have more bedrooms

👉 This step helps you know what kind of data you’re working with before applying algorithms.

✅ 2. Data Preparation

Before using machine learning:

 You need to clean and organize your data.

 Data exploration helps find:

o Missing values

o Outliers

o Strong correlations (some algorithms don’t work well if attributes are too closely
related)

📌 Example:
In a customer dataset:

 Missing value in Age → should we fill it with average?

 Outlier: One customer spent ₹10 lakhs while others spent ₹1,000–₹10,000

 Correlation: "Height" and "Weight" may be closely related, so using both might be
unnecessary.

👉 You then fix, remove, or adjust such data.


✅ 3. Supporting Data Science Tasks

Sometimes, simple charts can give big insights.

 Scatter plots or bar charts may help you:

o Find clusters (groups)

o Visually create rules for classification or prediction

📌 Even without using complex models, you can discover useful patterns.

✅ 4. Interpreting Results

After building your model, data exploration helps you:

 Understand the model’s output

 Visualize predictions

 See error rates or boundaries

📌 Examples:

 Histogram: Shows how predicted prices are spread

 Error Rate Plot: Shows how far predictions are from actual values

 Box Plot: Helps you see which groups were misclassified more often

👉 Without such visuals, model results can be hard to understand.

📊 DATASETS

🔹 What is a Dataset?

 A dataset is a collection of related data that is arranged in a structured format, like in a table (with rows and columns).

 It is used in data science for:

o ✅ Analysis (to understand patterns),

o ✅ Modeling (to build predictive models),

o ✅ Training machine learning algorithms,

o ✅ Drawing insights (to make decisions).

🔹 Example: Iris Dataset

 The Iris dataset is a famous dataset used in data science and machine learning.
 Introduced by Ronald Fisher in 1936 in a scientific paper.

 It contains measurements of different types of iris flowers.

📂 Types of Data in a Dataset

Every column in a dataset is also called an attribute or feature, and each has a data type.

🧠 Why is data type important?

It helps you decide:

 ✅ What kind of operations you can do (math or not)

 ✅ What kind of visualizations are suitable (chart, graph, etc.)

 ✅ What algorithms can be used

 ✅ Whether you need to convert it to a different type

🔸 1. Numeric or Continuous Data

Used to represent measurable quantities (you can do math with them).

a) Continuous Data

 Can take infinite values between any two numbers

 Often includes decimals

📌 Examples:

 Temperature: 32.5°C, 101.45°F

 Height: 170.2 cm

b) Integer Data

 Only whole numbers (no decimals)

 Often used for counting things

📌 Examples:

 Number of children: 2, 3

 Number of orders: 15

 Days below 0°C: 10

🔸 2. Categorical or Nominal Data

Used for labels or names. You can’t do math with them.

a) Nominal Data
 No order or ranking — just names or categories

📌 Examples:

 Eye colour: Blue, Green, Brown

 Gender: Male, Female

b) Ordinal Data (Ordered Nominal)

 Has a specific order, but the difference between values is not exact

📌 Examples:

 Customer reviews: Poor < Average < Good < Excellent

 Education level: High School < UG < PG

🌟 Descriptive Statistics

🔹 What is Descriptive Statistics?

 It is the process of summarizing and describing the most important features of a dataset.

 Just like you summarize a story into key points, in data science we summarize large data to
make it easier to understand.

✅ Real-life Examples:

 Average income in a city

 Median house price in an area

 Range of marks in a class

 Average credit scores of people

🔸 Types of Descriptive Statistics

They are mainly of two types based on how many variables you're analyzing:

1. Univariate Exploration

 "Uni" means one → It means studying one variable at a time.

 Example: In the Iris dataset, you can check just Sepal Length alone.

Ways to describe a single variable:

🔹 1. Central Tendency

These help to know the center or typical value in your data.


Term | Meaning | Example
Mean (Average) | Add all values and divide by the number of values | 60, 70, 80, 90, 100 → (60+70+80+90+100)/5 = 80
Median | Middle value when data is arranged in order | 60, 70, 80, 90, 100 → Median = 80; 60, 70, 80, 90 → Median = (70+80)/2 = 75
Mode | Most frequent value | 80, 80, 70, 90, 100 → Mode = 80

🔹 2. Dispersion (Spread of Data)

These tell us how spread out or different the values are.

Term | Meaning | Example
Range | Highest - Lowest | Temp: 110°F and 30°F → Range = 80°F
Variance | Average of squared differences from the mean | Tells how far values are from the average
Standard Deviation (SD) | Square root of variance | Higher SD = data is more spread out
Max/Min | Highest and lowest value | Useful to know extremes
Quartiles / Interquartile Range | Divide data into parts to study spread | Helps spot outliers

🔹 3. Count / Null Count

 Count = How many values are present

 Null count = How many missing (empty) values are there

🔹 4. Shape of Distribution

Tells how data looks visually:

Term | Meaning
Symmetry | Data is balanced on both sides of the middle
Skewness | Data leans left or right (due to outliers)
Kurtosis | How sharp or flat the peak is in the graph

🔹 5. Five-Number Summary
It’s a simple way to describe how the data is spread.

The 5 numbers are:

1. Minimum – Smallest value

2. Q1 (First Quartile) – 25% point

3. Q2 (Median) – Middle value (50%)

4. Q3 (Third Quartile) – 75% point

5. Maximum – Largest value

Example:
Dataset: 4, 7, 9, 10, 15, 18, 21, 25, 29, 35

 Minimum = 4

 Q1 = 9

 Median (Q2) = (15+18)/2 = 16.5

 Q3 = 25

 Maximum = 35

Why it helps:

 Quickly shows how data is spread

 Detects outliers

 Used in box plots
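
A minimal sketch that computes the same five numbers with NumPy percentiles; note that different quartile conventions can give slightly different Q1/Q3 values than the median-of-halves method used above:

```python
# Five-number summary sketch for the dataset above, using NumPy percentiles.
import numpy as np

data = [4, 7, 9, 10, 15, 18, 21, 25, 29, 35]

print("Minimum:", np.min(data))
print("Q1:", np.percentile(data, 25))          # interpolated, may differ slightly
print("Median (Q2):", np.percentile(data, 50)) # 16.5
print("Q3:", np.percentile(data, 75))          # interpolated, may differ slightly
print("Maximum:", np.max(data))
```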

2. Multivariate Exploration

 Here, we analyze more than one variable at the same time.

Why it’s useful:

 Understand the relationship between different features

 Find patterns

 Useful for predictive modeling like classification or regression

🔹 Central Data Point (Mean Point)

 We can find an average data point by calculating the mean of each attribute.

 Example:

o Mean = {5.006, 3.418, 1.464, 0.244}

o This "average flower" may not exist, but gives a typical idea.
🔹 Correlation

 Correlation tells how two variables change together.

Example | Correlation Type
Temperature ↑ → Ice cream sales ↑ | Positive correlation
Temperature ↑ → Jacket sales ↓ | Negative correlation

⚠️But remember:

Correlation ≠ Causation

Example:

 Ice cream sales and shark attacks both rise in summer.

 They’re correlated, but one doesn't cause the other.

 The real reason is: summer season.

🔹 Pearson Correlation Coefficient (r)

Value | Meaning
+1 | Perfect positive relationship
0 | No linear relationship
–1 | Perfect negative relationship
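
A minimal sketch, with made-up temperature and sales values, showing how Pearson's r can be computed with NumPy:

```python
# Correlation sketch: Pearson's r for two made-up variable pairs.
import numpy as np

temperature = [20, 25, 30, 35, 40]
ice_cream_sales = [100, 150, 210, 260, 320]     # rises with temperature
jacket_sales = [500, 400, 280, 180, 90]         # falls with temperature

print(np.corrcoef(temperature, ice_cream_sales)[0, 1])   # close to +1
print(np.corrcoef(temperature, jacket_sales)[0, 1])      # close to -1
```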

🔹 Visualization

 Graphs should be the first step in checking relationships between variables.

 Helps spot:

o Outliers

o Non-linear patterns

Common tool: Scatter plots


📊 DATA VISUALIZATION (In Simple Words)

✅ What is Data Visualization?


 Data visualization means showing data as pictures like charts, graphs, or maps.

 It is used to explore and understand big sets of data easily.

 It helps to see patterns, trends, or unusual data that are hard to notice in just numbers.

🎯 Why Do We Use Data Visualization?

1. To Understand Large Data

 Big data tables are hard to read.

 A simple graph can summarize thousands of values.

 Example: A line chart showing sales every month for 5 years can easily show if sales are
increasing or decreasing.

2. To Find Relationships

 By using X and Y axes (like in graphs), we can check if two things are related.

 Example: A scatter plot of study time vs. exam scores can show if studying more gives better
results.

3. Because the Human Brain Likes Visuals

 Our brain is good at spotting visual patterns.

 So, using charts helps us quickly notice clusters, trends, or unusual points.

1️⃣ Univariate Visualization

(Studying one column or attribute at a time)

These techniques help us understand how the values of one variable are spread out or grouped.

📦 Histogram

 A histogram is a type of bar chart that shows how many times a value or a range of values
occurs.

 X-axis: Value range (called bins)

 Y-axis: Frequency (how often it appears)

 Helps us see the shape of data (spread, peaks, etc.)

Example:

 Bin 4.0–4.5 cm: How many flowers have petal length in this range?

✅ Stratified Histogram (Iris Dataset Example)

Shows petal length for 3 different species of Iris flower:


 Iris setosa (Blue)

o Petal length: ~1.0–1.9 cm

o Shortest petals

o Clearly separate from other types

 Iris versicolor (Green)

o Petal length: ~3.0–5.0 cm

o Medium petals

o Overlaps with Iris virginica

 Iris virginica (Red)

o Petal length: ~4.5–7.0 cm

o Longest petals

🔢 Quartiles

 Quartiles divide the data into 4 equal parts.

 It tells us where the middle and spread of the data are.

There are 3 main quartiles:

 Q1 (Lower Quartile) – Middle of the first 25% of data

 Q2 (Median) – Middle of the full dataset (50%)

 Q3 (Upper Quartile) – Middle of the last 25% of data

Each quartile = 25% of the total data.

✍️How to Find Quartiles:

Data Set (Math scores of 19 students in order):


59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98

 Step 1: Find Q2 (Median)

o 19 values → 10th value is the median

o Q2 = 75

 Step 2: Find Q1 (Lower Quartile)

o First 9 numbers: 59 to 75

o Middle value (5th): Q1 = 68

 Step 3: Find Q3 (Upper Quartile)

o Last 9 numbers: 76 to 98
o Middle value (5th): Q3 = 84

Quartile | Value
Q1 | 68
Q2 | 75
Q3 | 84

📏 Interquartile Range (IQR)

 IQR = Q3 - Q1

 Shows the spread of the middle 50% of the data

Example:
IQR = 84 - 68 = 16


📊 Box Plots (Box and Whisker Plots)

 A box plot is a chart used in data analysis to show how values are spread out (distribution)
and if the data is skewed (tilted more to one side).

 It displays 5 key summary points of data:


✅ Minimum
✅ First quartile (Q1)
✅ Median
✅ Third quartile (Q3)
✅ Maximum

🔹 Explanation of Each Part:

 Minimum Score:
The smallest value in the dataset (ignoring extreme outliers). It's shown at the end of the left
whisker.
Example: If scores are 10, 20, 25, 30, 100, the minimum (excluding outlier 100) is 10.

 Lower Quartile (Q1):


25% of the values lie below this point.
Example: In 100 test scores sorted from low to high, the 25th score is Q1.

 Median (Q2):
Middle value. 50% of data is above and 50% is below this point.
Example: If scores are 10, 20, 30, 40, 50, the median is 30.
 Upper Quartile (Q3):
75% of the values lie below this point.
Example: In 100 test scores, the 75th score is Q3.

 Maximum Score:
Highest value (excluding extreme outliers), shown at the end of the right whisker.
Example: In 10, 20, 25, 30, 40, 100 → max is 40 (100 is an outlier).

 Whiskers:
Lines that go from Q1 to the minimum, and from Q3 to the maximum. They show the spread
of the lower 25% and upper 25% values.

 Interquartile Range (IQR):


The range between Q1 and Q3. It covers the middle 50% of the data.
Formula: IQR = Q3 - Q1

📈 Distribution Chart

 Used for continuous numeric attributes like petal length in flowers.

 Instead of showing each value, we use a normal distribution curve (also called bell curve).

🔹 Bell Curve Basics:

 Looks like a bell (symmetrical around the mean).

 Most data points are close to the mean (average).

 Fewer data points appear far away (extreme values).

🔹 Formula:

 Normal distribution is calculated using a formula with:

o μ (mu) = mean

o σ (sigma) = standard deviation

o x = value being evaluated
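
For reference, the bell-curve formula these symbols plug into is the standard normal-distribution density function:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
```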

🔹 Example:

 Petal lengths of three species (I. setosa, I. versicolor, I. virginica) are plotted using three bell
curves.

 This helps to compare how different the petal lengths are across species.

🔺 Multivariate Visualization

 Multivariate means showing 3 or more variables together to see how they relate.

🔸 Types:
1. Univariate – 1 variable
Example: Just “Study Hours”

2. Bivariate – 2 variables
Example: “Study Hours” vs. “Exam Score”

3. Multivariate – 3+ variables
Example: “Study Hours”, “Attendance”, “Sleep Time”, “Exam Score” together

🔹 Scatterplot

 A scatterplot is a graph that shows the relationship between two continuous variables using dots.

 Each dot = 1 data point.

🔸 Example:

 Want to check if Study Hours affect Exam Score

o X-axis: Study Hours

o Y-axis: Exam Score

What we can observe:

1. Correlation (Relationship)

o Positive correlation: More study → higher marks

o Negative correlation: More study → lower marks

o No pattern: No relation

2. Clusters

o Dots form groups → may show different types of students (e.g., regular vs. irregular)

3. Outliers

o A dot far from the rest.


Example: A student studies 5 hrs/day but scores only 40%. Something may be wrong.

🔹 Scatter Multiple

 An upgraded scatterplot that shows more than two dimensions.

🔸 Example:

 X-axis: Study Hours

 Y-axis: Two variables —

o Exam Score (green dots)

o Attendance (blue dots)


🟢 Useful when comparing multiple features against a single feature like "Study Hours".

🔹 Scatter Matrix

 A scatter matrix shows scatterplots for all possible pairs of variables in a dataset.

🔸 Example:

 Iris dataset with 4 attributes → 16 mini scatter plots.

 Diagonal shows a feature compared with itself.

 Color of dots shows species of flower (e.g., Setosa, Versicolor, Virginica).

✅ Good for comparing multiple variables and spotting patterns quickly.
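
A minimal sketch (assuming scikit-learn is available to load the Iris data) that draws this kind of scatter matrix with pandas:

```python
# Scatter-matrix sketch: all pairwise scatter plots of the four Iris measurements.
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")   # keep only the four numeric attributes

scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```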

🔹 Bubble Chart

 A bubble chart is like a scatterplot but adds one more variable by using bubble size.

🔸 Example (Iris Dataset):

 X-axis: Petal Length

 Y-axis: Petal Width

 Bubble size: Sepal Width

 Color: Flower species

🟢 Shows 3 variables at once (X, Y, and size).

🔹 Density Chart

 Like a scatterplot but adds background color to show another dimension.

 Can show up to 4 dimensions.

🔸 Example:

 X-axis: Petal Length

 Y-axis: Sepal Length

 Color of background: Sepal Width

 Color of data point: Flower Species

🟢 Useful for visualizing complex data with multiple features.
