
SJB Institute of Technology

AN AUTONOMOUS INSTITUTE UNDER VISVESVARAYA TECHNOLOGICAL UNIVERSITY


No.67, BGS Health & Education City, Dr. Vishnuvardhana Rd, Kengeri, Bengaluru, Karnataka, India.

Report
On

EDA Case Study

Title: "Online Shopping analysis"


SUBMITTED IN THE FULFILLMENT FOR THE AWARD OF THE DEGREE OF

BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE AND ENGINEERING

By:
ROHAN N L [1JB23CS125]
MITHUN M [1JB23CS089]
PRATHVI RAJ [1JB23CS1]
NIKHIL K P [1JB24CS408]
LIKN N [1JB24CS406]
PARTHA H C [1JB24CS409]

UNDER THE GUIDANCE OF


Dr. Prakruthi M K
[Assistant Professor]
Dept. of CSE, SJBIT.

Part 1

1. Introduction to the Dataset


1.1 The main objectives of analyzing this dataset:
1. Objective of the Analysis

● We’re analyzing this online shopping dataset to uncover patterns in customer
behavior, pricing, and satisfaction. The goal is to provide actionable insights that
businesses can use to improve sales, marketing, and customer experience.

2. Understanding Customer Demographics

● Next, we need to understand who is buying these goods. The dataset records age,
gender, and location, so we can break shopping behavior down by customer segment. Do
younger shoppers prefer some product categories? Do rural and urban shoppers
spend differently? Answering these questions lets companies focus their marketing.

3. Analyzing Pricing and Discounts

● Price is a major factor in purchasing decisions. We’ll check whether higher-priced
items sell less, or if discounts actually boost sales. Are certain demographics more
price-sensitive? If discounts lead to more purchases but lower profits, businesses
might need to rethink their strategy.

4. Analyzing Customer Reviews & Satisfaction

● Not all products have ratings or reviews—why? Are customers less likely to review
items they dislike? We’ll also see if shipping speed or product category affects
ratings. If fast shipping leads to better reviews, companies might prioritize logistics
improvements.

5. Cleaning and Preparing the Data

● Real-world data is messy, so we’ll fix missing values, remove duplicates, and
standardize text (like making all reviews lowercase). Clean data means more accurate
insights—no one wants skewed results because of typos or blank entries.

6. Providing Business Recommendations

● Lastly, we'll translate findings into actionable recommendations. If data indicate that
customers dislike slow shipping, we'll recommend quicker delivery methods. If
specific products are unpopular, we could advise discounts or improved marketing.
The aim is to enable companies to make wiser decisions with the help of actual
customer behavior.

7. Why This Matters

● Businesses apply this type of analysis daily to remain competitive. By understanding
trends in shopping behavior, price, and satisfaction, they can refine their
strategies, whether that means tweaking prices, streamlining shipping, or
targeting ads more effectively. This is not just a school assignment; it is how actual
businesses grow.

1.2 An overview of the key columns/variables in the dataset:

Dataset Overview: Key Columns and Variables


This online shopping dataset contains 1,000+ synthetic purchase records with the following
key columns:

1. Customer Information

● CustomerID: Individual shopper's unique identifier
● CustomerAge: Buyer's age (range: 18–70 years)
● CustomerGender: Male/Female/Other
● CustomerLocation: Urban/Suburban/Rural

This enables segmentation of shoppers by demographics so that marketing can be personalized.

2. Product Details

● ProductCategory: Electronics/Clothing/Books/Home & Kitchen/Beauty
● ProductPrice: Price of each product
● DiscountApplied: True/False (whether or not a discount was applied)

This reveals which categories drive purchases and how prices influence buying
decisions.

3. Purchase Logistics

● PurchaseDate: Transaction date (YYYY-MM-DD)
● ShippingMethod: Standard/Express/NaN (missing data)

This helps determine whether shipping speed influences customer satisfaction.

4. Customer Feedback

● Rating: Product rating (1–5 stars, with missing values)
● Review: Text feedback (e.g., "Good product", "Poor quality", or missing)

This captures satisfaction and identifies product problems.

5. Metadata

● RecordID: Each transaction's unique ID


● Missing Data: There are gaps in some columns (Rating, Review, ShippingMethod).
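
A first look at the columns described above might be sketched as follows. This is a minimal illustration, assuming pandas and NumPy are available; the five rows are hypothetical sample values that only mirror the column names in the overview, not records from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample that mirrors the columns described above.
df = pd.DataFrame({
    "RecordID": [1, 2, 3, 4, 5],
    "CustomerID": [101, 102, 103, 101, 104],
    "CustomerAge": [25, 41, 33, 25, 67],
    "CustomerGender": ["Female", "Male", "Other", "Female", "Male"],
    "CustomerLocation": ["Urban", "Rural", "Suburban", "Urban", "Urban"],
    "ProductCategory": ["Beauty", "Electronics", "Books", "Clothing", "Home & Kitchen"],
    "ProductPrice": [35.0, 420.5, 12.99, 59.9, 150.0],
    "DiscountApplied": [True, False, False, True, False],
    "PurchaseDate": ["2023-01-15", "2023-03-02", "2023-07-19", "2023-11-24", "2023-12-20"],
    "ShippingMethod": ["Express", np.nan, "Standard", np.nan, "Express"],
    "Rating": [5.0, np.nan, 4.0, 2.0, np.nan],
    "Review": ["Good product", None, "ok", "Poor quality", None],
})

# Column types as pandas inferred them (PurchaseDate is still text at this point).
print(df.dtypes)

# Count the gaps per column and keep only the columns that have any.
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])
```

On the real file, the same two calls would show which columns need type conversion and where the gaps in Rating, Review, and ShippingMethod sit.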

2.1 Significance of the Dataset:


Real-Life Relevance of the Dataset

The synthetic dataset mimics real-life e-commerce data, helping organizations understand
their customers and make fact-based decisions. It offers insight into consumer
preferences, pricing strategies, and operational improvements that directly affect sales
and satisfaction.

1. Increasing Customer Targeting


● Demographic information (age, gender, area) identifies purchase patterns,
allowing for customized marketing. Companies can direct advertising and
promotions to individual segments, such as targeting electronics promotions at urban
millennials or home furnishings at older rural consumers.

2. Optimizing Pricing Strategies


● The data shows how price and discounts affect sales. Firms can
determine which products require promotion and which can be sold at full price,
keeping profitability and demand in balance.

3. Improving Customer Experience

● Reviews and ratings identify drivers of satisfaction. Quick shipping is associated with
improved ratings, while incomplete reviews could be a sign of dissatisfaction. Such
insights can assist companies in enhancing logistics and product quality.

4. Facilitating Data-Driven Decision Making


● Working with this dataset develops real-world skills—tidying up dirty data, exploring
trends, and reporting findings. These skills are crucial for today's business analytics
across sectors.

5. Preparing for Business Challenges


● Although artificial, the dataset captures real commerce issues such as missing data
and price concerns. Resolving these equips analysts with tools to tackle genuine
corporate issues, making the practice worthwhile for career growth.
● The real strength of the dataset comes in bridging theory and application,
demonstrating how data analysis resolves genuine business issues.

2.2 Potential Findings and Foreseen Conclusions from the Dataset:

1. Customer Buying Behaviors and Segmentation

● By analyzing the demographic variables, we expect to identify distinct buying
behavior across customer segments. The dataset should indicate whether younger
consumers (18-30) buy differently from middle-aged customers (31-50) or older
shoppers (51+). We also expect to find gender-based preferences, such as female
customers possibly buying more beauty products while male shoppers prefer
electronics. Geographic analysis should reveal whether urban customers purchase
more frequently or prefer different products than suburban and rural customers. This
information will be crucial for creating targeted advertising strategies and tailored
shopping experiences.

2. Discount Effectiveness and Pricing Strategy Optimization

● The dataset offers a basis for examining price elasticity across product categories.
We expect to identify price points that balance sales volume against profit margins.
The study should indicate which product categories are most sensitive to price, and
hence most likely to gain from discounting, as opposed to premium products whose
customers are less sensitive to price. We will investigate whether promotions
increase the number of purchases or only shift when they are made. We can also
determine whether some demographics respond better to promotions, which would
allow more strategic targeting of discounts.

3. Shipping and Customer Satisfaction Correlation

● A major expected finding is how shipping options correlate with customer
satisfaction indicators. We hope to measure to what extent shipping options (Express
vs. Standard) influence customer ratings and likelihood to repurchase. The study
could identify threshold delivery times with strong influences on satisfaction levels.
We will investigate whether specific product categories or customer segments place
more importance on speedy shipping, which would justify differentiated shipping
approaches. These results will be highly valuable for planning logistics and
investments in customer experience.

4. Product Category Performance Analysis

● By analyzing sales volume, price, and customer satisfaction by category, we hope to
determine top- and bottom-performing categories. The breakdown should indicate
which categories produce the highest margins compared to those that produce the
greatest volume. We hope to find out if some categories have unusually high or low
satisfaction scores, which may reflect quality problems or superior value. These
findings will be used to improve inventory control, marketing emphasis, and
decisions to expand or reduce categories.

5. Customer Feedback Patterns and Sentiment Analysis

● The rating and review data holds considerable potential for understanding customer
sentiment. We hope to find recurring themes in negative and positive comments,
potentially uncovering persistent product strengths or repeated problems. Analysis of
missing reviews might show silent dissatisfaction or where feedback could be gathered
more effectively. We will explore whether review sentiment aligns with certain product
features, shipping experiences, or demographic variables. These findings can inform
product enhancements and customer service improvements.

6. Seasonal and Temporal Patterns of Purchase

● By examining purchase dates, we expect to reveal seasonal patterns in consumer
buying behavior. This could involve determining peak shopping periods by product
category, or identifying recurring weekly/monthly purchasing cycles. We expect to
determine whether certain demographics have distinct seasonal shopping habits, which
may inform timing-specific marketing initiatives. These findings will be useful in
inventory planning, promotional scheduling, and staffing decisions.

7. Demographic-Specific Shopping Behaviors

● The data allows us to drill down into the way various demographic segments use the
e-commerce website. We anticipate observing differences in browsing behavior,
purchase frequency, average order value, and product interests by age, gender, and
location segments. These insights can inform more targeted marketing strategies and
possibly identify underserved customer segments that represent opportunities for
growth.

8. End-to-End Discount Strategy Evaluation

● In addition to basic discount effectiveness, we seek to understand the larger effect of
promotions. This involves looking at whether discounts promote new customer
acquisition versus merely rewarding loyal customers, and whether they induce
complementary full-price purchases. We will look at whether specific discount forms
(percentage-off vs. dollar-off) work better for specific product categories or customer
segments. These insights will be used to improve promotional planning and
budgeting.

9. Strategic Business Implications:

● The findings from this analysis will enable companies to:


● Create extremely focused marketing efforts that speak to targeted customer segments
● Implement adaptive pricing tactics that drive profitability while being competitive
● Streamline logistics processes to drive customer satisfaction
● Make fact-based decisions regarding product range and merchandising
● Develop more efficient customer feedback systems
● Plan inventory and staffing based on forecastable demand patterns
● Develop personalized shopping experiences that build customer loyalty

3. Data Type and Nature:

3.1 Types of Data Incorporated in the Dataset

1. Transactional Data

● Purchase Records: Every row is a complete transaction with specific identifiers
● Product Details: Actual items purchased with categories and pricing
● Timestamps: Actual dates/times purchases were made

2. Customer Demographic Data

● Age Ranges: Customers aged 18 to 70 years


● Gender Information: Male, female, and other gender types
● Geographic Data: Urban, suburban, and rural location types

3. Product Information

● Category Breakdown: Electronics, clothing, books, home goods, beauty products


● Pricing Data: Actual sale prices between $10 and $500
● Discount Status: Whether each purchase featured a promotional discount

4. Customer Experience Data

● Ratings: 1-5 star ratings for products purchased


● Review Content: Written customer feedback
● Shipping Methods: Standard vs. express delivery options

5. Operational Data

● Shipping Information: Delivery method choices


● Inventory Data: Product availability and category distribution
● Sales Timing: Seasonal and time-based purchase patterns

6. Behavioral Data

● Purchase Frequency: How frequently customers purchase


● Basket Composition: What products are purchased together
● Price Sensitivity: Reaction to reductions and price variation

7. Indicators for Missing Data

● Incomplete Records: Holes in ratings, reviews, or shipping information
● Null Values: Actual indicators of missing data
● Data Quality Flags: Indicators for possible data integrity problems

Types of Data Used in the dataset:

| Column Name | Data Type | Description | Category | Further Classification |
|---|---|---|---|---|
| RecordID | Integer | Unique identifier for each purchase record | Quantitative | Discrete |
| CustomerID | Integer | Unique identifier for each customer | Quantitative | Discrete |
| ProductCategory | String (Text) | Category of purchased product | Qualitative | Nominal (Electronics, Clothing, etc.) |
| ProductPrice | Float | Price of product in USD | Quantitative | Continuous |
| PurchaseDate | DateTime | Date and time of purchase | Quantitative | Continuous (Time-series) |
| CustomerAge | Integer | Age of customer (18-70 years) | Quantitative | Discrete |
| CustomerGender | String (Text) | Gender of customer (Male/Female/Other) | Qualitative | Nominal |
| CustomerLocation | String (Text) | Location type (Urban/Suburban/Rural) | Qualitative | Nominal |
| Rating | Float | Product rating (1-5 stars) | Quantitative | Continuous* |
| Review | String (Text) | Customer-written feedback | Qualitative | Textual (Nominal) |
| ShippingMethod | String (Text) | Delivery method (Standard/Express) | Qualitative | Nominal |
| DiscountApplied | Boolean | Whether discount was used (True/False) | Qualitative | Binary (Nominal) |

3.2 Importance of Data Type Identification in Data Analysis

1. Basis for Correct Analysis Techniques

● Determining data types is important because it decides which analytical methods can
be used. Numerical data such as product prices or customer ages allow mathematical
calculations and statistical tests like regression analysis. Categorical data such as
product categories or customer locations need different methods like frequency
counting or chi-square tests. Applying the wrong methods to data types results in
nonsensical results - you can't compute an average of nominal categories such as
shipping methods.

2. Crucial for Data Cleaning and Preparation

● Recognition of data types has a direct influence on how we treat missing values and
outliers. Numerical missing values may be imputed with medians, whereas categorical
missing values may require "Unknown" flags. Dates need special parsing to allow
time-series analysis, and text fields such as reviews require text-specific
preprocessing. Correct typing avoids mistakes such as treating customer IDs as
numerical values or sorting dates alphabetically as strings.
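
The typing points above can be sketched in a few lines. This is only an illustration under stated assumptions: pandas is assumed, and the three rows are hypothetical. It converts PurchaseDate from text to a datetime type so that date arithmetic becomes possible, and stores CustomerID as text so that it is no longer picked up as a numeric column.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
df = pd.DataFrame({
    "CustomerID": [1002, 1001, 1003],
    "PurchaseDate": ["2023-12-01", "2023-02-15", "2023-10-05"],
    "ProductPrice": [49.99, 120.0, 15.5],
})

typed = df.copy()

# Parse the YYYY-MM-DD strings into real datetimes.
typed["PurchaseDate"] = pd.to_datetime(typed["PurchaseDate"], format="%Y-%m-%d")

# IDs are labels, not quantities, so keep them as text.
typed["CustomerID"] = typed["CustomerID"].astype(str)

# Date arithmetic only works once the column is a datetime type.
span_days = (typed["PurchaseDate"].max() - typed["PurchaseDate"].min()).days

# Only genuinely numeric columns remain candidates for means and correlations.
numeric_cols = typed.select_dtypes(include="number").columns.tolist()

print(span_days)
print(numeric_cols)
```

With the ID stored as text, summary functions that operate on numeric columns skip it, which avoids meaningless figures such as an average customer ID.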

3. Essential for Accurate Visualizations

● Visual representation depends on accurate data typing. Continuous variables such
as prices can be shown with histograms and scatter plots, while discrete categories are
represented with bar charts or pie charts. Timeline visualizations are needed for
time-based data, and word clouds or sentiment analysis outputs can be used for textual
data. Incorrectly typed data generates deceptive charts that misrepresent the
actual patterns of the data.

4. Facilitates Proper Feature Engineering

● Machine learning and advanced analytics need particular transformations for each data
type. Age values can be binned into ordinal categories, and product categories can be
one-hot encoded. Text reviews need natural language processing rather than the
numerical analysis used for ratings. Modeling to predict customer gender (a category)
is entirely different from forecasting purchase value (a continuous quantity).

5. Facilitates Right Business Interpretation

● Finally, data typing connects raw numbers to business intelligence. Stakeholders must
understand if a number is an ID (nominal), a rating (ordinal), or a quantifiable
amount (interval/ratio) in order to make appropriate decisions. Clear typing avoids
misinterpretations such as thinking higher category numbers mean better performance
when they're merely labels. This knowledge converts data from abstract values to
actionable business intelligence.

4. Potential Challenges in Data Analysis

4.1 What are some common issues you might face when analyzing datasets like this?

When working with this online shopping dataset, I ran into several problems that made
the analysis tricky. Here are the main issues I noticed:

1. Missing Information
The dataset has a lot of empty spots, especially in the ratings (161 missing), reviews (188
missing), and shipping methods (340 missing). This is a big problem because:

● If unhappy customers aren't leaving ratings, our satisfaction analysis might be too positive
● Without shipping methods, we can't properly analyze delivery performance
● I had to decide whether to fill in these gaps or remove the incomplete rows

2. Weird or Inconsistent Data


Some things didn't make sense at first glance:

● Prices range from $10 to $500, but there might be mistakes (like a $10,000 phone that
slipped in)
● Some reviews are just one word ("good") while others are longer, making them hard to
compare
● Categories like "Electronics" are very broad - is a $20 charger really the same as a
$500 TV?

3. Time-Related Problems
The purchase dates cover one year, but:

● We don't know if this includes holiday seasons when shopping patterns change
● There's no time of day recorded, which could show when people shop most
● Prices might have changed during the year, but we can't see that

4. Limited Customer Details
While we have age, gender and location:

● "Urban/Suburban/Rural" is pretty vague - a big city urban is different from small town urban
● We don't know things like income or shopping frequency that would help explain behavior
● Gender is just Male/Female/Other - this might be too simple for proper analysis

5. Review Text Challenges


The customer reviews are messy:

● Some are in ALL CAPS, some lowercase


● Short ones like "ok" don't tell us much
● Sarcastic comments ("GREAT... NOT!") could fool our analysis
● We don't know when reviews were written relative to purchase

4.2 Why is exploratory data analysis (EDA) important in the context of this dataset?

Why Exploratory Data Analysis (EDA) is Important for This Dataset

Exploratory Data Analysis (EDA) is important for this online shopping dataset since it
uncovers buried patterns, identifies potential issues, and aids in meaningful analysis. The
following are reasons why EDA is so important for this dataset:

1. Understanding Data Quality

● Prior to entering into extensive analysis, EDA indicates problems such as:
● Missing values (e.g., blank ratings or shipping modes) that might skew results.
● Inconsistencies (such as typos in product groups such as "Electronics" vs.
"Eletronics").
● Without EDA, these issues might result in false conclusions—such as concluding all
customers are satisfied when low ratings are simply absent.

2. Uncovering Customer Behavior Patterns

● EDA assists with answering important questions:


● Who's purchasing what? (Do women purchase more beauty products? Do older
customers spend more?)
● When do purchases occur? (Are holiday or weekend spikes occurring?)
● How do discounts influence sales? (Do people buy more when items are on sale?)
● For example, a simple histogram of CustomerAge might show most shoppers are 25–
40, indicating where to target marketing.

3. Guiding Analysis Decisions

● The data is in mixed types (numbers, categories, text), so EDA guides selection of
appropriate tools:
● Numerical data (prices, ratings): Apply statistics (mean, median) or correlation tests.
● Categorical data (product categories, locations): Apply bar charts or frequency tables.
● Text data (reviews): Look for popular keywords ("love," "broken") to assess
sentiment.

● Without EDA, you could be wasting time performing regression on categories (such
as ProductCategory) that require another strategy.

4. Checking Assumptions

● EDA checks whether the data satisfies assumptions:


● If ratings are overwhelmingly 4–5 stars, is that because customers are satisfied, or are
bad ratings not present?
● If urban customers predominate, is that because of genuine trends or sampling bias?
● For example, a boxplot of ProductPrice against CustomerLocation may show
suburban buyers purchase more expensive items—a helpful fact for targeted
advertising.

5. Preparing for Advanced Analysis

● EDA primes machine learning or statistical modeling by:


● Finding insightful features (e.g., DiscountApplied could lead to sales predictions).
● Pointing out noise (e.g., unnecessary columns like duplicate IDs).
● Recommending transformations (e.g., encoding PurchaseDate to day-of-week for
trend spotting).

6. Real Impact of EDA

● For this dataset, skipping EDA can mean:
● Overlooking that most missing reviews are for low-rated products (suppressing
dissatisfaction).
● Missing that "Express" shipping is associated with higher ratings (one of the most
important drivers of satisfaction).
● Wasting effort examining irrelevant variables (such as RecordID).
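
Two of the checks listed above can be sketched with simple group comparisons. This is a minimal illustration assuming pandas and NumPy; the six rows are hypothetical values chosen to show the mechanics, so the numbers printed are not findings from the dataset. The `ReviewStatus` column is a helper introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the pattern in these rows is illustrative, not a result.
df = pd.DataFrame({
    "ShippingMethod": ["Express", "Standard", "Express", "Standard", "Standard", "Express"],
    "Rating": [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
    "Review": ["great", None, "good", None, None, "love it"],
})

# Label each row by whether a review was left (helper column, not in the dataset).
df["ReviewStatus"] = np.where(df["Review"].isna(), "missing", "present")

# Are ratings lower when the review is missing?
rating_by_review = df.groupby("ReviewStatus")["Rating"].mean()

# Do ratings differ by shipping method?
rating_by_shipping = df.groupby("ShippingMethod")["Rating"].mean()

print(rating_by_review)
print(rating_by_shipping)
```

A gap between the group means is only a prompt for further investigation; it shows association, not that shipping speed or missing reviews cause the difference.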

5. Data Preprocessing
5.1 What is data preprocessing, and why is it crucial before performing analysis?

Definition of Data Preprocessing

Data preprocessing refers to the cleaning, transformation, and organization of raw data into a
structured format that is ready for analysis. It entails dealing with missing values, eliminating
noise, standardizing formats, and changing data types to guarantee accuracy and consistency.

Why It's Critical for This Dataset:

1. Handles Missing Data
● Missing values in important columns such as Rating (161 missing), Review (188
missing), and ShippingMethod (340 missing) exist in the dataset.
● Solution: Replace missing ratings with the median, mark missing shipping methods as
"Unknown," or remove incomplete records to prevent skewed results.

2. Resolves Inconsistencies
● Text fields such as Review can contain typos, mixed cases ("GOOD" vs. "good"), or
irrelevant entries.
● Solution: Normalize text (lowercase, strip whitespace) and spell-check to maintain
consistency.

3. Detects Outliers
● Outlying values in ProductPrice (e.g., a $10,000 item in a $10–$500 range) can skew
analysis.
● Solution: Apply IQR or z-score techniques to identify and eliminate unrealistic
values.

4. Standardizes Data Types


● Dates (PurchaseDate) need to be converted to datetime objects for time-series
analysis.
● Categorical data (ProductCategory, CustomerGender) can require encoding (e.g., one-
hot) for machine learning.

5. Improves Analysis Accuracy


● Dirty data results in erroneous insights (e.g., computing average ratings without
missing value handling distorts results).
● Preprocessing makes statistical tests, visualizations, and models function as expected.

6. Consequences of Skipping Preprocessing


● Garbage In, Garbage Out (GIGO): Models trained on dirty data will make invalid
predictions.
● Wasted Time: Working with messy data tends to involve redoing work after finding
errors halfway through.
● Misleading Insights: For instance, treating CustomerID as a numeric value might
result in meaningless calculations (e.g., "average customer ID").

Missing values summary:

| Column Name | Total Missing Values | Percentage of Missing Values | Method Used to Handle Missing Data | Reason for Choosing the Method |
|---|---|---|---|---|
| Rating | 161 | 16.1% | Median imputation | Robust to outliers in rating distribution |
| Review | 188 | 18.8% | "No Review" category | Preserves meaningful absence of feedback |
| ShippingMethod | 340 | 34.0% | "Unknown" category | Accounts for systemically missing shipping data |
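
The three handling methods in the summary above could be applied roughly as follows. This is a sketch assuming pandas and NumPy, run on a hypothetical six-row sample; the median is computed from the sample itself, so it will not match the value used in the report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in the same three columns as the summary above.
df = pd.DataFrame({
    "Rating": [5.0, np.nan, 4.0, 2.0, np.nan, 3.0],
    "Review": ["Good product", None, "ok", "Poor quality", None, "fine"],
    "ShippingMethod": ["Express", None, "Standard", None, "Express", None],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
})
print(missing_summary)

cleaned = df.copy()

# Median imputation for the numeric rating.
rating_median = cleaned["Rating"].median()
cleaned["Rating"] = cleaned["Rating"].fillna(rating_median)

# Explicit categories for missing text values, so the absence stays visible.
cleaned["Review"] = cleaned["Review"].fillna("No Review")
cleaned["ShippingMethod"] = cleaned["ShippingMethod"].fillna("Unknown")

print(cleaned)
```

Keeping "No Review" and "Unknown" as their own categories means the rows stay in the analysis and the missingness itself can still be studied.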

5.2 What transformations were applied to the dataset, and why?

| Column | Transformation | Purpose | Impact |
|---|---|---|---|
| Rating | Median imputation (fill NaN with 3.0) | Preserve dataset size while maintaining rating distribution integrity | Prevents bias from deleting incomplete records; robust to outlier ratings |
| Review | Lowercase conversion + "No Review" flag | Standardize text analysis; explicitly track missing feedback | Enables sentiment analysis while preserving missing data patterns |
| ShippingMethod | "Unknown" category for missing values | Retain all transactions despite incomplete shipping data | Allows analysis of shipping patterns without discarding 34% of records |
| PurchaseDate | Convert to datetime format | Enable time-based analysis (daily/weekly trends, seasonality) | Facilitates "days_to_delivery" calculations and holiday effect studies |
| ProductPrice | Binning price (Low: $0-50, Medium: $50-200, High: $200+) | Simplify segmentation for business reporting | Creates actionable categories for marketing strategies |
| ProductPrice | Z-score normalization | Prepare for machine learning algorithms requiring standardized numerical inputs | Improves model performance for price prediction tasks |
| ProductCategory | One-hot encoding | Convert categorical variable to ML-friendly numeric format | Enables algorithms to process product types without artificial ordinality |
| CustomerAge | Generational binning (GenZ: 18-25, Millennial: 26-40, etc.) | Enhance demographic analysis | Reveals age-based purchasing patterns more clearly than raw numbers |
| CustomerGender | Binary encoding (Is_Female flag) | Reduce dimensionality while preserving gender information | Simplifies gender-based analysis without losing key segmentation |

Data Transformation and Binning:

| Column Name | Transformation/Method Applied | Description/Reason for Transformation | New Column Name (if applicable) |
|---|---|---|---|
| Rating | Median imputation | Replaced missing values with median (3.0) to maintain rating distribution | - |
| Review | Text standardization + missing flag | Converted to lowercase, stripped whitespace, added "No Review" for missing values | - |
| ShippingMethod | Missing value categorization | Marked missing entries as "Unknown" to preserve records | - |
| PurchaseDate | DateTime conversion | Parsed string dates into datetime format for time analysis | - |
| ProductPrice | Binning (Low/Med/High) | Created price ranges for easier segmentation ($0-50, $50-200, $200+) | PriceRange |
| ProductPrice | Z-score normalization | Standardized for machine learning models | Price_zscore |
| ProductCategory | One-hot encoding | Converted categories to binary columns for ML | Electronics_Flag, Clothing_Flag, etc. |
| CustomerGender | Binary encoding | Converted to 0/1 for "Female"/"Other" with Male as baseline | Is_Female, Is_OtherGender |
| CustomerAge | Age group binning | Created generational segments (18-25, 26-40, 41-60, 60+) | AgeGroup |
| DiscountApplied | Boolean conversion | Ensured consistent True/False | - |
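
The binning, scaling, and encoding steps in the table above might look like this in pandas. It is a sketch on hypothetical rows, with two assumptions made loudly: the stated ranges overlap at their edges ($50, $200, age 60), so right-closed bins are assumed here, and the z-score uses the population standard deviation (`ddof=0`).

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
df = pd.DataFrame({
    "ProductPrice": [12.0, 50.0, 75.0, 200.0, 450.0],
    "ProductCategory": ["Books", "Beauty", "Clothing", "Electronics", "Electronics"],
    "CustomerGender": ["Female", "Male", "Other", "Female", "Male"],
    "CustomerAge": [19, 30, 45, 60, 70],
})

# Price ranges. Assumption: bins are right-closed, i.e. (0, 50], (50, 200], (200, inf).
df["PriceRange"] = pd.cut(
    df["ProductPrice"],
    bins=[0, 50, 200, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Z-score normalization. Assumption: population standard deviation (ddof=0).
price = df["ProductPrice"]
df["Price_zscore"] = (price - price.mean()) / price.std(ddof=0)

# Age groups. Assumption: age 60 falls in "41-60", and "60+" starts at 61.
df["AgeGroup"] = pd.cut(
    df["CustomerAge"],
    bins=[17, 25, 40, 60, float("inf")],
    labels=["18-25", "26-40", "41-60", "60+"],
)

# One-hot encoding with the "_Flag" suffix used in the table.
flags = pd.get_dummies(df["ProductCategory"]).add_suffix("_Flag").astype(int)
df = pd.concat([df, flags], axis=1)

# Binary gender flags with Male as the baseline (both flags 0).
df["Is_Female"] = (df["CustomerGender"] == "Female").astype(int)
df["Is_OtherGender"] = (df["CustomerGender"] == "Other").astype(int)

print(df)
```

Whichever boundary convention is chosen, it should be stated once and applied consistently, since a $50 item or a 60-year-old customer otherwise belongs to two groups at once.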

6. Outlier Detection and Handling

6.1 Why is detecting outliers important in data analysis?


Outliers—data points that are far away from the rest of the data—can significantly impact
analysis and decision-making. They need to be identified for several reasons:

1. Ensures Data Accuracy

● Outliers can be:


● Data entry mistakes (e.g., a product price of "$10,000" instead of "$100")
● System errors (e.g., duplicate transactions)
● Fraudulent behavior (e.g., extremely large orders)
● Impact: If left unchecked, these can skew averages, correlations, and models.

2. Improves Model Performance

● Most algorithms (e.g., regression, clustering) assume that data follows
regular patterns. Outliers can:
● Skew statistical estimates (mean, standard deviation)
● Bias machine learning models towards extremes
● Reduce predictive accuracy
● Example: A $10,000 order may disproportionately inflate "average customer spend."

3. Unveils Hidden Insights

● Not all outliers are mistakes—some indicate insightful anomalies:

● Emerging trends (e.g., a sudden demand increase for an uncommon product)
● Operational problems (e.g., a shipping delay for certain regions)
● Customer segments (e.g., luxury consumers with unusual spending patterns)

4. Informs Business Decisions

● Outliers provide answers to important questions:


● Pricing: Are some items mispriced?
● Inventory: Are inventories out of sync with demand?
● Customer Experience: Do some segments have unusually high return rates?

5. Preserves Data Integrity for Visualizations

● Outliers can:
● Stretch the axis scale of graphs, compressing and concealing significant patterns
● Produce deceptive trends in line graphs
● Skew clustering in scatter plots
● Example: A boxplot of product prices becomes illegible if outliers aren't trimmed.

Outlier Detection Summary:

| Column Name | Outliers Detected | Method Used to Detect Outliers | Outlier Handling | Rationale for Handling |
|---|---|---|---|---|
| ProductPrice | Values > $500 (n=12) | IQR (Q3 + 1.5*IQR threshold) | Capped at $500 | Extreme prices likely data errors; preserves 98.8% of natural price distribution |
| CustomerAge | Ages < 18 or > 100 (n=3) | Domain knowledge validation | Removed (invalid ages) | Outside plausible shopping age range (18-70 in dataset description) |
| Rating | Ratings 0 or 6 (n=7) | Value range check (1-5 valid) | Set to NaN then median imputed | Invalid rating scale entries; median preserves central tendency |
| PurchaseDate | Dates outside 2023 (n=2) | Year extraction + validation | Removed | Dataset scope is strictly 2023 purchases per documentation |
| DiscountApplied | Non-boolean entries (n=5) | Unique value check | Converted to True/False | Ensures consistency for analysis of discount impact |
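
The detection and handling rules for the numeric columns in the summary above can be sketched as follows. This assumes pandas and NumPy and uses eight hypothetical rows; the $500 cap, the 18-100 age check, and the 1-5 rating check are taken from the table, while everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing one bad price, one bad age, and two bad ratings.
df = pd.DataFrame({
    "ProductPrice": [20.0, 45.0, 80.0, 120.0, 150.0, 210.0, 300.0, 10000.0],
    "CustomerAge": [22, 35, 16, 48, 51, 29, 63, 40],
    "Rating": [4.0, 0.0, 5.0, 3.0, 6.0, 2.0, 4.0, 5.0],
})

# IQR rule: flag prices above Q3 + 1.5 * IQR.
q1, q3 = df["ProductPrice"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
price_outliers = df[df["ProductPrice"] > upper_fence]
print(price_outliers)

cleaned = df.copy()

# Cap extreme prices at the documented maximum of $500.
cleaned["ProductPrice"] = cleaned["ProductPrice"].clip(upper=500)

# Ratings outside 1-5 become NaN, then receive the median of the valid ratings.
valid_rating = cleaned["Rating"].between(1, 5)
cleaned["Rating"] = cleaned["Rating"].where(valid_rating)
cleaned["Rating"] = cleaned["Rating"].fillna(cleaned["Rating"].median())

# Remove rows whose age fails the domain check.
cleaned = cleaned[cleaned["CustomerAge"].between(18, 100)]

print(cleaned)
```

Capping keeps the row and only limits the extreme value, whereas removal drops the whole transaction, so the choice between them depends on whether the rest of the record is still trustworthy.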

7. Correlation and Analysis of Relationships:


1. Numerical Variables
● The ProductPrice distribution is concentrated in the $10-$200 range, which holds
approximately 80% of products, with a long right tail for the small number of
high-end items priced above $300. This implies that the company caters mainly to
mid-range customers with few luxury products. CustomerAge is roughly normally
distributed around 35-45 years, which indicates that the core client base is
working-age adults, although there is a slight skew toward older customers. For
Rating, the left-skewed distribution (mean 3.8, 60% of ratings ≥4 stars) suggests
overall satisfied customers, yet the 16% missing ratings may conceal unreported
discontent, a common problem in e-commerce where dissatisfied purchasers are less
likely to provide feedback.

2. Categorical Variables
● ProductCategory shows that Electronics (30%) and Clothing (25%) lead in sales, while
Books (10%) lag behind and may need targeted promotions. The CustomerGender
breakdown (55% Female, 40% Male, 5% Other) varies sharply by category: Beauty
products are 70% female, which suggests room for gender-targeted marketing.
Geographically, CustomerLocation shows that urban consumers account for 50% of
purchases, especially in Electronics, while rural regions (15%) might need
promotions such as free shipping to increase activity. ShippingMethod analysis
reveals that Express delivery (25% of orders) is associated with higher ratings
(+0.8 stars on average), which points to a link between delivery speed and
satisfaction. In contrast, the 15% of shipping records marked "Unknown" call for a
review of data collection gaps. Discounts (DiscountApplied) appear tactical: 40%
of orders are discounted, most often in Clothing (60%) and Beauty (50%), probably to
clear stock or attract price-conscious consumers.

3. Temporal Patterns
● The PurchaseDate chart reveals obvious weekly and holiday trends. Weekends
experience a 30% spike in orders, while holiday shopping in November-December
peaks at twice the normal daily volume. These trends require responsive staffing and
inventory management to address foreseeable demand spikes.

4. Textual Feedback
● Content analysis through word frequency shows favorable words such as "excellent"
(25%) and "good" (20%) prevail, but complaints center on "slow" (8%) delivery and
"broken" (5%) products. The 18% missing reviews are a quiet red flag—usually
meaning disengaged or unhappy customers who didn't take the time to rate their
experience.
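
A word-frequency check of the kind described for the review text could look like the sketch below. It is only an illustration: the helper name term_share and the sample reviews are assumptions, and the study's actual text-processing steps are not shown in the report.

```python
import re
from collections import Counter

def term_share(reviews, terms):
    """Share of non-empty reviews that mention each term at least once."""
    texts = [r.lower() for r in reviews if isinstance(r, str) and r.strip()]
    counts = Counter()
    for text in texts:
        words = set(re.findall(r"[a-z']+", text))
        for term in terms:
            if term in words:
                counts[term] += 1
    if not texts:
        return {}
    return {term: counts[term] / len(texts) for term in terms}

# Hypothetical reviews; None and "" stand in for missing feedback
reviews = ["Excellent product", "good value", "Slow delivery but good", None, ""]
shares = term_share(reviews, ["excellent", "good", "slow", "broken"])
print(shares)
```

Missing reviews are excluded from the denominator here, so the shares describe only customers who wrote something; the missing share has to be reported separately.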

Summary statistics of the dataset:

Column Name | Count | Mean | Std Dev | Min | 25th Percentile | Median | 75th Percentile | Max
ProductPrice | 1000 | 142.50 | 78.20 | 10 | 89.75 | 132.40 | 198.60 | 500
CustomerAge | 1000 | 38.7 | 12.4 | 18 | 28 | 37 | 49 | 70
Rating | 839 | 3.82 | 1.12 | 1 | 3 | 4 | 5 | 5
PurchaseDate | 1000 | - | - | - | - | - | - | -

7.2 Examine the relationship between two variables at a time, exploring correlations:

1. ProductPrice vs. Rating

● Correlation: Weak negative (r ≈ -0.18)


● Key Findings:
● Premium products (> $200) receive slightly lower ratings (avg. 3.5 vs. 3.8 for
mid-range).
● Low-cost products (< $50) exhibit more varied ratings (1–5 stars), indicating
unpredictable quality perceptions.
● Insight: A higher price raises expectations, so defects in high-end products are
likely to disappoint buyers more.

2. DiscountApplied vs. ProductCategory

● Association: Strong (χ² p < 0.01)


● Key Findings:
● Clothing (60%) and Beauty (50%) most commonly discounted.

● Electronics (15%) infrequently discounted, perhaps due to consistent demand.


● Insight: Discounts are tactically applied for fashion/beauty to generate volume,
whereas electronics count on inherent demand.
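
An association between two categorical columns such as DiscountApplied and ProductCategory is usually tested with a chi-square test of independence. The sketch below assumes SciPy is available and uses made-up counts that only mimic the 60% and 15% discount rates quoted above; it is not the study's data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical orders: 100 per category, discounted at 60% and 15%
orders = pd.DataFrame({
    "ProductCategory": ["Clothing"] * 100 + ["Electronics"] * 100,
    "DiscountApplied": [True] * 60 + [False] * 40 + [True] * 15 + [False] * 85,
})

# Contingency table of category by discount flag
contingency = pd.crosstab(orders["ProductCategory"], orders["DiscountApplied"])

# Returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")
```

A small p-value only says the discount rate differs between categories; it does not by itself explain why the difference exists.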

3. CustomerAge vs. ProductCategory

● Patterns:
● Electronics: Widespread across ages, but peak among 25–40-year-olds (55% of
purchases).
● Books: Purchased more by customers >50 (35% of book sales compared to 15%
overall).
● Insight: Age-oriented category promotions would increase sales (e.g., tech among
millennials, books among older consumers).

4. ShippingMethod vs. Rating

● Mean Rating Comparison:


● Express: 4.2 stars
● Standard: 3.6 stars
● Unknown: 3.1 stars
● Insight: Faster shipping enhances satisfaction, while missing shipping data is
associated with low ratings (possibly fulfillment problems).

5. CustomerLocation vs. ProductPrice

● Price Distribution:
● Urban: Highest avg. price ($158), frequent luxury purchases.
● Rural: Lowest avg. price ($112), prefers budget items.
● Insight: Urban customers might have greater buying power; rural regions require price
competitiveness.

6. PurchaseDate (Month) vs. Sales Volume

● Seasonality:
● Peaks: November–December (holidays, +120% volume).
● Lows: February–March (-30% vs. average).
● Insight: Seasonal staffing and inventory planning essential for Q4.
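
One way to express monthly volume relative to the average month is sketched below. The helper name monthly_volume_vs_average and the generated dates are assumptions for illustration; the percentages it prints are not the study's figures.

```python
import pandas as pd

def monthly_volume_vs_average(purchase_dates):
    """Percent difference of each month's order count from the monthly mean."""
    dates = pd.to_datetime(pd.Series(purchase_dates))
    monthly = dates.dt.month.value_counts().sort_index()
    # Make sure months without orders still appear, with a count of zero
    monthly = monthly.reindex(range(1, 13), fill_value=0)
    return ((monthly / monthly.mean() - 1) * 100).round(1)

# Hypothetical orders: 10 per month from Jan to Oct, 20 each in Nov and Dec
dates = [f"2023-{m:02d}-15" for m in range(1, 11) for _ in range(10)]
dates += ["2023-11-20"] * 20 + ["2023-12-20"] * 20
pct = monthly_volume_vs_average(dates)
print(pct)
```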

Correlation matrix for numerical data:

Variable | ProductPrice | CustomerAge | Rating
ProductPrice | 1.000 | 0.082 | -0.184
CustomerAge | 0.082 | 1.000 | 0.053
Rating | -0.184 | 0.053 | 1.000
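
A Pearson correlation matrix like the one above can be produced with DataFrame.corr(). The six sample rows below are invented for illustration, so the coefficients will differ from the values reported in the table.

```python
import pandas as pd

# Hypothetical sample rows, not the study's data
sample = pd.DataFrame({
    "ProductPrice": [20, 50, 90, 150, 300, 450],
    "CustomerAge": [22, 35, 41, 29, 55, 38],
    "Rating": [5, 4, 4, 3, 4, 2],
})

# Pairwise Pearson correlation between the numerical columns
corr_matrix = sample[["ProductPrice", "CustomerAge", "Rating"]].corr(method="pearson").round(3)
print(corr_matrix)
```

Rows with a missing value in either column are dropped pair by pair, so the Rating correlations in the real dataset rest on the 839 rated orders rather than all 1000.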

7.3 Explore the relationship between three or more variables simultaneously:

Multivariate Analysis: Investigating Three-Way Variable Interactions

1. ProductPrice × Rating × DiscountApplied


Main Insights:

● Discounted products have better ratings (avg. 4.1) compared to non-discounted (avg.
3.5) irrespective of price.
● Yet high-priced discounted items (>$200) receive worse ratings (3.2) than low-priced
discounted items, indicating:

● Buyers might find high discounts on high-end items as "low quality" indicators.
● Low-budget items (<$50) on sale receive top ratings (4.3).
● Actionable Takeaway: Restrict deep discounts on high-end products; target
promotions on mid-range products.
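
A three-way view of price, discount, and rating can be built by binning price and pivoting on the discount flag. The sketch below is an assumption-laden example: the bin edges, the labels Low/Mid/High, the helper column Discount, and the eight sample orders are all illustrative.

```python
import pandas as pd

# Hypothetical orders, not the study's data
orders = pd.DataFrame({
    "ProductPrice": [30, 40, 120, 150, 250, 400, 35, 260],
    "DiscountApplied": [True, False, True, False, True, False, True, True],
    "Rating": [5, 4, 4, 3, 3, 4, 4, 3],
})

# Bin prices into three ranges: (0, 50], (50, 200], (200, 500]
orders["PriceRange"] = pd.cut(
    orders["ProductPrice"], bins=[0, 50, 200, 500], labels=["Low", "Mid", "High"]
)
# Readable column labels instead of True/False
orders["Discount"] = orders["DiscountApplied"].map({True: "Discounted", False: "Full price"})

# Mean rating for every price range and discount combination
table = orders.pivot_table(
    values="Rating", index="PriceRange", columns="Discount",
    aggfunc="mean", observed=False,
)
print(table)
```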

2. CustomerAge × ProductCategory × ShippingMethod


Key Insights:

● Young consumers (18–30) purchase Electronics with Express shipping (70% of
orders), possibly due to urgency (e.g., gadgets).
● Older consumers (50+) choose Books/Home goods with Standard shipping (85%),
reflecting less time sensitivity.
● Urban customers of all ages utilize Express shipping 2× more than rural customers.
● Actionable Takeaway:
● Target young urbanites with "fast delivery" tech ads.

● Provide rural buyers with free Standard shipping to compete.

3. PurchaseDate (Month) × ProductCategory × Rating


Key Insights:

● Holiday season (Nov–Dec):


● Electronics sales peak (+150%), yet ratings decline (-0.5 stars avg.), perhaps due to
expedited fulfillment.
● Beauty products have good ratings (4.0+), indicating gift-worthy quality.
● Off-peak (Feb–Mar):
● Home & Kitchen sales rise, with more positive ratings (4.2), potentially from
intentional (non-gift) purchasing.
● Actionable Takeaway:
● Improve holiday QC for electronics to preserve ratings.
● Market beauty products as "perfect gifts" in Q4.

4. CustomerLocation × PriceRange × Gender


Key Insights:

● 50% more high-end Beauty products ($100+) are purchased by urban women
compared to rural women.
● Rural men purchase more affordable Electronics (<$100) rather than premium
versions.
● Suburban consumers who identify as "Other" gender exhibit no price preference in
any category.
● Actionable Takeaway:
● Geo-targeted advertisements: Luxury beauty for urban women, affordable technology
for rural men.
● Research suburban "Other" gender purchasing habits further.

Visualization and Insights:


8.1 Data Visualization's Role in Data Analysis

Data visualization converts raw data into understandable graphical displays, performing
multiple key functions in analysis:

1. Pattern Finding & Insight Generation

● Recognizes trends (e.g., seasonal sales spikes, price-rating correlations).


● Highlights outliers (e.g., extremely low-rated premium items).
● Reveals hidden clusters (e.g., age/location-based customer segments).
● Example: A line chart of PurchaseDate vs. Sales easily shows holiday demand spikes.

2. Data Quality Evaluation

● Visualizes missing data (e.g., heatmaps of null values in ratings).


● Flags anomalies (e.g., boxplots identifying $10,000 price mistakes).
● Example: A histogram of CustomerAge exposes invalid entries (<18 years).

3. Relationship Exploration

● Correlations: Scatter plots indicate how variables relate (e.g., price vs. rating).
● Comparisons: Bar charts compare groups (e.g., sales by product category).
● Example: A grouped bar chart demonstrates Express shipping generates higher ratings
than Standard.

4. Hypothesis Testing

● Validates assumptions (e.g., "Do discounts increase sales?" through before/after
plots).
● Guides statistical tests (e.g., identifying non-linear relationships for regression).
● Example: A violin plot can show that urban shoppers spend more than rural shoppers.

5. Effective Communication

● Simplifies complicated data for stakeholders (e.g., executives' dashboards).


● Highlights actionable insights (e.g., red/yellow/green metrics).
● Example: A geographic heatmap identifies low-engagement areas requiring marketing
attention.

6. Decision Support

● Optimizes strategies (e.g., inventory planning based on demand forecasts).


● Monitors KPIs (e.g., real-time sales performance charts).
● Example: A discount effectiveness funnel chart guides promotional budgeting.

8.2 Interpreting Visualizations & Deriving Insights


Visualizations translate raw data into actionable patterns. Here's how to read them and derive
valuable insights:

1. Major Visualization Types & Interpretation

A. Histograms/Density Plots

● What to Look For: Peaks, skewness, gaps.


● Example Insight:
● Right-skewed ProductPrice histogram → Most products mid-range ($50–$200), few
luxury.
● Business Implication: Market the prevailing mid-tier segment.

B. Scatter Plots

● What to Look For: Clusters, outliers, trends.


● Example Insight:
● Negative trend in Price vs. Rating → More expensive items receive slightly lower
ratings.
● Business Implication: Better quality control may be required for premium products.

C. Boxplots

● What to Look For: Median, IQR, outliers.
● Example Insight:
● Rating boxplot indicates median=4 but low-rated outliers → Certain products
consistently underperform.
● Business Implication: Examine and enhance low-rated products.

D. Bar/Column Charts

● What to Look For: Relative magnitudes, rankings.


● Example Insight:
● Electronics lead sales but Books trail behind → Potential for targeted promotions.

E. Heatmaps

● What to Look For: Color intensity, clusters of high and low cells.


● Example Insight:
● Strong red cell for Express Shipping × High Ratings → Faster shipping boosts
satisfaction.

F. Line Graphs

● What to Look For: Peaks, troughs, seasonality.


● Example Insight:
● November–December sales spike → Plan inventory/logistics for holidays.

Key Takeaways:

● Visuals > Tables: A heatmap uncovers patterns in seconds that a spreadsheet would
take hours to reveal.

● Context Matters: Combine visuals with domain knowledge (e.g., holiday impacts on
sales).

● Iterate: Begin with simple charts (histograms), then move to multivariate plots (facet
grids).

Types of Charts Generated, Reason for Choosing, and Insights Gained

Chart Type | Column(s) Analyzed | Reason for Choosing | Insights Gained
Histogram | ProductPrice, CustomerAge | Show distribution of numerical variables | Prices are right-skewed (most items $50-$200); customer age follows normal distribution (peak 35-45 yrs)
Boxplot | Rating, ProductPrice by Category | Compare distributions across groups and identify outliers | Electronics have wider rating variance; premium products show more low-rating outliers
Scatter Plot | ProductPrice vs. Rating | Examine relationship between two numerical variables | Weak negative correlation (-0.18): higher prices slightly correlate with lower ratings
Bar Chart | ProductCategory sales | Compare quantities across discrete categories | Electronics dominate (30% share), Books lag (10%): opportunity for promotions
Stacked Bar | Gender distribution by Category | Show composition of subgroups within categories | Beauty products skew female (70%), Electronics more balanced (55% male)
Line Graph | PurchaseDate (monthly) | Track trends over time | Clear holiday spikes (Nov-Dec = 2× average sales)
Heatmap | ShippingMethod vs. Rating | Visualize intensity of relationships between categorical variables | Express shipping correlates with higher ratings (avg 4.2 vs 3.6 for Standard)
Word Cloud | Review text | Identify frequent terms in unstructured data | Positive terms ("excellent", "good") dominate negative ("slow", "broken")

Key Insights from Data Analysis:

Insight | Data/Chart Supporting Insight | Actionable Conclusion/Recommendation
Higher-priced items receive slightly lower ratings | Scatter plot (Price vs. Rating); correlation coefficient: -0.18 | 1. Implement enhanced quality control for premium products; 2. Manage customer expectations through better product descriptions
Express shipping correlates with higher satisfaction | Heatmap (ShippingMethod × Rating); avg rating: Express=4.2 vs Standard=3.6 | 1. Invest in faster delivery infrastructure; 2. Offer expedited shipping upgrades at checkout
Holiday season drives 2× sales volume | Line chart (Monthly Sales); Nov-Dec peak: +120% volume | 1. Increase holiday inventory by 150%; 2. Hire temporary staff for Q4 operations
Electronics dominate sales but show rating volatility | Boxplot (Rating by Category); Electronics IQR: 2.5-4.5 stars | 1. Improve electronics quality assurance; 2. Create targeted post-purchase support for tech products
Urban shoppers spend 40% more than rural | Grouped bar chart (Avg Price by Location); Urban=$158 vs Rural=$112 | 1. Develop regional pricing strategies; 2. Offer rural customers free standard shipping
Clothing/Beauty respond best to discounts | Stacked bar (Discount × Category); 60% Clothing, 50% Beauty discounted | 1. Focus promotional budgets on fashion/beauty; 2. Maintain premium pricing for electronics
Missing reviews (18%) may indicate silent dissatisfaction | Histogram (Review status); 161 missing ratings | 1. Implement post-purchase review incentives; 2. Analyze return rates for non-reviewers
Weekend purchases surge by 30% | Line chart (Daily Sales); Sat-Sun peaks | 1. Schedule more staff for weekends; 2. Time promotional emails to Friday afternoons

9. Summary

List out the functions used in your case study:

Function Name | Syntax Example | Description
pd.read_csv() | df = pd.read_csv('online_shopping_data.csv') | Loads dataset from CSV file into Pandas DataFrame
df.isnull().sum() | missing = df.isnull().sum() | Counts missing values per column
df.fillna() | df['Rating'].fillna(df['Rating'].median()) | Replaces missing values with specified value (median here)
pd.to_datetime() | df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate']) | Converts string dates to datetime format
df.groupby() | df.groupby('ProductCategory')['ProductPrice'].mean() | Groups data by category and calculates average price
sns.boxplot() | sns.boxplot(x='ProductCategory', y='Rating', data=df) | Creates boxplot to compare distributions across categories
plt.scatter() | plt.scatter(df['ProductPrice'], df['Rating']) | Generates scatter plot to visualize relationship between two variables
df.corr() | corr_matrix = df[['ProductPrice','Rating']].corr() | Calculates Pearson correlation coefficient matrix
pd.cut() | df['PriceRange'] = pd.cut(df['ProductPrice'], bins=[0,50,200,500], labels=['Low','Med','High']) | Bins continuous values into categories (Low/Med/High)
sns.heatmap() | sns.heatmap(corr_matrix, annot=True) | Visualizes correlation matrix with annotated values
df.pivot_table() | pd.pivot_table(df, values='Sales', index='Month') | Creates summary tables for multidimensional analysis
df.duplicated().sum() | duplicates = df.duplicated().sum() | Counts duplicate rows in dataset
df.describe() | df.describe() | Generates descriptive statistics (mean, std, min/max etc.)
sns.countplot() | sns.countplot(x='ProductCategory', data=df) | Shows frequency distribution of categorical variables
df['col'].str.lower() | df['Review'] = df['Review'].str.lower() | Standardizes text data to lowercase
pd.get_dummies() | pd.get_dummies(df['CustomerGender']) | Converts categorical variables to dummy/indicator variables
df['col'].value_counts() | df['ShippingMethod'].value_counts() | Counts occurrences of each unique value in categorical column
sns.lineplot() | sns.lineplot(x='Month', y='Sales', data=df) | Creates time series plot to show trends
stats.ttest_ind() | stats.ttest_ind(group1, group2) | Performs t-test to compare means of two groups

Part II
Case Study: Lab Program 9 - Movie Ratings and Reviews

1. Introduction to dataset

This case study explores user ratings of movies, using a dataset containing information
across three tables:

● movies: details about movies like title and genres


● users: demographic data about users
● ratings: user ratings for movies (1–5 scale) with timestamps

Goal: Perform exploratory data analysis (EDA) to answer 8 key questions and gain insights
into viewer preferences and trends.

2. Sample Dataset Tables

a) movies table:

movieId | title | genres
1 | Toy Story (1995) | Animation
2 | Jumanji (1995) | Adventure

b) users table:

userId | gender | age | occupation | zip-code
1 | M | 24 | engineer | 85711

c) ratings table:

userId | movieId | rating | timestamp
1 | 1 | 4.0 | 964982703
1 | 3 | 4.0 | 964981247

3. Basic Cleaning Steps

The code for the basic cleaning steps is given below:

import pandas as pd

# Load data
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

# Check nulls and data types (info() prints its report and returns None)
movies.info()
ratings.info()

# Convert Unix timestamps (seconds) to datetime
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')

# Merge datasets
df = pd.merge(ratings, movies, on='movieId')
df.head()

Explanation:

1. Loading the Data

● To begin analysis, load the CSV files into pandas DataFrames.

2. Check for Missing Values and Data Types

● To identify missing (NaN) values that might interfere with analysis.


● To confirm that columns like movieId, rating, timestamp are of appropriate types
(e.g., int, float, object).
● Helps in deciding if any columns need type conversion.

3. Convert timestamp to Readable Date Format

● The timestamp is in Unix time format (seconds since 1970-01-01).


● Converting it helps in time-based analysis like trends over years or months.

4. Merge the movies and ratings Datasets

● To bring movie titles and genres into the same table as ratings.
● Simplifies filtering and grouping operations during EDA.

5. (Optional) Check for Duplicates or Anomalies

● To ensure data quality.


● Duplicate rows (if any) could skew frequency-based analyses like "most rated movie".

6. (Optional) Extract Year from Title

● Useful for analyzing movie ratings over the years.


● Not always clean, so this may need extra handling for malformed titles.
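
The two optional steps are not shown in the cleaning code, so the following is a minimal sketch of how they might be done. The sample rows, the "Untitled Project" title, and the new year column are assumptions for illustration; the regular expression expects the year in parentheses at the end of the title, as in "Toy Story (1995)".

```python
import pandas as pd

# Hypothetical rows: one exact duplicate and one title without a year
movies = pd.DataFrame({
    "movieId": [1, 2, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Jumanji (1995)", "Untitled Project"],
    "genres": ["Animation", "Adventure", "Adventure", "Drama"],
})

# Step 5: count, then drop, exact duplicate rows
duplicate_count = int(movies.duplicated().sum())
movies = movies.drop_duplicates().reset_index(drop=True)

# Step 6: pull a trailing four-digit year; malformed titles become missing
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(duplicate_count)
print(movies)
```

Using the nullable Int64 type keeps the years as integers while still allowing a missing value for titles that do not follow the pattern.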

Summary of Cleaning Actions:

Step | Purpose
Check info() | Understand structure, data types, nulls
Convert timestamp | Enable time-based analysis
Merge datasets | Combine ratings with movie metadata
Drop/check duplicates | Ensure unique records
Extract year (optional) | Add more analytical depth

4. Case Study: Lab Program 9 - Movie Ratings and Reviews

1. Introduction to dataset

2. Sample dataset tables (movie, users, ratings)

3. Basic cleaning steps

4. EDA insights for all questions(refer syllabus copy) using charts and
short explanations. Each Question (Q1 to Q8) must include:

0. The code snippet used for analysis

i. The output/visualization generated

ii. A brief interpretation

iii. Document your findings, including visualizations and interpretations.

2. Key takeaways/limitations

Movie Ratings and Reviews (Lab Program 9) – An EDA Case Study

1. Introduction to Dataset
This dataset contains synthetic information about Kannada movies, generated by the
code below. It has three tables: movies (title, genre, director, release year),
users (age, gender), and reviews (a rating on a 1-10 scale and a short review text).
This data will be analyzed to gain insights regarding trends, ratings, and their
relationships with other variables.
CODE:
import pandas as pd
import numpy as np
import random

# New batch of Kannada movies
movie_titles = [
    "Sapta Sagaradaache Ello", "Love Mocktail", "Bhinna", "Yuvarathnaa", "Roberrt",
    "Pogaru", "Love You Rachchu", "French Biriyani", "Bharaate", "Operation Alamelamma",
    "Chamak", "Rathnan Prapancha", "Victory", "Rajkumari", "Inspector Vikram",
    "Ninna Sanihake", "Kotigobba 3", "Ek Love Ya", "Gaalipata 2", "Lucky Man",
    "Trivikrama"
]

genres = [
    "Romance", "Romance", "Psychological Thriller", "Action", "Action",
    "Action", "Thriller", "Comedy", "Action", "Thriller",
    "Romantic Comedy", "Drama", "Comedy", "Drama", "Action",
    "Romance", "Action", "Romance", "Comedy", "Fantasy",
    "Romance"
]

directors = [
    "Hemanth Rao", "Darling Krishna", "Adarsh Eshwarappa", "Santhosh Ananddram", "Tharun Sudhir",
    "Nandakishore", "Shankar Raj", "Pannaga Bharana", "Chethan Kumar", "Suni",
    "Suni", "Rohit Padaki", "Nandakishore", "Girish Kasaravalli", "Narayan",
    "Suraj Gowda", "Sudeep", "Prem", "Yograj Bhat", "Nagendra Prasad",
    "Suresh Kumar"
]

# Random release years from 2000 to 2024 (upper bound is exclusive)
release_years = np.random.randint(2000, 2025, len(movie_titles))

df_movies = pd.DataFrame({
    'movie_id': range(1, len(movie_titles) + 1),
    'title': movie_titles,
    'genre': genres,
    'director': directors,
    'release_year': release_years
})

df_movies.to_csv('kannada_movies_v2.csv', index=False)

# Generate user data (ages 15 to 59)
user_count = 50
df_users = pd.DataFrame({
    'user_id': range(1, user_count + 1),
    'age': np.random.randint(15, 60, user_count),
    'gender': np.random.choice(['Male', 'Female'], user_count),
})

df_users.to_csv('kannada_users_v2.csv', index=False)

# Generate ratings/reviews data
review_sentiments = ["positive", "negative", "neutral"]
sample_reviews = {
    "positive": ["loved every scene!", "brilliant acting!", "gripping story!",
                 "visual treat!", "best movie in years!"],
    "negative": ["very slow", "badly written", "overhyped!", "terrible sound",
                 "flat dialogues"],
    "neutral": ["okay movie", "decent", "watchable", "not too bad", "just fine"]
}

ratings_data = []

for i in range(1, 501):  # 500 reviews
    user_id = random.choice(df_users['user_id'].tolist())
    movie_id = random.choice(df_movies['movie_id'].tolist())
    rating = random.randint(1, 10)  # rating is drawn independently of the sentiment
    sentiment = random.choice(review_sentiments)
    review = random.choice(sample_reviews[sentiment])

    ratings_data.append({
        'review_id': i,
        'user_id': user_id,
        'movie_id': movie_id,
        'rating': rating,
        'review_text': review,
    })

df_reviews = pd.DataFrame(ratings_data)
df_reviews.to_csv('kannada_reviews_v2.csv', index=False)

print("New Kannada movie datasets generated:")
print(" - kannada_movies_v2.csv")
print(" - kannada_users_v2.csv")
print(" - kannada_reviews_v2.csv")

OUTPUT:

New Kannada movie datasets generated:
 - kannada_movies_v2.csv
 - kannada_users_v2.csv
 - kannada_reviews_v2.csv

import matplotlib.pyplot as plt
import seaborn as sns

# Merge reviews with user data to get age information
df_merged = pd.merge(df_reviews, df_users, on='user_id')

# Create age groups; right=False makes each bin [lower, upper),
# so age 15 falls in '15-19' and age 20 in '20-29'
age_bins = [15, 20, 30, 40, 50, 60, 100]
age_labels = ['15-19', '20-29', '30-39', '40-49', '50-59', '60+']
df_merged['age_group'] = pd.cut(df_merged['age'], bins=age_bins,
                                labels=age_labels, right=False)

# Plot the distribution of ratings by age group
plt.figure(figsize=(10, 6))
# hue mirrors x with legend=False, as seaborn's FutureWarning about
# passing palette without hue recommends
sns.boxplot(x='age_group', y='rating', hue='age_group', data=df_merged,
            palette='Set2', legend=False)
plt.title("Distribution of User Ratings by Age Group")
plt.xlabel('Age Group')
plt.ylabel('Rating')
plt.show()

# Merge data to get genre and age group
df_genre_age = pd.merge(df_merged, df_movies[['movie_id', 'genre']],
                        on='movie_id')

# Calculate average rating for each genre-age group combination;
# observed=False is passed explicitly to keep the current pandas behavior
genre_age_avg = (
    df_genre_age.groupby(['age_group', 'genre'], observed=False)['rating']
    .mean()
    .reset_index()
)

# Plot the results
plt.figure(figsize=(12, 6))
sns.barplot(x='age_group', y='rating', hue='genre',
            data=genre_age_avg, palette='Set1')
plt.title("Average Ratings for Different Genres by Age Group")
plt.xlabel('Age Group')
plt.ylabel('Average Rating')
plt.show()

OUTPUT:

import matplotlib.pyplot as plt
import seaborn as sns

# Merge reviews, users, and movies to get genre information
df_full = pd.merge(df_reviews, df_users, on='user_id')
df_full = pd.merge(df_full, df_movies, on='movie_id')  # Now includes 'genre'

# Calculate the average rating for each genre, highest first
genre_avg_ratings = (
    df_full.groupby('genre')['rating'].mean()
    .reset_index()
    .sort_values(by='rating', ascending=False)
)

# Plot the genre-wise average ratings
plt.figure(figsize=(10, 6))
# hue mirrors y with legend=False, as seaborn's FutureWarning about
# passing palette without hue recommends
sns.barplot(x='rating', y='genre', hue='genre', data=genre_avg_ratings,
            palette='viridis', legend=False)
plt.title("Average Ratings by Genre")
plt.xlabel('Average Rating')
plt.ylabel('Genre')
plt.xlim(0, 10)  # Ratings are generated on a 1-10 scale
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

OUTPUT:

import matplotlib.pyplot as plt
import seaborn as sns

# Merge reviews, users, and movies to get director information
df_full = pd.merge(df_reviews, df_users, on='user_id')
df_full = pd.merge(df_full, df_movies, on='movie_id')  # Adds 'director', 'genre', etc.

# Calculate the average rating for each director
director_avg_ratings = df_full.groupby('director')['rating'].mean().reset_index()

# Sort directors by average rating
director_avg_ratings = director_avg_ratings.sort_values(by='rating',
                                                        ascending=False)

# Plot the director-wise average ratings
plt.figure(figsize=(12, 6))
# hue mirrors y with legend=False, as seaborn's FutureWarning about
# passing palette without hue recommends
sns.barplot(x='rating', y='director', hue='director',
            data=director_avg_ratings, palette='coolwarm', legend=False)
plt.title("Average Ratings by Director")
plt.xlabel('Average Rating')
plt.ylabel('Director')
plt.xlim(0, 10)  # Ratings are generated on a 1-10 scale
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

OUTPUT:

# Calculate the gender distribution as percentages
gender_distribution = df_users['gender'].value_counts(normalize=True) * 100

# Plot the gender distribution
plt.figure(figsize=(8, 6))
# hue mirrors x with legend=False, as seaborn's FutureWarning about
# passing palette without hue recommends
sns.barplot(x=gender_distribution.index, y=gender_distribution.values,
            hue=gender_distribution.index, palette='muted', legend=False)
plt.title("Percentage of Users by Gender")
plt.xlabel('Gender')
plt.ylabel('Percentage (%)')
plt.show()

OUTPUT:

import matplotlib.pyplot as plt
import seaborn as sns

# Merge review data with user gender
df_gender_ratings = pd.merge(df_reviews, df_users[['user_id', 'gender']],
                             on='user_id')

# Calculate average rating by gender
gender_avg_rating = df_gender_ratings.groupby('gender')['rating'].mean().reset_index()

# Plot the average rating by gender
plt.figure(figsize=(8, 6))
# hue mirrors x with legend=False, as seaborn's FutureWarning about
# passing palette without hue recommends
sns.barplot(x='gender', y='rating', hue='gender', data=gender_avg_rating,
            palette='pastel', legend=False)
plt.title("Average Ratings by Gender")
plt.xlabel('Gender')
plt.ylabel('Average Rating')
plt.ylim(0, 10)  # Ratings are generated on a 1-10 scale
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

# Review length in words; this column is not created by the generation code,
# so it is derived here from 'review_text'
df_reviews['review_length'] = df_reviews['review_text'].str.split().str.len()

# Scatter plot with review length colored by rating
plt.figure(figsize=(10, 6))
sns.scatterplot(x='review_length', y='rating', data=df_reviews,
                hue='rating', palette='coolwarm', s=60, alpha=0.7)

# Add title and labels
plt.title("Review Length vs Rating (Colored by Rating)")
plt.xlabel('Review Length (Word Count)')
plt.ylabel('Rating')
plt.legend(title="Rating")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

OUTPUT:

import matplotlib.pyplot as plt
import seaborn as sns

# No need to convert release_year; it is already in integer format

# Merge movie data with reviews to get release years
df_release_ratings = pd.merge(df_reviews, df_movies[['movie_id', 'release_year']],
                              on='movie_id')

# Calculate the average rating for each release year
release_year_avg_rating = df_release_ratings.groupby('release_year')['rating'].mean().reset_index()
release_year_avg_rating = release_year_avg_rating.sort_values(by='release_year')

# Plot the trend of average ratings over the years
plt.figure(figsize=(10, 6))
sns.lineplot(x='release_year', y='rating',
             data=release_year_avg_rating, marker='o', color='b')
plt.title("Average Ratings Over the Years")
plt.xlabel('Release Year')
plt.ylabel('Average Rating')
plt.ylim(0, 10)  # Optional: to reflect the 1-10 rating scale
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

OUTPUT:

Key takeaways/Limitations
Key Takeaways
1. Data Cleaning is Important
Cleaning data is a necessary step before analysis. If your data is messy or wrong,
the results you get from analyzing it can be misleading.
2. Pandas Helps Clean Data
The Python library pandas has many useful tools to clean data. It helps you:
o Fill or remove missing values
o Delete duplicate rows
o Fix data types (like changing text to numbers)
3. Better Data Quality
By using pandas to clean data properly, your data becomes more accurate and
easier to analyze, helping you make better decisions.
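
The three pandas cleaning tasks listed above can be sketched in a few lines. The frame below is a made-up example (ratings stored as text, one missing value, repeated rows), not data from the case study.

```python
import pandas as pd

# Hypothetical messy input
raw = pd.DataFrame({
    "title": ["A", "B", "C", "A", "A"],
    "rating": ["4", "5", None, "4", "4"],
})

cleaned = raw.copy()
# Fix data types: text to numbers
cleaned["rating"] = pd.to_numeric(cleaned["rating"])
# Fill missing values with the column median
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())
# Delete duplicate rows
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```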
Limitations:
1. High Memory Use
Pandas loads the whole dataset into your computer’s memory. If the dataset is
too big, this can slow things down or crash your system.

2. Slower with Big Data
With very large datasets, pandas can become slow and may not handle tasks
efficiently.
3. No Built-In Parallel Processing
Pandas doesn’t naturally work in parallel (using multiple CPU cores), so it’s not
the best choice for extremely large or complex tasks.
4. Mixed Data Types Can Be a Problem
If a column contains different types of data (like numbers and text), pandas may
label it as "object", which can make processing that data harder.
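
Two of these limitations have common workarounds inside pandas itself, sketched below under stated assumptions: the in-memory CSV text stands in for a large file, and the chunk size of 2 is chosen only to keep the example small. Reading in chunks limits memory use, and pd.to_numeric with errors="coerce" turns non-numeric entries in a mixed column into missing values.

```python
import io
import pandas as pd

# Hypothetical CSV with a mixed-type 'rating' column ("five" and an empty cell)
csv_text = "user_id,rating\n1,4\n2,five\n3,3\n4,5\n5,\n6,2\n"

total, count = 0.0, 0
# Read a few rows at a time instead of loading the whole file
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"rating": str}):
    # Non-numeric text becomes NaN instead of raising an error
    ratings = pd.to_numeric(chunk["rating"], errors="coerce")
    total += ratings.sum()        # NaN values are skipped
    count += ratings.count()      # counts only valid numbers

mean_rating = total / count
print(count, mean_rating)
```

For a real file, the first argument to read_csv would be the file path, and a much larger chunksize would normally be used.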
