Edaaaa
Report
On
BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE AND ENGINEERING
By:
ROHAN N L [1JB23CS125]
MITHUN M [1JB23CS089]
PRATHVI RAJ [1JB23CS1]
NIKHIL K P [1JB24CS408]
LIKN N [1JB24CS406]
PARTHA H C [1JB24CS409]
Part 1
● Second, we must understand who is purchasing these goods. The data set has age,
gender, and location, so we can segment shopping behavior into various segments. Do
younger shoppers like some product categories? Are rural and urban shoppers
spending differently? This allows companies to better focus their marketing.
● Not all products have ratings or reviews—why? Are customers less likely to review
items they dislike? We’ll also see if shipping speed or product category affects
ratings. If fast shipping leads to better reviews, companies might prioritize logistics
improvements.
● Real-world data is messy, so we’ll fix missing values, remove duplicates, and
standardize text (like making all reviews lowercase). Clean data means more accurate
insights—no one wants skewed results because of typos or blank entries.
6. Providing Business Recommendations
● Lastly, we'll translate findings into actionable recommendations. If data indicate that
customers dislike slow shipping, we'll recommend quicker delivery methods. If
specific products are unpopular, we could advise discounts or improved marketing.
The aim is to enable companies to make wiser decisions with the help of actual
customer behavior.
1. Customer Information
2. Product Details
This reveals which categories drive purchases and how prices influence buying
decisions.
3. Purchase Logistics
4. Customer Feedback
5. Metadata
The synthetic dataset mimics real-life e-commerce data, assisting organizations in knowing
more about their consumers and making fact-based decisions. It gives an insight into
consumer preferences, price strategies, and operational optimizations directly affecting sales
and satisfaction.
● Reviews and ratings identify drivers of satisfaction. Quick shipping is associated with
improved ratings, while incomplete reviews could be a sign of dissatisfaction. Such
insights can assist companies in enhancing logistics and product quality.
● The dataset offers a strong basis for examining price elasticity by product category.
We expect to identify price points that balance sales volume against profit margins.
The study should indicate which product categories are most sensitive to price, and
hence most likely to gain from discounting, as opposed to premium products for which
customers are less price-sensitive. We will investigate whether promotions increase
the quantity purchased or only shift when purchases are made. We can also determine
which demographics respond better to promotions, so that discounts can be targeted
more strategically.
● The rating and review data holds vast potential for customer sentiment. We hope to
find emergent themes in negative and positive comments, potentially uncovering
persistent product strengths or repeated problems. Missing review analysis might
show silent dissatisfaction or where feedback can be more effectively gathered. We
will explore whether review sentiment aligns with certain product features, shipping
experiences, or demographic variables. These findings can inform product
enhancements and customer service improvements.
These findings will be useful in inventory planning, promotional scheduling, and
staffing decisions.
● The data allows us to drill down into the way various demographic segments use the
e-commerce website. We anticipate observing differences in browsing behavior,
purchase frequency, average order value, and product interests by age, gender, and
location segments. These insights can inform more targeted marketing strategies and
possibly identify underserved customer segments that represent opportunities for
growth.
1. Transactional Data
● Purchase Records: Every row is a complete transaction with specific identifiers
● Product Details: Actual items purchased with categories and pricing
● Timestamps: Actual dates/times purchases were made
3. Product Information
5. Operational Data
6. Behavioral Data
Types of Data Used in the dataset:
Column Name | Data Type | Description | Category | Further Classification
ProductCategory | String (Text) | Category of purchased product | Qualitative | Nominal (Electronics, Clothing, etc.)
CustomerLocation | String (Text) | Location type (Urban/Suburban/Rural) | Qualitative | Nominal
3.2 Importance of Data Type Identification in Data Analysis
● Determining data types is important because it decides which analytical methods can
be used. Numerical data such as product prices or customer ages allow mathematical
calculations and statistical tests like regression analysis. Categorical data such as
product categories or customer locations need different methods like frequency
counting or chi-square tests. Applying the wrong methods to data types results in
nonsensical results - you can't compute an average of nominal categories such as
shipping methods.
● Recognition of data types has a direct influence on how we treat missing values and
outliers. Numerical missing values may be imputed with medians, whereas categorical
missing values may require "Unknown" flags. Dates need special parsing to allow
time-series analysis, and text fields such as reviews require text-specific
preprocessing. Correct typing avoids mistakes such as treating customer IDs as
numerical values or sorting dates alphabetically as strings.
● Choosing a visualization depends on accurate data typing. Continuous variables such
as prices suit histograms and scatter plots, while discrete categories are
represented with bar charts or pie charts. Time-based data needs timeline
visualizations, and textual data can be shown with word clouds or sentiment analysis
outputs. Incorrectly typed data produces misleading charts that misrepresent the
actual patterns in the data.
● Machine learning and large-scale analytics require type-specific transformations. Age
values can be binned into ordinal categories, product categories can be one-hot
encoded, and so on. Text reviews call for natural language processing rather than
the numerical analysis used for ratings. Predicting customer gender (a categorical
target) is a different modeling task from forecasting purchase value (a continuous target).
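As a minimal sketch of the typing checks described above, the following uses a small hypothetical sample. The column names follow this report where possible, but the values and the `CustomerID` column are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
sample = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "PurchaseDate": ["2023-11-05", "2023-02-14", "2023-10-01"],
    "ProductPrice": [25.0, 310.5, 89.9],
    "ShippingMethod": ["Express", "Standard", "Express"],
})

# IDs are labels, not quantities: store them as text so they are never averaged.
sample["CustomerID"] = sample["CustomerID"].astype(str)

# Parse dates so they sort chronologically instead of alphabetically.
sample["PurchaseDate"] = pd.to_datetime(sample["PurchaseDate"])

# Nominal categories support counting, not arithmetic.
sample["ShippingMethod"] = sample["ShippingMethod"].astype("category")

# Only genuinely numerical columns remain candidates for means or regression.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
earliest = sample["PurchaseDate"].min()
method_counts = sample["ShippingMethod"].value_counts()

print(sample.dtypes)
print(numeric_cols)
```

After the conversions, only `ProductPrice` is left in `numeric_cols`, which is the behavior we want before computing averages or correlations.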
5. Facilitates Right Business Interpretation
● Finally, data typing connects raw numbers to business intelligence. Stakeholders must
understand whether a number is an ID (nominal), a rating (ordinal), or a quantifiable
amount (interval/ratio) in order to make appropriate decisions. Clear typing avoids
misinterpretations such as thinking higher category numbers mean better performance
when they're merely labels. This knowledge converts data from abstract values to
actionable business intelligence.
4.1 What are some common issues you might face when analyzing datasets
like this?
When working with this online shopping dataset, I ran into several
problems that made the analysis tricky. Here are the main issues I noticed:
1. Missing Information
The dataset has a lot of empty spots, especially in the ratings (161 missing), reviews (188
missing), and shipping methods (340 missing). This is a big problem because:
● If unhappy customers aren't leaving ratings, our satisfaction analysis might be too positive
● Without shipping methods, we can't properly analyze delivery performance
● I had to decide whether to fill in these gaps or remove the incomplete rows
● Prices range from $10 to $500, but there might be mistakes (like a $10,000 phone that
slipped in)
● Some reviews are just one word ("good") while others are longer, making them hard to
compare
● Categories like "Electronics" are very broad - is a $20 charger really the same as a
$500 TV?
3. Time-Related Problems
The purchase dates cover one year, but:
● We don't know if this includes holiday seasons when shopping patterns change
● There's no time of day recorded, which could show when people shop most
● Prices might have changed during the year, but we can't see that
4. Limited Customer Details
While we have age, gender and location:
● "Urban/Suburban/Rural" is pretty vague - a big city urban is different from small town urban
● We don't know things like income or shopping frequency that would help explain behavior
● Gender is just Male/Female/Other - this might be too simple for proper analysis
4.2 Why is exploratory data analysis (EDA) important in the context of this dataset?
Exploratory Data Analysis (EDA) is important for this online shopping dataset since it
uncovers buried patterns, identifies potential issues, and aids in meaningful analysis. The
following are reasons why EDA is so important for this dataset:
● Prior to entering into extensive analysis, EDA indicates problems such as:
● Missing values (e.g., blank ratings or shipping modes) that might skew results.
● Inconsistencies (such as typos in product groups such as "Electronics" vs.
"Eletronics").
● Without EDA, these issues might result in false conclusions—such as concluding all
customers are satisfied when low ratings are simply absent.
● The data is in mixed types (numbers, categories, text), so EDA guides selection of
appropriate tools:
● Numerical data (prices, ratings): Apply statistics (mean, median) or correlation tests.
● Categorical data (product categories, locations): Apply bar charts or frequency tables.
● Text data (reviews): Look for popular keywords ("love," "broken") to assess
sentiment.
● Without EDA, you could be wasting time performing regression on categories (such
as ProductCategory) that require another strategy.
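The first-pass checks listed above can be sketched as follows. This is an illustrative example on a small invented DataFrame (including the "Eletronics" typo mentioned earlier), not the project's actual notebook code.

```python
import pandas as pd

# Invented rows that mimic the issues EDA is meant to surface.
df = pd.DataFrame({
    "ProductPrice": [20.0, 45.0, 120.0, 480.0],
    "Rating": [4.0, None, 5.0, 2.0],
    "ProductCategory": ["Books", "Electronics", "Eletronics", "Electronics"],
    "Review": ["love it", None, "excellent", "arrived broken"],
})

# 1. Missing values per column.
missing_per_column = df.isnull().sum()

# 2. Numerical data: summary statistics (count, mean, quartiles, ...).
price_summary = df["ProductPrice"].describe()

# 3. Categorical data: a frequency table exposes misspelled labels.
category_counts = df["ProductCategory"].value_counts()

# 4. Text data: count reviews that mention selected keywords.
keywords = ["love", "broken"]
keyword_hits = {
    word: int(df["Review"].str.contains(word, na=False).sum())
    for word in keywords
}

print(missing_per_column)
print(category_counts)
print(keyword_hits)
```

In the frequency table, "Eletronics" appears as its own category with a single row, which is how a typo of this kind is usually spotted.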
4. Checking Assumptions
5. Data Preprocessing
5.1 What is data preprocessing, and why is it crucial before performing analysis?
Data preprocessing refers to the cleaning, transformation, and organization of raw data into a
structured format that is ready for analysis. It entails dealing with missing values, eliminating
noise, standardizing formats, and changing data types to guarantee accuracy and consistency.
1. Handles Missing Data
● Missing values in important columns such as Rating (161 missing), Review (188
missing), and ShippingMethod (340 missing) exist in the dataset.
● Solution: Replace missing ratings with the median, mark missing shipping methods as
"Unknown," or remove incomplete records to prevent skewed results.
2. Resolves Inconsistencies
● Text fields such as Review can contain typos, mixed cases ("GOOD" vs. "good"), or
irrelevant entries.
● Solution: Normalize text (lowercase, strip whitespace) and spell-check to maintain
consistency.
3. Detects Outliers
● Outlying values in ProductPrice (e.g., a $10,000 item in a $10–$500 range) can skew
analysis.
● Solution: Apply IQR or z-score techniques to identify and eliminate unrealistic
values.
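The three preprocessing steps above can be sketched in pandas as shown below. The rows are invented for illustration, and the 1.5 × IQR fence is the conventional choice rather than a value stated in this report.

```python
import pandas as pd

# Invented rows containing the three kinds of problems described above.
df = pd.DataFrame({
    "Rating": [5.0, 3.0, None, 4.0, 2.0],
    "Review": ["  GOOD ", None, "Broken", "good", "Excellent"],
    "ShippingMethod": ["Express", None, "Standard", None, "Express"],
    "ProductPrice": [25.0, 40.0, 10000.0, 60.0, 55.0],
})

# 1. Missing data: median for ratings, explicit labels for categories and text.
df["Rating"] = df["Rating"].fillna(df["Rating"].median())
df["ShippingMethod"] = df["ShippingMethod"].fillna("Unknown")

# 2. Inconsistencies: strip whitespace and lowercase, then flag missing reviews.
df["Review"] = df["Review"].str.strip().str.lower().fillna("No Review")

# 3. Outliers: flag prices outside the 1.5 * IQR fences.
q1, q3 = df["ProductPrice"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["PriceOutlier"] = ~df["ProductPrice"].between(lower, upper)

print(df)
```

Flagging outliers in a separate column, instead of deleting them immediately, keeps the decision to drop or keep them visible and reversible.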
Missing values summary:
Column Name | Total Missing Values | Percentage of Missing Values | Method Used to Handle Missing Data | Reason for Choosing the Method
Rating | 161 | 16.1% | Median imputation | Robust to outliers in rating distribution
Review | 188 | 18.8% | "No Review" category | Preserves meaningful absence of feedback
ShippingMethod | 340 | 34.0% | "Unknown" category | Accounts for systemically missing shipping data
Column | Transformation | Purpose | Impact
Rating | Median imputation (fill NaN with 3.0) | Preserve dataset size while maintaining rating distribution integrity | Prevents bias from deleting incomplete records; robust to outlier ratings
Review | Lowercase conversion + "No_Review" flag | Standardize text analysis; explicitly track missing feedback | Enables sentiment analysis while preserving missing data patterns
PurchaseDate | Convert to datetime format | Enable time-based analysis (daily/weekly trends, seasonality) | Facilitates "days_to_delivery" calculations and holiday effect studies
ProductPrice | Binning (Low: $0-50, Medium: $50-200, High: $200+) | Simplify price segmentation for business reporting | Creates actionable categories for marketing strategies
ProductPrice | Z-score normalization | Prepare for machine learning algorithms requiring standardized numerical inputs | Improves model performance for price prediction tasks
ProductCategory | One-hot encoding | Convert categorical variable to ML-friendly numeric format | Enables algorithms to process product types without artificial ordinality
CustomerAge | Generational binning (GenZ: 18-25, Millennial: 26-40, etc.) | Enhance demographic analysis | Reveals age-based purchasing patterns more clearly than raw numbers
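The binning, normalization, and encoding transformations in the table above can be sketched as follows. The rows are invented; the bin edges follow the ranges quoted in this report, and the use of the population standard deviation (`ddof=0`) for the z-score is an assumption.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "ProductPrice": [20.0, 150.0, 400.0, 50.0],
    "ProductCategory": ["Books", "Clothing", "Electronics", "Books"],
    "CustomerAge": [22, 35, 58, 67],
})

# Price binning: Low ($0-50], Medium ($50-200], High (above $200).
df["PriceRange"] = pd.cut(
    df["ProductPrice"],
    bins=[0, 50, 200, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Z-score normalization: subtract the mean, divide by the standard deviation.
price = df["ProductPrice"]
df["Price_zscore"] = (price - price.mean()) / price.std(ddof=0)

# Age group binning (bin edges are right-inclusive).
df["AgeGroup"] = pd.cut(
    df["CustomerAge"],
    bins=[17, 25, 40, 60, 120],
    labels=["18-25", "26-40", "41-60", "60+"],
)

# One-hot encoding: one indicator column per product category.
encoded = pd.get_dummies(df, columns=["ProductCategory"])

print(encoded.head())
```

Because `pd.cut` uses right-inclusive intervals by default, a price of exactly $50 falls in the "Low" bin here.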
Column Name | Transformation/Method Applied | Description/Reason for Transformation | New Column Name (if applicable)
Rating | Median imputation | Replaced missing values with median (3.0) to maintain rating distribution | -
Review | Text standardization + missing flag | Converted to lowercase, stripped whitespace, added "No Review" for missing values | -
ShippingMethod | Missing value categorization | Marked missing entries as "Unknown" to preserve records | -
ProductPrice | Z-score normalization | Standardized for machine learning models | Price_zscore
CustomerAge | Age group binning | Created generational segments (18-25, 26-40, 41-60, 60+) | AgeGroup
● Most algorithms (e.g., regression, clustering) assume that data follows regular
patterns. Outliers can:
● Skew statistical estimates (mean, standard deviation)
● Bias machine learning models towards extremes
● Reduce predictive accuracy
● Example: A $10,000 order may disproportionately inflate "average customer spend."
● Breaking emerging trends (e.g., a sudden demand increase for an uncommon product)
● Operational problems (e.g., a shipping delay for certain regions)
● Customer segments (e.g., luxury consumers with unusual spending patterns)
● Outliers can:
● Shrink the scale of graphs, concealing significant patterns
● Produce deceptive trends in line graphs
● Skew clustering in scatter plots
● Example: A boxplot of product prices becomes illegible if outliers aren't trimmed.
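To illustrate how a single extreme value distorts summary statistics, here is a small sketch with invented order values. The $1,000 cut-off used for trimming is a hypothetical threshold chosen only for this example.

```python
import pandas as pd

# Four ordinary orders plus one extreme $10,000 order (invented values).
orders = pd.Series([40.0, 55.0, 60.0, 75.0, 10000.0])

mean_with_outlier = orders.mean()      # pulled far upward by the extreme order
median_with_outlier = orders.median()  # barely affected by it

# Hypothetical trimming rule: drop orders of $1,000 or more.
trimmed = orders[orders < 1000]
mean_without_outlier = trimmed.mean()

print(mean_with_outlier, median_with_outlier, mean_without_outlier)
```

The mean drops from 2046.0 to 57.5 once the extreme order is removed, while the median (60.0) was already close to a typical order. This is why the median is often preferred for skewed spending data.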
Column Name | Outliers Detected | Method Used to Detect Outliers | Outlier Handling | Rationale for Handling
CustomerAge | Ages < 18 or > 100 (n=3) | Domain knowledge validation | Removed (invalid ages) | Outside plausible shopping age range (18-70 in dataset description)
DiscountApplied | Non-boolean entries (n=5) | Unique value check | Converted to True/False | Ensures consistency for analysis of discount impact
2. Categorical Variables
● ProductCategory indicates Electronics (30%) and Clothing (25%) lead in sales, with
Books (10%) falling behind, perhaps requiring targeted promotions. The
CustomerGender breakdown (55% Female, 40% Male, 5% Other) differs dramatically
by category—Beauty products are 70% female, implying gender-targeted marketing
potential. Geographically, CustomerLocation indicates urban consumers (50%)
generate half the purchases, especially electronics, while rural regions (15%) might
need promotions such as free shipping to increase activity. ShippingMethod analysis
reveals that Express delivery (25% of orders) is associated with increased ratings
(+0.8 stars on average), which indicates a direct relationship between speed and
satisfaction. In contrast, the 15% "Unknown" shipping records deserve examination of
data collection shortcomings. Discounts (DiscountApplied) seem tactical, with 40%
of orders discounted—most deeply in Clothing (60%) and Beauty (50%), probably to
clear stock or entice price-conscious consumers.
3. Temporal Patterns
● The PurchaseDate chart reveals obvious weekly and holiday trends. Weekends
experience a 30% spike in orders, while holiday shopping in November-December
peaks at twice the normal daily volume. These trends require responsive staffing and
inventory management to address foreseeable demand spikes.
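Weekly and monthly patterns of this kind can be derived from PurchaseDate once it is parsed as a datetime. The sketch below uses a handful of invented dates, so it shows the technique only and does not reproduce the percentages reported above.

```python
import pandas as pd

# Invented purchase dates for illustration.
df = pd.DataFrame({
    "PurchaseDate": pd.to_datetime([
        "2023-11-04",  # Saturday
        "2023-11-05",  # Sunday
        "2023-11-06",  # Monday
        "2023-12-24",  # Sunday
        "2023-02-15",  # Wednesday
    ])
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
df["IsWeekend"] = df["PurchaseDate"].dt.dayofweek >= 5
df["Month"] = df["PurchaseDate"].dt.month

weekend_orders = int(df["IsWeekend"].sum())
orders_per_month = df.groupby("Month").size()

print(weekend_orders)
print(orders_per_month)
```

On the full dataset, comparing `orders_per_month` against its own mean would give the seasonal uplift, and the same grouping on `IsWeekend` would give the weekend effect.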
4. Textual Feedback
● Content analysis through word frequency shows favorable words such as "excellent"
(25%) and "good" (20%) prevail, but complaints center on "slow" (8%) delivery and
"broken" (5%) products. The 18% missing reviews are a quiet red flag—usually
meaning disengaged or unhappy customers who didn't take the time to rate their
experience.
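A word-frequency count of this kind can be sketched with pandas string methods and `collections.Counter`. The review texts below are invented, and the simple letters-only tokenizer is an assumption; a real analysis would likely also remove stop words.

```python
from collections import Counter

import pandas as pd

# Invented reviews, including one missing entry.
reviews = pd.Series([
    "Excellent product",
    "good",
    None,
    "Slow delivery, broken box",
    "excellent and good",
])

# Lowercase, split into alphabetic tokens, and flatten to one word per row.
tokens = (
    reviews.dropna()
    .str.lower()
    .str.findall(r"[a-z]+")
    .explode()
)
word_counts = Counter(tokens)

# Share of reviews that are missing altogether.
missing_share = reviews.isna().mean()

print(word_counts.most_common(3))
print(missing_share)
```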
PurchaseDate | 1000 | - | - | - | - | - | - | -
● Low-cost products (< $50) exhibit more varied ratings (1–5 stars), indicating
unpredictable quality perceptions.
● Insight: Price raises expectations; glitches in high-end products will perhaps dismay
buyers more.
● Patterns:
● Electronics: Widespread across ages, but peak among 25–40-year-olds (55% of
purchases).
● Books: Purchased more by customers >50 (35% of book sales compared to 15%
overall).
● Insight: Age-oriented category promotions would increase sales (e.g., tech among
millennials, books among older consumers).
● Price Distribution:
● Urban: Highest avg. price ($158), frequent luxury purchases.
● Rural: Lowest avg. price ($112), prefers budget items.
● Insight: Urban customers might have greater buying power; rural regions require price
competitiveness.
● Seasonality:
● Peaks: November–December (holidays, +120% volume).
● Lows: February–March (-30% vs. average).
● Insight: Seasonal staffing and inventory planning essential for Q4.
● Discounted products have better ratings (avg. 4.1) compared to non-discounted (avg.
3.5) irrespective of price.
● Yet high-priced discounted items (>$200) receive worse ratings (3.2) than low-priced
discounted items, indicating:
● Buyers might read deep discounts on high-end items as a sign of low quality.
● Low-budget items (<$50) on sale receive top ratings (4.3).
● Actionable Takeaway: Restrict deep discounts on high-end products; target
promotions on mid-range products.
● 50% more high-end Beauty products ($100+) are purchased by urban women
compared to rural women.
● Rural men purchase more affordable Electronics (<$100) rather than premium
versions.
● "Other"-gender consumers in suburbs exhibit no price preference in any
category.
● Actionable Takeaway:
Data visualization converts raw data into understandable graphical displays, performing
multiple key functions in analysis:
3. Relationship Exploration
● Correlations: Scatter plots indicate how variables relate (e.g., price vs. rating).
● Comparisons: Bar charts compare groups (e.g., sales by product category).
● Example: A grouped bar chart demonstrates Express shipping generates higher ratings
than Standard.
4. Hypothesis Testing
5. Effective Communication
6. Decision Support
A. Histograms/Density Plots
B. Scatter Plots
C. Boxplots
● What to Look For: Median, IQR, outliers.
● Example Insight:
● Rating boxplot indicates median=4 but low-rated outliers → Certain products
consistently underperform.
● Business Implication: Examine and enhance low-rated products.
D. Bar/Column Charts
E. Heatmaps
F. Line Graphs
Key Takeaways:
● Visuals > Tables: A heatmap uncovers patterns in seconds that a spreadsheet would
take hours to reveal.
● Context Matters: Combine visuals with domain knowledge (e.g., holiday impacts on
sales).
● Iterate: Begin with simple charts (histograms), then move to multivariate plots (facet
grids).
Chart Type | Column(s) Analyzed | Reason for Choosing | Insights Gained
Heatmap | ShippingMethod vs. Rating | Visualize intensity of relationships between categorical variables | Express shipping correlates with higher ratings (avg 4.2 vs 3.6 for Standard)

Insight | Data/Chart Supporting Insight | Actionable Conclusion/Recommendation
Express shipping correlates with higher satisfaction | Heatmap (ShippingMethod × Rating); Avg rating: Express=4.2 vs Standard=3.6 | 1. Invest in faster delivery infrastructure 2. Offer expedited shipping upgrades at checkout
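The numbers behind a ShippingMethod × Rating heatmap can be computed with a group mean and a cross-tabulation, as sketched below. The rows are invented, so the averages here differ from the 4.2 and 3.6 reported above.

```python
import pandas as pd

# Invented orders for illustration.
df = pd.DataFrame({
    "ShippingMethod": ["Express", "Express", "Standard", "Standard", "Standard"],
    "Rating": [5, 4, 3, 4, 2],
})

# Average rating per shipping method.
avg_rating = df.groupby("ShippingMethod")["Rating"].mean()

# Count of orders for each (shipping method, rating) pair.
rating_table = pd.crosstab(df["ShippingMethod"], df["Rating"])

print(avg_rating)
print(rating_table)
```

A table like `rating_table` is the kind of matrix that can be passed to `sns.heatmap(rating_table, annot=True)` to draw the chart.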
9. Summary
Function Name | Syntax Example | Description
pd.read_csv() | df = pd.read_csv('online_shopping_data.csv') | Loads dataset from CSV file into Pandas DataFrame
df.isnull().sum() | missing = df.isnull().sum() | Counts missing values per column
… | … | … value (median here)
pd.to_datetime() | df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate']) | Converts string dates to datetime format
sns.boxplot() | sns.boxplot(data=df, x='ProductCategory', y='Rating') | Creates boxplot to compare distributions across categories
plt.scatter() | plt.scatter(df['ProductPrice'], df['Rating']) | Generates scatter plot to visualize relationship between two variables
df.corr() | corr_matrix = df[['ProductPrice','Rating']].corr() | Calculates Pearson correlation coefficient matrix
pd.cut() | df['PriceRange'] = pd.cut(df['ProductPrice'], bins=[0,50,200,500]) | Bins continuous values into categories (Low/Med/High)
sns.heatmap() | sns.heatmap(corr_matrix, annot=True) | Visualizes correlation matrix with annotated values
… | … | … for multidimensional analysis
df.describe() | df.describe() | Generates descriptive statistics (mean, std, min/max etc.)
sns.countplot() | sns.countplot(data=df, x='ProductCategory') | Shows frequency distribution of categorical variables
pd.get_dummies() | pd.get_dummies(df['CustomerGender']) | Converts categorical variables to dummy/indicator variables
df['col'].value_counts() | df['ShippingMethod'].value_counts() | Counts unique values in categorical column
sns.lineplot() | sns.lineplot(x='Month', y='Sales') | Creates time series plot to show trends
stats.ttest_ind() | stats.ttest_ind(group1, group2) | Performs t-test to compare means of two groups
Part II
Case Study: Lab Program 9 - Movie Ratings and Reviews
1. Introduction to dataset
This case study explores user ratings of movies, using a dataset containing information
across three tables:
Goal: Perform exploratory data analysis (EDA) to answer 8 key questions and gain insights
into viewer preferences and trends.
a) movies table:
b) users table:
userId | gender | age | occupation | zip-code
1 | M | 24 | engineer | 85711
c) ratings table:
userId | movieId | rating | timestamp
1 | 1 | 4.0 | 964982703
1 | 3 | 4.0 | 964981247
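import pandas as pd

# Load data
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

# Inspect structure, data types, and nulls (info() prints its summary directly)
movies.info()
ratings.info()

# Convert timestamp (Unix seconds) to a readable datetime
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Merge datasets (assuming both tables share a movieId column)
df = pd.merge(ratings, movies, on="movieId")
df.head()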
Explanation:
3. Convert Timestamp to Readable Date Format
● To bring movie titles and genres into the same table as ratings.
● Simplifies filtering and grouping operations during EDA.
Step | Purpose
Check info() | Understand structure, data types, nulls
4. Case Study: Lab Program 9 - Movie Ratings and Reviews
1. Introduction to dataset
4. EDA insights for all questions (refer to syllabus copy) using charts and
short explanations. Each Question (Q1 to Q8) must include:
2. Key takeaways/limitations
1. Introduction to Dataset
This dataset contains information about Kannada movies, with attributes such as
movie title, genre, release year, ratings, review count, box office earnings, and the
platforms where the movies are available for streaming. This data will be analyzed to
gain insights regarding trends, ratings, and their relationships with other variables.
CODE:
import pandas as pd
import numpy as np
import random
genres = [
    "Romance", "Romance", "Psychological Thriller", "Action", "Action",
    "Action", "Thriller", "Comedy", "Action", "Thriller",
    "Romantic Comedy", "Drama", "Comedy", "Drama", "Action",
    "Romance", "Action", "Romance", "Comedy", "Fantasy",
    "Romance"
]
directors = [
    "Hemanth Rao", "Darling Krishna", "Adarsh Eshwarappa", "Santhosh Ananddram", "Tharun Sudhir",
    "Nandakishore", "Shankar Raj", "Pannaga Bharana", "Chethan Kumar", "Suni",
    "Suni", "Rohit Padaki", "Nandakishore", "Girish Kasaravalli", "Narayan",
    "Suraj Gowda", "Sudeep", "Prem", "Yograj Bhat", "Nagendra Prasad",
    "Suresh Kumar"
]
df_movies = pd.DataFrame({
'movie_id': range(1, len(movie_titles) + 1),
'title': movie_titles,
'genre': genres,
'director': directors,
'release_year': release_years
})
df_movies.to_csv('kannada_movies_v2.csv', index=False)
})
df_users.to_csv('kannada_users_v2.csv', index=False)
ratings_data = []
ratings_data.append({
'review_id': i,
'user_id': user_id,
'movie_id': movie_id,
'rating': rating,
'review_text': review,
})
df_reviews = pd.DataFrame(ratings_data)
df_reviews.to_csv('kannada_reviews_v2.csv', index=False)
plt.ylabel('Rating')
plt.show()
<ipython-input-13-55290d4e7878>:17: FutureWarning:
observed=True to adopt the future default and silence this warning.
genre_age_avg = df_genre_age.groupby(['age_group', 'genre'])
['rating'].mean().reset_index()
OUTPUT:
<ipython-input-19-d25334c2ad32>:13: FutureWarning:
OUTPUT:
palette='coolwarm')
plt.title("Average Ratings by Director")
plt.xlabel('Average Rating')
plt.ylabel('Director')
plt.xlim(0, 10) # Assuming a 1–10 rating scale
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
<ipython-input-21-e0cfb6c9e74b>:16: FutureWarning:
OUTPUT:
<ipython-input-22-019ade77b24f>:6: FutureWarning:
removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.
sns.barplot(x=gender_distribution.index,
y=gender_distribution.values, palette='muted')
OUTPUT:
plt.xlabel('Gender')
plt.ylabel('Average Rating')
plt.ylim(0, 10) # Assuming rating scale is 1 to 10
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
<ipython-input-23-db4c071adf52>:12: FutureWarning:
plt.legend(title="Rating")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
OUTPUT:
OUTPUT:
Key takeaways/Limitations
Key Takeaways
1. Data Cleaning is Important
Cleaning data is a necessary step before analysis. If your data is messy or wrong,
the results you get from analyzing it can be misleading.
2. Pandas Helps Clean Data
The Python library pandas has many useful tools to clean data. It helps you:
o Fill or remove missing values
o Delete duplicate rows
o Fix data types (like changing text to numbers)
3. Better Data Quality
By using pandas to clean data properly, your data becomes more accurate and
easier to analyze, helping you make better decisions.
Limitations:
1. High Memory Use
Pandas loads the whole dataset into your computer’s memory. If the dataset is
too big, this can slow things down or crash your system.
2. Slower with Big Data
With very large datasets, pandas can become slow and may not handle tasks
efficiently.
3. No Built-In Parallel Processing
Pandas doesn’t naturally work in parallel (using multiple CPU cores), so it’s not
the best choice for extremely large or complex tasks.
4. Mixed Data Types Can Be a Problem
If a column contains different types of data (like numbers and text), pandas may
label it as "object", which can make processing that data harder.
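The mixed-type limitation can be seen in a short sketch. The values below are invented; `pd.to_numeric` with `errors="coerce"` is one common way to recover a numeric column, at the cost of turning unparseable entries into NaN.

```python
import pandas as pd

# A column that mixes numeric strings, free text, and a real number (invented values).
raw = pd.Series(["19.99", "25", "unknown", 40])
print(raw.dtype)  # pandas falls back to the generic "object" dtype

# Coerce to numbers; entries that cannot be parsed become NaN.
cleaned = pd.to_numeric(raw, errors="coerce")
unparseable = int(cleaned.isna().sum())

print(cleaned.dtype)
print(unparseable)
```

After coercion, the column supports arithmetic again, and the count of NaN values shows how many entries need manual review.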