
Amazon Product Data Analysis Report Final

The Amazon Product Data Analysis Report details the cleaning, exploratory data analysis, and visualizations of an Amazon product dataset. Key findings include an average discount of 56.6%, a mean product rating of 4.09, and a weak correlation between ratings and review counts. Machine learning models were built for regression and classification, with the classification model achieving an accuracy of approximately 76.4% in predicting product ratings.

Uploaded by

ashleshpai12

Amazon Product Data Analysis Report

This report presents a comprehensive data analysis of the provided Amazon
product dataset. The analysis includes data cleaning, exploratory data analysis
(EDA), and visualizations to derive key insights about product pricing, ratings,
and categories.

1. Data Cleaning and Preparation


The raw dataset contained several columns with mixed data types and special
characters, which needed to be cleaned for analysis.
• The discounted_price, actual_price, rating, and rating_count columns
were initially stored as strings.
• These columns were cleaned by removing symbols like '₹', '%', and ','
before converting them to a numeric data type (float or integer).
• Any rows with missing rating values were removed, as this is a critical
column for product analysis.
• Duplicate product entries were identified and removed to ensure data
integrity.
• A new column, discount_amount, was created to represent the absolute
discount value for each product, calculated as the difference between
actual_price and discounted_price.
• Code Snippet:

This process resulted in a clean and structured dataset ready for analysis.
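
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration in pandas, not the report's original snippet: the sample rows are made up, and the `product_id` column used to detect duplicates and the `clean_products` helper name are assumptions.

```python
import pandas as pd

# Tiny stand-in for the raw data; all values are made up for illustration.
raw = pd.DataFrame({
    "product_id": ["A1", "A2", "A2", "A3"],  # assumed duplicate key
    "discounted_price": ["₹399", "₹1,299", "₹1,299", "₹149"],
    "actual_price": ["₹1,099", "₹2,999", "₹2,999", "₹149"],
    "discount_percentage": ["64%", "57%", "57%", "0%"],
    "rating": ["4.2", "4.0", "4.0", None],
    "rating_count": ["24,269", "1,532", "1,532", "12"],
})

def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps in the text."""
    df = df.copy()
    numeric_cols = ["discounted_price", "actual_price",
                    "discount_percentage", "rating", "rating_count"]
    for col in numeric_cols:
        # Strip currency, percent and thousands symbols, then convert.
        stripped = df[col].astype(str).str.replace(r"[₹%,]", "", regex=True)
        # Anything that still is not a number (e.g. missing) becomes NaN.
        df[col] = pd.to_numeric(stripped, errors="coerce")
    df = df.dropna(subset=["rating"])             # drop rows without a rating
    df = df.drop_duplicates(subset="product_id")  # drop duplicate products
    df["discount_amount"] = df["actual_price"] - df["discounted_price"]
    return df.reset_index(drop=True)

clean = clean_products(raw)
print(clean)
```

Using `errors="coerce"` turns unparseable values into NaN instead of raising, so they can be dropped together with the genuinely missing ratings.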

2. Exploratory Data Analysis (EDA)


Here are the key descriptive statistics and insights from the cleaned data:

Metric  discounted_price (₹)  actual_price (₹)  discount_percentage (%)   rating  rating_count
Count                1435.00           1435.00                  1435.00  1435.00       1435.00
Mean                 1374.96           3217.43                    56.59     4.09      19637.38
Median                449.00           1000.00                    60.00     4.10       5064.00
Max                 77990.00         129900.00                    94.00     5.00     426973.00
Min                    60.00             60.00                     0.00     2.00          0.00

Key Findings:
• The average discount across products is approximately 56.6%, indicating
that many items are sold at significantly reduced prices.
• The mean product rating is 4.09, suggesting generally positive
customer satisfaction.
• The rating count has a large range, with a few products having an
extremely high number of reviews (max: 426,973). This indicates a
long-tailed distribution where a few popular products dominate the
review landscape.
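
A summary table of this shape could be produced along the following lines. The sketch assumes a cleaned pandas DataFrame with the column names shown in the table; the four sample rows are made up and do not reproduce the report's figures.

```python
import pandas as pd

# Made-up cleaned rows; only the column names follow the report.
clean = pd.DataFrame({
    "discounted_price": [399.0, 1299.0, 149.0, 8999.0],
    "actual_price": [1099.0, 2999.0, 149.0, 12999.0],
    "discount_percentage": [64.0, 57.0, 0.0, 31.0],
    "rating": [4.2, 4.0, 3.8, 4.5],
    "rating_count": [24269.0, 1532.0, 12.0, 426973.0],
})

# One row per statistic, one column per variable.
summary = clean.agg(["count", "mean", "median", "max", "min"]).round(2)
print(summary)

# A mean far above the median is a quick hint of a long right tail.
tail_ratio = clean["rating_count"].mean() / clean["rating_count"].median()
print(f"rating_count mean/median ratio: {tail_ratio:.2f}")
```

In the report's own table the same ratio is 19637.38 / 5064.00, or roughly 3.9, which is consistent with the long-tailed reading of the review counts.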

3. Visualizations
The following plots provide a deeper look into the data, revealing trends and
relationships between different variables.
1. Distribution of Product Ratings
This histogram shows that the majority of products have a high rating,
clustering between 4.0 and 4.5. This indicates that most products on the
platform are well-received by customers.
2. Distribution of Discount Percentages
The distribution of discounts peaks prominently around 50% to 70% and has a
longer tail toward low discounts (left-skewed: the mean of 56.59% sits below the
median of 60.00%). This confirms the high discount strategy observed in the
descriptive statistics and suggests that products are frequently offered at a
deep markdown.
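
One way to check the direction of skew numerically is to compare the mean with the median and to compute the sample skewness. The sketch below uses made-up discount values, not the report's data; a negative skewness means the longer tail points toward low discounts, a positive one toward high discounts.

```python
import pandas as pd

# Made-up discount percentages with a peak between 50% and 70%.
discounts = pd.Series(
    [5, 20, 35, 50, 55, 60, 60, 65, 65, 70, 70, 75], dtype=float)

mean = discounts.mean()
median = discounts.median()
skewness = discounts.skew()  # sample skewness

# Histogram counts in 10-point bins: [0, 10), [10, 20), ...
bins = pd.cut(discounts, bins=range(0, 101, 10), right=False)
histogram = bins.value_counts().sort_index()

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
print(histogram)
```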
3. Relationship between Rating and Rating Count
This scatter plot, with a logarithmic scale on the x-axis, reveals a weak positive
correlation between rating and rating count. While highly rated products
(above 4.0) tend to have more reviews, there are also many products with high
ratings and very few reviews. This suggests that while a high rating is often a
sign of a popular product, it is not the sole determinant of its review count.
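
The strength of such a relationship can be quantified with a correlation coefficient. The sketch below is an illustration on made-up values, assuming a Pearson correlation computed against log-scaled counts to match the logarithmic axis; the report does not state which coefficient it used.

```python
import numpy as np

# Made-up ratings and review counts for eight products.
rating = np.array([4.2, 4.0, 3.9, 4.4, 3.5, 4.1, 4.3, 3.8])
rating_count = np.array([24269, 1532, 12, 94363, 0, 7928, 310, 5064])

# Adding 1 before the log keeps products with zero reviews defined.
log_count = np.log10(rating_count + 1)

r_raw = np.corrcoef(rating, rating_count)[0, 1]
r_log = np.corrcoef(rating, log_count)[0, 1]
print(f"Pearson r (raw counts): {r_raw:.2f}")
print(f"Pearson r (log counts): {r_log:.2f}")
```

The `+ 1` offset matters here because the summary statistics show a minimum rating count of 0, for which the logarithm is otherwise undefined.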
4. Relationship between Actual Price and Discounted Price
This scatter plot shows a clear linear relationship between the actual price and
the discounted price. This pattern suggests a consistent pricing strategy where
discounts are applied proportionally across different price points, with
higher-priced products receiving larger discounts in absolute terms.
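
A proportional discount implies discounted_price = actual_price × (1 − d), which is a straight line through the origin with slope 1 − d. The short sketch below works through this with a flat discount rate of 60%, a value chosen for illustration only, and made-up prices.

```python
import numpy as np

# Made-up list prices and an assumed flat 60% discount rate.
actual_price = np.array([500.0, 1000.0, 5000.0, 20000.0])
discount_rate = 0.60

discounted_price = actual_price * (1 - discount_rate)
discount_amount = actual_price - discounted_price

# Fit a straight line: discounted_price ~ slope * actual_price + intercept.
slope, intercept = np.polyfit(actual_price, discounted_price, deg=1)

print("discount amounts:", discount_amount)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope comes out at 1 − 0.60 = 0.40, while the absolute discount grows with the list price, from ₹300 on a ₹500 item to ₹12,000 on a ₹20,000 item.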
5. Top 10 Product Categories
The bar chart highlights the most prevalent product categories in the dataset.
"Electronics" is the most common, followed by "Computers & Accessories"
and "Home & Kitchen". This gives a clear overview of the types of products
included in the dataset, with a strong emphasis on technology and home goods.
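
Category counts of this kind can be obtained roughly as shown below. The sketch assumes the raw data holds a pipe-separated `category` path whose first element is the main category; that column layout and the sample rows are assumptions, not details given in the report.

```python
import pandas as pd

# Made-up rows with an assumed pipe-separated category path.
df = pd.DataFrame({"category": [
    "Electronics|Mobiles&Accessories|Cables",
    "Electronics|HomeTheater,TV&Video|Televisions",
    "Computers&Accessories|Accessories&Peripherals|Mice",
    "Home&Kitchen|Kitchen&HomeAppliances|Kettles",
    "Electronics|Headphones|In-Ear",
    "Computers&Accessories|Accessories&Peripherals|Keyboards",
]})

# Keep only the top-level category, then count and take the ten largest.
df["main_category"] = df["category"].str.split("|").str[0]
top_categories = df["main_category"].value_counts().head(10)
print(top_categories)
```

Calling `top_categories.plot(kind="bar")` would turn these counts into a bar chart, provided matplotlib is installed.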

Graphical Analysis:
4. Machine Learning – Model Building
(a) Regression Attempt (SVR)
• Built a pipeline with preprocessing (scaling + one-hot encoding) and
Support Vector Regression.
• Evaluated model with R² score and RMSE.
• The R² score was low (~0.1), meaning the model explained only about 10%
of the variance in the target and was not very predictive.
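
A pipeline of this kind might look like the sketch below. It is an illustration under assumptions, not the report's code: the data is random, rating is taken as the regression target, and the feature list, RBF kernel and 80/20 split are all choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: purely random, so no real signal is expected.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "discounted_price": rng.uniform(60, 5000, n),
    "discount_percentage": rng.uniform(0, 94, n),
    "rating_count": rng.integers(0, 50000, n).astype(float),
    "main_category": rng.choice(
        ["Electronics", "Computers&Accessories", "Home&Kitchen"], n),
    "rating": np.round(rng.uniform(2.0, 5.0, n), 1),
})

numeric = ["discounted_price", "discount_percentage", "rating_count"]
categorical = ["main_category"]

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("preprocess", preprocess), ("svr", SVR(kernel="rbf"))])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["rating"], test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.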
(b) Classification Approach
• Converted rating into a binary label:
o good_product = 1 if rating ≥ 4.0, else 0.
• Selected features: discounted_price, actual_price, discount_percentage,
rating_count, main_category.
• Built a pipeline with preprocessing + Random Forest Classifier.
• Split data into train/test sets.
• Trained the model and evaluated it using:
o Accuracy
o Precision, Recall, F1-score (via classification_report).
These metrics measure how well the model classifies products as good
(rating ≥ 4.0) or bad (rating < 4.0).
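
The classification steps above could be sketched as follows. The feature names and the 4.0 threshold come from the report; the random data, the scaling step, the 80/20 stratified split and the forest settings are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: random values, so scores will be near chance.
rng = np.random.default_rng(0)
n = 400
actual = rng.uniform(60, 5000, n)
pct = rng.uniform(0, 94, n)
df = pd.DataFrame({
    "discounted_price": actual * (1 - pct / 100),
    "actual_price": actual,
    "discount_percentage": pct,
    "rating_count": rng.integers(0, 50000, n).astype(float),
    "main_category": rng.choice(
        ["Electronics", "Computers&Accessories", "Home&Kitchen"], n),
    "rating": np.round(rng.uniform(2.0, 5.0, n), 1),
})

# Binary label from the report: 1 if rating >= 4.0, else 0.
df["good_product"] = (df["rating"] >= 4.0).astype(int)

numeric = ["discounted_price", "actual_price",
           "discount_percentage", "rating_count"]
categorical = ["main_category"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Stratifying keeps the good/bad ratio similar in both splits (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["good_product"],
    test_size=0.2, random_state=42, stratify=df["good_product"])
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, pred, zero_division=0))
```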

Output:
Classification Performance:
Accuracy: 0.7642857142857142
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.31      0.41        75
           1       0.79      0.93      0.85       205

    accuracy                           0.76       280
   macro avg       0.70      0.62      0.63       280
weighted avg       0.74      0.76      0.73       280

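Results:
o Accuracy score → proportion of correct predictions: about 76.4% of the
280 test products, only slightly above the 73.2% (205/280) that always
predicting the majority "good" class would achieve.
o Classification report → detailed performance per class (precision,
recall, F1): the model finds most good products (recall 0.93) but only
about a third of the bad ones (recall 0.31).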
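
To make the metric definitions concrete, the sketch below recomputes them from a confusion matrix. The four counts are not stated in the report; they are inferred here as the integer counts consistent with the reported supports (75 and 205), accuracy and per-class scores, so treat them as an assumption.

```python
# Inferred confusion-matrix counts (an assumption, see the text above).
tn, fp = 23, 52    # actual class 0 (support 75): predicted 0 / predicted 1
fn, tp = 14, 191   # actual class 1 (support 205): predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total               # 214 / 280
majority_baseline = (fn + tp) / total      # always predicting class 1

def prf(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class treated as the positive label."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p0, r0, f0 = prf(tn, fn, fp)   # class 0 as the positive label
p1, r1, f1 = prf(tp, fp, fn)   # class 1 as the positive label

macro_f1 = (f0 + f1) / 2                         # unweighted mean
weighted_f1 = ((tn + fp) * f0 + (fn + tp) * f1) / total  # support-weighted

print(f"accuracy={accuracy:.4f}, baseline={majority_baseline:.4f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1:.2f}")
print(f"macro f1={macro_f1:.2f}, weighted f1={weighted_f1:.2f}")
```

The gap between the macro and weighted F1 scores reflects the class imbalance: the weighted average is pulled toward the better-performing majority class.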
