Repository using SQL and Python to perform a simple analysis of the kaggle "Brazilian E-Commerce Public Dataset by Olist"

hbeltrao/Olist-EDA-Project

Diving into Brazilian E-Commerce Public Dataset by Olist using SQL and Python

In this project we dive into the Olist E-Commerce database and perform a series of analyses to extract business-related information such as:

  • Order shipping performance:
    • delays in shipping or delivery;
    • freight values;
  • Order review analysis:
    • sentiment analysis;
    • review score vs order value and delivery delays;
  • Customer profile:
    • customer distribution by city and state;
    • average order ticket per customer per region;
  • Purchase analysis:
    • orders per city-state;
    • payment analysis;
  • Sellers performance:
    • top sellers and best reviewed;
  • Product analysis:
    • products with most orders;
    • top products by sales and reviews;

Database Context and structure

The database used in this project was published by Olist, a Brazilian e-commerce platform. It contains data on orders placed on the platform between 2016 and 2018 and is available on Kaggle at the following link: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce

All data has been pre-processed, cleaned and anonymised by Olist before publishing the database, and references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses.

The data is distributed across 8 tables, connected according to the data schema shown in Figure-1:

  1. olist_customers_dataset
  2. olist_geolocation_dataset
  3. olist_order_items_dataset
  4. olist_order_payments_dataset
  5. olist_order_reviews_dataset
  6. olist_orders_dataset
  7. olist_products_dataset
  8. olist_sellers_dataset

Figure-1

Methodology

For each topic of the introduction, we will try to use the following Data Analysis Framework:

Figure-2

All data was stored in a MySQL database and accessed via SQL queries run directly from Python code in the Jupyter notebooks.
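As an illustration of this query-from-Python workflow, here is a minimal sketch. It is not the project's actual code: an in-memory SQLite database stands in for the MySQL server so the example runs on its own, the sample rows are made up, and the helper name `customers_per_state` is hypothetical. The table and column names follow the Kaggle dataset.

```python
import sqlite3

# Assumption: SQLite in memory replaces the project's MySQL database
# so that this sketch needs no running server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE olist_customers_dataset (customer_id TEXT, customer_state TEXT)"
)
conn.executemany(
    "INSERT INTO olist_customers_dataset VALUES (?, ?)",
    [("c1", "SP"), ("c2", "SP"), ("c3", "RJ")],  # made-up sample rows
)


def customers_per_state(connection):
    """Return (state, customer_count) rows, largest count first."""
    query = """
        SELECT customer_state, COUNT(*) AS customers
        FROM olist_customers_dataset
        GROUP BY customer_state
        ORDER BY customers DESC, customer_state
    """
    return connection.execute(query).fetchall()


print(customers_per_state(conn))
```

In a notebook, the same query string could instead be passed to `pandas.read_sql_query` together with a connection to the MySQL database, which returns the result as a DataFrame ready for plotting.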

Anonymization was already done by the dataset owner, so this step was skipped.

No extra data was necessary to perform the analysis in this project.

Each process of cleaning, filtering and transforming is explained inside each notebook alongside its code.

The main tools and libraries used were:

  • MySQL Workbench (develop and test all queries);
  • Jupyter Notebook (document all steps and run all code);
  • Pandas, Matplotlib and Seaborn (data manipulation and graph creation).

Customers and sellers demography

Before exploring operational characteristics such as delivery performance, top-selling products, order reviews and the impact of delays on customer satisfaction, it is better to get an overview of how customers and sellers are distributed. This gives a better understanding of Olist's commercial positioning and helps analyze possible improvements to the logistics strategy that could increase market share and customer satisfaction, based on operational performance and customer feedback.

The demographic distribution of customers, sellers, orders and revenue is shown below:

Order shipping and delivery performance

As one of the biggest challenges of e-commerce (if not the biggest), delivery time and cost have a large impact on any online company's performance.

With that in mind, let's explore the performance of the platform's shipping and delivery process by answering the following questions:

1. How many orders were shipped late?
2. Which regions have the most delays?
3. Are there any sellers with a high delay rate?
4. How many orders were delivered late?
5. How does shipping delay impact delivery performance?
6. Is there any relation between delivery delay and the distance between sellers and customers?
7. How does freight value behave across states?
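Questions 1 and 4 both come down to comparing a timestamp against a deadline. The sketch below shows one possible way to do that, under the assumption that an order counts as delivered late when `order_delivered_customer_date` falls after `order_estimated_delivery_date` (both are columns of `olist_orders_dataset`). The notebook may define delay differently. The helper names and the sample rows are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the dataset's CSV files


def days_late(actual, deadline):
    """Days by which `actual` exceeds `deadline` (negative means early).

    Returns None when either timestamp is missing, e.g. undelivered orders.
    """
    if not actual or not deadline:
        return None
    delta = datetime.strptime(actual, FMT) - datetime.strptime(deadline, FMT)
    return delta.total_seconds() / 86400


def late_delivery_rate(orders):
    """Share of orders with known dates that arrived after the estimate."""
    lateness = [
        days_late(o["order_delivered_customer_date"],
                  o["order_estimated_delivery_date"])
        for o in orders
    ]
    known = [d for d in lateness if d is not None]
    return sum(d > 0 for d in known) / len(known) if known else 0.0


# Made-up sample rows, for illustration only.
orders = [
    {"order_delivered_customer_date": "2018-01-10 12:00:00",
     "order_estimated_delivery_date": "2018-01-08 00:00:00"},
    {"order_delivered_customer_date": "2018-02-01 09:00:00",
     "order_estimated_delivery_date": "2018-02-05 00:00:00"},
    {"order_delivered_customer_date": None,
     "order_estimated_delivery_date": "2018-03-01 00:00:00"},
]
print(late_delivery_rate(orders))
```

The same `days_late` helper could be applied to shipping delay by comparing `order_delivered_carrier_date` with the `shipping_limit_date` column of `olist_order_items_dataset`, again as an assumed definition.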

The Jupyter notebook "Order shipping and delivery performance" walks through these questions, with answers and insights.
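For question 6, a distance between seller and customer has to be estimated first. One common choice, assumed here and not confirmed as the notebook's method, is the haversine great-circle distance between the latitude and longitude pairs that `olist_geolocation_dataset` gives for each zip code prefix. The function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates of São Paulo and Rio de Janeiro, for illustration.
print(round(haversine_km(-23.55, -46.63, -22.91, -43.17)))
```

This measures straight-line distance over the Earth's surface, so it understates actual road distance, but it is usually adequate for checking whether delays grow with distance.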

License

GNU General Public License version 3

Footer

Leave a star on GitHub, give a clap on Medium and share this guide if you found it helpful.
