In this project we will dive into the Olist E-Commerce database and perform a series of analyses to extract business-related information such as:
- Order shipping performance:
  - delays in shipment or delivery;
  - freight values;
- Order review analysis:
  - sentiment analysis;
  - review score vs. order value and delivery delays;
- Customer profile:
  - customer distribution by city and state;
  - average order ticket per customer per region;
- Purchase analysis:
  - orders per city-state;
  - payment analysis;
- Seller performance:
  - top sellers and best reviewed;
- Product analysis:
  - products with most orders;
  - top products by sales and reviews.
The database used in this project was published by Olist, a Brazilian e-commerce platform, and contains data from orders made on the platform between 2016 and 2018. It is available on Kaggle at the following link: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
All data has been pre-processed, cleaned and anonymised by Olist before publishing the database, and references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses.
The data is distributed across 8 tables, connected according to the data schema shown in Figure 1:
- olist_customers_dataset
- olist_geolocation_dataset
- olist_order_items_dataset
- olist_order_payments_dataset
- olist_order_reviews_dataset
- olist_orders_dataset
- olist_products_dataset
- olist_sellers_dataset
For each topic of the introduction, we will try to use the following Data Analysis Framework:
All data was stored in a MySQL database and accessed via SQL queries run directly from Python code in the Jupyter notebooks.
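As a minimal sketch of this access pattern, the snippet below runs a SQL aggregate from Python and collects the result. It is not the project's actual code: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for the MySQL connection so that it runs anywhere, and the sample rows are made up. The table name `olist_orders_dataset` comes from the table list above; the columns `order_id` and `order_status` and the helper `orders_per_status` are assumed here for illustration.

```python
import sqlite3

# Stand-in for the MySQL connection (assumption: the notebooks would use a
# MySQL driver instead; sqlite3 keeps this sketch self-contained).
conn = sqlite3.connect(":memory:")

# Tiny made-up sample; the column names are assumed for illustration.
conn.execute("CREATE TABLE olist_orders_dataset (order_id TEXT, order_status TEXT)")
conn.executemany(
    "INSERT INTO olist_orders_dataset VALUES (?, ?)",
    [("o1", "delivered"), ("o2", "delivered"), ("o3", "canceled")],
)


def orders_per_status(connection):
    """Return {order_status: number of orders} computed by a SQL aggregate."""
    query = """
        SELECT order_status, COUNT(*) AS n_orders
        FROM olist_orders_dataset
        GROUP BY order_status
        ORDER BY n_orders DESC
    """
    return dict(connection.execute(query).fetchall())


status_counts = orders_per_status(conn)
print(status_counts)
```

In a notebook, the same query string could instead be passed to `pandas.read_sql` together with a SQLAlchemy engine or connection to obtain a DataFrame directly.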
Anonymization was already done by the dataset owner, so this step was skipped.
No extra data was necessary to perform the analysis in this project.
Each cleaning, filtering and transformation step is explained in the corresponding notebook alongside its code.
The main tools and libraries used were:
- MySQL Workbench (develop and test all queries);
- Jupyter Notebook (document all steps and run all code);
- Pandas, Matplotlib and Seaborn (data manipulation and graph creation).
Before exploring operational performance characteristics such as delivery performance, top-selling products, order reviews and the impact of delays on customer satisfaction, it is better to get an overview of how customers and sellers are distributed. This gives a better understanding of Olist's commercial positioning and helps identify possible improvements in logistics strategy that could increase market share and customer satisfaction, based on operational performance and customer feedback.
The demographic distribution of customers, sellers, orders and revenue is shown below:
As one of the biggest challenges of e-commerce (if not the biggest), delivery time and cost strongly affect any online company's performance.
With that in mind, let's explore the performance of the platform's shipping and delivery process by answering the following questions:
1. How many orders were shipped late?
2. Which regions have the most delays?
3. Are there any sellers with a high delay rate?
4. How many orders were delivered late?
5. How does shipping delay impact delivery performance?
6. Is there any relation between delivery delay and the distance between sellers and customers?
7. How does freight value vary among states?
The Jupyter notebook "Order shipping and delivery performance" walks through these questions, with answers and insights.
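Questions 1, 4 and 6 reduce to simple per-order computations, which can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: an order is treated as shipped late when the carrier hand-over date is after the shipping limit date, and delivered late when the delivery date is after the estimated delivery date. The seller-to-customer distance is approximated with the haversine great-circle formula. All function names, field names and sample values are hypothetical.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def is_late(actual, limit):
    """True when the actual date is after the limit; None if either is missing."""
    if actual is None or limit is None:
        return None
    return actual > limit


def delay_rate(flags):
    """Share of late orders among those with a known flag (None is ignored)."""
    known = [f for f in flags if f is not None]
    return sum(known) / len(known) if known else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Made-up sample orders (hypothetical field names and values).
orders = [
    {"shipped": datetime(2018, 1, 5), "ship_limit": datetime(2018, 1, 4),
     "delivered": datetime(2018, 1, 20), "estimated": datetime(2018, 1, 15)},
    {"shipped": datetime(2018, 2, 1), "ship_limit": datetime(2018, 2, 3),
     "delivered": datetime(2018, 2, 8), "estimated": datetime(2018, 2, 12)},
    {"shipped": datetime(2018, 3, 1), "ship_limit": datetime(2018, 3, 2),
     "delivered": None, "estimated": datetime(2018, 3, 10)},  # not delivered yet
]

shipping_flags = [is_late(o["shipped"], o["ship_limit"]) for o in orders]
delivery_flags = [is_late(o["delivered"], o["estimated"]) for o in orders]
shipping_delay_rate = delay_rate(shipping_flags)
delivery_delay_rate = delay_rate(delivery_flags)

# Approximate coordinates of São Paulo and Rio de Janeiro, for illustration only.
distance_km = haversine_km(-23.55, -46.63, -22.91, -43.17)
```

Orders without a delivery date are excluded from the delivery delay rate rather than counted as on time; whether that is the right choice depends on how undelivered orders should be treated in the analysis.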
GNU General Public License version 3
Leave a star on GitHub, give a clap on Medium and share this guide if you found it helpful.


