This project involves cleaning and preparing the 2024 layoffs dataset for analysis using Python, SQL, and Power BI. After cleaning the raw data, various visualizations were created in Power BI to analyze trends in layoffs by year, industry, location, and more.
- Project Overview
- Objective
- Technologies Used
- Dataset Description
- Data Cleaning Process
- Power BI Visualizations
- Code Explanation
- Results
- How to Run the Project
- Credits
- License
This project focuses on preparing and visualizing the 2024 layoffs dataset. After cleaning the raw data by addressing issues like missing values, duplicates, and inconsistent formats, I used Power BI to create various visualizations that uncover insights from the data. This includes a line chart to show layoffs by year and industry, a map chart for layoffs by country, and a summary card to visualize the total number of layoffs.
The goal of this project is to:
- Clean and prepare the 2024 layoffs dataset for analysis.
- Create meaningful visualizations to help stakeholders understand trends and patterns in mass layoffs.
- Python (Pandas, NumPy)
- SQL (SQLite)
- Power BI (for visualizations)
- Excel (for initial inspection and analysis)
The dataset contains information about layoffs from 2020 to 2024, scraped from Layoffs.fyi. The purpose of the dataset is to enable the analysis of recent mass layoffs and discover patterns in the layoffs across industries and countries.
- Data Source: Layoffs.fyi
- Credits: Roger Lee
The data cleaning process involved several key steps:
- Removing Duplicates: Duplicate rows were removed to ensure data integrity.
- Imputing Missing Values:
  - Numeric columns: Missing values were replaced with the column mean, rounded to two decimal places.
  - Categorical columns: Missing values were filled with the mode (most frequent value) of each column.
- Normalizing Data Formats:
  - Text columns were converted to lowercase for consistency.
  - Date columns were standardized to the 'YYYY-MM-DD' format.
- Data Type Standardization: Ensured that all columns had the correct data types for analysis.
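The order of these steps matters when a numeric column arrives as text. The following is a minimal sketch, not the project's script: it uses a toy frame with invented values, and it converts types and lowercases text before imputing, which is a variation on the order listed above. The column names `total_laid_off` and `industry` are used for illustration; `industry` is an assumed name.

```python
import pandas as pd

# Toy data for illustration only; the values are invented, not taken from the dataset.
raw = pd.DataFrame({
    "industry": ["Retail", "retail", None, "Finance"],
    "total_laid_off": ["100", "200", None, "n/a"],  # numbers stored as text
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Convert types first, then normalize text, then impute."""
    out = frame.drop_duplicates().copy()

    # 1. Coerce to numeric so unparseable entries such as "n/a" become NaN.
    out["total_laid_off"] = pd.to_numeric(out["total_laid_off"], errors="coerce")

    # 2. Lowercase text before taking the mode, so "Retail" and "retail" count together.
    out["industry"] = out["industry"].str.lower()

    # 3. Impute: rounded mean for the numeric column, mode for the categorical one.
    mean_value = round(out["total_laid_off"].mean(), 2)
    out["total_laid_off"] = out["total_laid_off"].fillna(mean_value)
    out["industry"] = out["industry"].fillna(out["industry"].mode()[0])
    return out


cleaned = clean(raw)
print(cleaned)
```

If the conversion ran after imputation instead, a text-typed numeric column would be mode-filled as if it were categorical, and `errors="coerce"` could reintroduce missing values afterwards.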
After cleaning the data, I used Power BI to create a set of insightful visualizations:
- Line Chart: Company Layoffs by Year and Industry
- Map Chart: Layoffs by Country
- Summary Card: Total Layoffs (1.03M)
  - A card visual displays the total number of layoffs across all years and industries, which came to 1.03 million.
  - This gives stakeholders a quick summary of the data.
- Dashboard: Consolidated View
  - A consolidated Power BI dashboard that combines the line chart, map chart, and summary card into a single interactive interface where stakeholders can explore layoff trends, geographic distribution, and overall totals.
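The aggregates behind these visuals can be cross-checked in pandas before building the report. This is a rough sketch on invented sample rows, not the Power BI logic itself; the column names `industry` and `country` are assumptions about the cleaned file, while `date` and `total_laid_off` appear in the cleaning script.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration.
sample = pd.DataFrame({
    "date": ["2023-01-15", "2023-06-01", "2024-02-10", "2024-03-05"],
    "industry": ["retail", "finance", "retail", "retail"],
    "country": ["united states", "india", "united states", "germany"],
    "total_laid_off": [100, 50, 200, 25],
})

# Derive the year from the standardized YYYY-MM-DD date strings.
sample["year"] = pd.to_datetime(sample["date"]).dt.year

# Line chart: layoffs per year and industry.
by_year_industry = (
    sample.groupby(["year", "industry"], as_index=False)["total_laid_off"].sum()
)

# Map chart: layoffs per country, largest first.
by_country = (
    sample.groupby("country")["total_laid_off"].sum().sort_values(ascending=False)
)

# Summary card: grand total across all rows.
total_layoffs = int(sample["total_laid_off"].sum())

print(by_year_industry)
print(by_country)
print(f"Total layoffs: {total_layoffs:,}")
```

Comparing these numbers with the card and chart values is a quick way to catch a filter or relationship that was set up incorrectly in the report.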
Below is a summary of the key Python code used for cleaning the data:
import pandas as pd
import numpy as np
import sqlite3

# Load the dataset
data = pd.read_csv(r'csv/layoffs_2024.csv')

# Initial data inspection (info() prints its own summary and returns None)
print("Initial data structure:")
data.info()

# Remove duplicate rows
data = data.drop_duplicates()

# Impute missing values: rounded mean for numeric columns
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
for column in numeric_columns:
    mean_value = round(data[column].mean(), 2)
    data[column] = data[column].fillna(mean_value)

# Impute missing values: mode for categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    mode_value = data[column].mode()[0]
    data[column] = data[column].fillna(mode_value)

# Normalize text data
for column in categorical_columns:
    data[column] = data[column].str.lower()

# Standardize date formats
if 'date' in data.columns:
    data['date'] = pd.to_datetime(data['date']).dt.strftime('%Y-%m-%d')

# Ensure correct data types
data['total_laid_off'] = pd.to_numeric(data['total_laid_off'], errors='coerce')
data['percentage_laid_off'] = pd.to_numeric(data['percentage_laid_off'], errors='coerce')
data['funds_raised'] = pd.to_numeric(data['funds_raised'], errors='coerce')

# Save cleaned data to CSV and SQL
cleaned_file_path = 'cleaned_layoffs_2024.csv'
data.to_csv(cleaned_file_path, index=False)
conn = sqlite3.connect('cleaned_layoffs.db')
data.to_sql('cleaned_layoffs', conn, if_exists='replace', index=False)
conn.close()

- Load Dataset: The dataset is loaded from a CSV file.
- Remove Duplicates: Duplicate rows are dropped.
- Impute Missing Values: Missing values in numeric and categorical columns are imputed using the mean and mode, respectively.
- Normalize Data: Text data is converted to lowercase, and date columns are standardized.
- Save the Cleaned Data: The cleaned data is saved in both CSV and SQLite formats.
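Once the data is in SQLite, it can be queried with SQL and read back into pandas. The sketch below is illustrative only: it uses an in-memory database with invented rows as a stand-in for `cleaned_layoffs.db`, and `industry` is an assumed column name. The table name `cleaned_layoffs` matches the one written by the script.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for cleaned_layoffs.db so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
try:
    # Invented rows for illustration; `industry` is an assumed column name.
    pd.DataFrame({
        "industry": ["retail", "finance", "retail"],
        "total_laid_off": [100.0, 50.0, 200.0],
    }).to_sql("cleaned_layoffs", conn, if_exists="replace", index=False)

    # Aggregate in SQL, then load the result into a DataFrame.
    query = """
        SELECT industry, SUM(total_laid_off) AS layoffs
        FROM cleaned_layoffs
        GROUP BY industry
        ORDER BY layoffs DESC
    """
    per_industry = pd.read_sql_query(query, conn)
finally:
    # Close the connection even if the query fails.
    conn.close()

print(per_industry)
```

To run the same query against the real output, replace `":memory:"` with `'cleaned_layoffs.db'` and drop the `to_sql` call.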
- Data Cleaned: The dataset was cleaned of duplicate entries, and missing values were imputed.
- Visualizations Created: After cleaning the data, the following Power BI visualizations were created:
  - Line Chart: Showed layoffs by year and industry from 2020 to 2024.
  - Map Chart: Visualized the geographic distribution of layoffs by country.
  - Summary Card: Displayed the total number of layoffs (1.03M).
- Clone or download the repository.
- Make sure you have the required Python libraries installed (sqlite3 ships with Python's standard library and does not need to be installed):
pip install pandas numpy
- Place the layoffs_2024.csv file in the csv/ directory.
- Run the Python script to clean the dataset:
python Data_Cleaning_and_Preparation.py
- Import the cleaned dataset (cleaned_layoffs_2024.csv) into Power BI for visualization.
- Dataset: Layoffs.fyi
- Author: Roger Lee
- Project Developer: Willie Conway
This project is licensed under the MIT License - see the LICENSE file for details.