How to perform exploratory data analysis
Imagine you have a dataset of students' test scores and demographics.
Here's a simplified step-by-step approach:
1. Load Data:
o Read the data into a table.
o Example: students = pd.read_csv('students.csv')
2. Initial Exploration:
o Look at the first few rows: students.head()
o Check data structure: students.info()
3. Summary Statistics:
o Calculate mean and median of test scores.
o Count unique values in gender column.
4. Handle Missing Data:
o Identify missing entries: students.isnull().sum()
o Fill missing scores with the mean or remove those rows.
5. Visualize Data:
o Histogram of test scores.
o Box plot of test scores by gender.
o Scatter plot of test scores versus study hours.
6. Find Patterns:
o Calculate correlation between study hours and test scores.
o Cross-tabulate test scores and extracurricular participation.
7. Identify Outliers:
o Use IQR to find unusually high or low test scores.
o Use Z-score to find test scores that are far from the average.
8. Feature Engineering:
o Create a new feature combining study hours and class
participation.
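The non-plotting steps above (3, 4, 6, and 8) can be sketched on a tiny in-memory table. This is only an illustration under stated assumptions: the data values, the column names (score, gender, study_hours, extracurricular, class_participation), the pass mark of 75 used to bin scores for the cross-tabulation, and the choice of a product for the combined feature are all hypothetical and not taken from any real students.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for students.csv; every column name and value is an assumption.
students = pd.DataFrame({
    "score": [72.0, 85.0, np.nan, 90.0, 65.0, 78.0],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "study_hours": [5, 8, 4, 10, 3, 6],
    "extracurricular": ["yes", "no", "yes", "no", "no", "yes"],
    "class_participation": [3, 4, 2, 5, 1, 3],
})

# Step 3: summary statistics (pandas skips NaN by default).
mean_score = students["score"].mean()
median_score = students["score"].median()
n_genders = students["gender"].nunique()

# Step 4: count missing entries per column, then fill missing scores with the mean.
missing_per_column = students.isnull().sum()
students["score"] = students["score"].fillna(mean_score)

# Step 6: correlation between study hours and scores.
corr = students["study_hours"].corr(students["score"])

# Step 6: cross-tabulation. Scores are continuous, so they are binned first;
# the pass mark of 75 is an assumed threshold.
students["result"] = np.where(students["score"] >= 75, "pass", "fail")
crosstab = pd.crosstab(students["extracurricular"], students["result"])

# Step 8: one possible combined feature (a product is an assumption, not a rule).
students["engagement"] = students["study_hours"] * students["class_participation"]

print(mean_score, median_score, n_genders)
print(missing_per_column)
print(round(corr, 3))
print(crosstab)
```

Filling with the mean keeps every row but pulls the filled values toward the center of the distribution; dropping the rows instead avoids that at the cost of a smaller sample.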
Performing Exploratory Data Analysis (EDA) involves several steps, from understanding the structure of
the data to summarizing its main characteristics. Below is a detailed guide on how to perform EDA using
Python with libraries like Pandas, Matplotlib, Seaborn, and SciPy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import ttest_ind
# Load data
df = pd.read_csv('your_dataset.csv')
# Understand data structure
print(df.head())
print(df.shape)
print(df.info())
print(df.describe())
# Data cleaning: drop rows with any missing value, then exact duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
# Univariate analysis
df['column_name'].hist(bins=30)
plt.show()
sns.boxplot(x=df['column_name'])
plt.show()
# Bivariate analysis
plt.scatter(df['column_x'], df['column_y'])
plt.xlabel('column_x')
plt.ylabel('column_y')
plt.show()
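# Restrict to numeric columns; non-numeric columns make corr() raise in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)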
sns.heatmap(correlation_matrix, annot=True)
plt.show()
# Categorical data analysis
df['categorical_column'].value_counts().plot(kind='bar')
plt.show()
# Identifying outliers using IQR
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) |
              (df['column_name'] > upper_bound)]
print(outliers)
# Identifying outliers using Z-Score
df['z_score'] = stats.zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]
print(outliers)
# Feature engineering
df['new_feature'] = df['feature1'] + df['feature2']
# Visualizing relationships
sns.pairplot(df)
plt.show()
# Hypothesis testing
group1 = df[df['group_column'] == 'group1']['numeric_column']
group2 = df[df['group_column'] == 'group2']['numeric_column']
t_stat, p_value = ttest_ind(group1, group2)
print(f'T-statistic: {t_stat}, P-value: {p_value}')
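The two outlier rules in the code above do not always agree, especially on small samples. The sketch below applies both to ten made-up scores (the values are purely illustrative). The Z-score is computed by hand with the population standard deviation (ddof=0), which is what scipy.stats.zscore uses by default. With that definition no value in a sample of size n can have an absolute Z-score above sqrt(n - 1), so for n = 10 the threshold of 3 can never be exceeded.

```python
import pandas as pd

# Made-up sample: nine typical scores and one extreme value.
scores = pd.Series([70, 72, 68, 75, 71, 69, 73, 74, 70, 150])

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = scores[(scores < lower_bound) | (scores > upper_bound)]

# Z-score rule: flag values more than 3 population standard deviations from the mean.
z = (scores - scores.mean()) / scores.std(ddof=0)
z_outliers = scores[z.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
print("Largest |z|:", round(z.abs().max(), 3))
```

Here the IQR rule flags 150 while the Z-score rule flags nothing, because the extreme value inflates the mean and standard deviation it is measured against. On small or heavily skewed samples the IQR rule is often the more informative of the two.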
This workflow provides a structured approach to performing EDA, helping you understand the dataset's
characteristics and relationships before moving on to more complex analysis or modeling.