Step-by-Step EDA in Python
1. Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
2. Load the Data
df = pd.read_csv("data.csv") # Load CSV into a DataFrame
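To try the steps below without a real data.csv, a small stand-in can be read from an in-memory string. This is only a sketch: the rows are made-up sample data, and the column names (age, gender, salary, purchased) are taken from the later steps of this guide.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for data.csv (values are invented)
csv_text = """age,gender,salary,purchased
25,F,50000,yes
32,M,64000,no
47,F,,yes
32,M,64000,no
51,M,120000,yes
"""

# read_csv accepts file-like objects as well as file paths
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the empty salary field is read as NaN, so salary becomes a float column
```

Swapping io.StringIO(csv_text) for a file path gives the same call as in step 2.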
3. Understand the Structure
df.shape # (rows, columns)
df.columns # list of column names
df.info() # summary: data types, non-null counts, memory usage
df.head() # first 5 rows
df.tail() # last 5 rows
4. Data Types & Summary Stats
df.dtypes # check data types of each column
df.describe() # statistical summary (count, mean, std, min, max, quartiles) for numeric columns
df['col'].value_counts() # frequency of unique values in a column
df['col'].unique() # unique values
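value_counts() also takes a couple of options that are often handy during EDA. A minimal sketch on a made-up gender column (the data is hypothetical):

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
gender = pd.Series(["F", "M", "F", "F", None], name="gender")

counts = gender.value_counts()                     # raw counts, missing values excluded
shares = gender.value_counts(normalize=True)       # proportions instead of counts
with_missing = gender.value_counts(dropna=False)   # also counts the missing entry

print(counts)
print(shares)
print(with_missing)
```

By default missing values are left out, so dropna=False is worth using when the amount of missing data matters.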
5. Missing Values & Duplicates
df.isnull().sum() # count missing values per column
df.duplicated().sum() # count duplicate rows
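Once the counts are known, the usual next step is to decide how to handle them. The sketch below uses invented data and one possible strategy (drop duplicate rows, fill numeric gaps with the median, fill categorical gaps with a placeholder); the right choice depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age, one missing gender and one duplicated row
df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 51],
    "gender": ["F", "M", "F", "M", None],
    "salary": [50000.0, 64000.0, 58000.0, 64000.0, 120000.0],
})

missing_per_column = df.isnull().sum()   # missing values per column
n_duplicates = df.duplicated().sum()     # rows identical to an earlier row

deduped = df.drop_duplicates()           # returns a new DataFrame, df is unchanged

filled = deduped.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())  # numeric: fill with median
filled["gender"] = filled["gender"].fillna("unknown")         # categorical: placeholder label

print(missing_per_column)
print(n_duplicates)
print(filled)
```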
6. Univariate Analysis
# Numerical Column
sns.histplot(df['age'], bins=20, kde=True)
plt.show()
# Categorical Column
sns.countplot(x='gender', data=df)
plt.show()
7. Bivariate Analysis
# Numerical vs Numerical
sns.scatterplot(x='age', y='salary', data=df)
plt.show()
# Numerical vs Categorical
sns.boxplot(x='gender', y='salary', data=df)
plt.show()
# Categorical vs Categorical
pd.crosstab(df['gender'], df['purchased'])
8. Correlation
df.corr(numeric_only=True) # correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
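A single coefficient can be pulled out of the correlation matrix with .loc. The sketch below uses made-up data in which salary is exactly 2000 times age, so the Pearson correlation should come out as 1 (up to floating-point rounding). Note that numeric_only is only available in pandas 1.5 and later.

```python
import pandas as pd

# Hypothetical data: salary is exactly 2000 * age, gender is non-numeric
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "salary": [50000, 64000, 94000, 102000, 76000],
    "gender": ["F", "M", "F", "M", "F"],
})

corr = df.corr(numeric_only=True)         # gender is skipped because it is not numeric
age_salary = corr.loc["age", "salary"]    # one entry of the matrix

print(corr)
print(age_salary)
```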
9. Outlier Detection
sns.boxplot(x=df['salary'])
plt.show()
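The boxplot shows outliers visually; to list them as rows, one common option is the 1.5 x IQR rule, which is also what boxplot whiskers use by default. A minimal sketch on invented salary values:

```python
import pandas as pd

# Hypothetical salaries with one obviously extreme value
salary = pd.Series(
    [48000, 50000, 52000, 55000, 58000, 61000, 64000, 250000],
    name="salary",
)

q1 = salary.quantile(0.25)   # first quartile
q3 = salary.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # values below this are flagged
upper = q3 + 1.5 * iqr       # values above this are flagged

outliers = salary[(salary < lower) | (salary > upper)]
print(outliers)
```

Flagged values are candidates for a closer look, not automatically errors.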
Most Commonly Used Pandas Syntax
df.head() -> Show first 5 rows
df.tail() -> Show last 5 rows
df.shape -> Get (rows, cols)
df.info() -> Data types + non-null counts
df.describe() -> Stats summary
df['col'].value_counts() -> Frequency of values
df.isnull().sum() -> Missing values per column
df.dropna() -> Remove rows with missing values
df.fillna(value) -> Fill missing with given value
df.drop_duplicates() -> Remove duplicates
df.sort_values(by='col') -> Sort by column
df.groupby('col').mean(numeric_only=True) -> Group and aggregate (mean of numeric columns per group)
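Grouping is often more readable when the column and the aggregations are named explicitly. A short sketch with hypothetical data, computing the mean salary and the row count per gender:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "salary": [50000, 64000, 70000, 58000],
})

# Select the column first, then apply several aggregations at once
summary = df.groupby("gender")["salary"].agg(["mean", "count"])

# Sort groups by mean salary, highest first
summary_sorted = summary.sort_values(by="mean", ascending=False)

print(summary_sorted)
```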