EDA Step by Step

The document outlines a step-by-step guide for performing Exploratory Data Analysis (EDA) in Python using libraries like pandas, numpy, matplotlib, and seaborn. It covers data loading, structure understanding, missing values, univariate and bivariate analysis, correlation, and outlier detection. Additionally, it includes commonly used pandas syntax for data manipulation and analysis.

Uploaded by Pranay Tandel
Copyright © All Rights Reserved

Step-by-Step EDA in Python

1. Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the Data


df = pd.read_csv("data.csv") # Load CSV into a DataFrame
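If no data.csv is at hand, the same call can be tried on an in-memory file. The sketch below is only an illustration: the sample rows are made up, and the column names (age, gender, salary, purchased) are borrowed from the plotting steps later in this guide.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for "data.csv".
csv_text = """age,gender,salary,purchased
25,F,40000,yes
32,M,52000,no
47,F,,yes
32,M,52000,no
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the empty salary field is read as a missing value
```

The empty salary field and the repeated row are included on purpose so that the missing-value and duplicate checks in step 5 have something to find.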

3. Understand the Structure


df.shape # (rows, columns)
df.columns # list of column names
df.info() # summary: datatypes, non-null counts, memory usage
df.head() # first 5 rows
df.tail() # last 5 rows

4. Data Types & Summary Stats


df.dtypes # check data types of each column
df.describe() # statistical summary (mean, min, max, quartiles) for numeric columns
df['col'].value_counts() # frequency of unique values in a column
df['col'].unique() # unique values
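As a small illustration of the column-level calls above, the sketch below uses a made-up gender column. The normalize=True variant is an addition to what the guide lists; it returns shares instead of counts.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"gender": ["F", "M", "F", "F"]})

counts = df["gender"].value_counts()                # absolute frequencies, most common first
shares = df["gender"].value_counts(normalize=True)  # relative frequencies that sum to 1
n_unique = df["gender"].nunique()                   # number of distinct values

print(counts)
print(shares)
print(n_unique)
```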

5. Missing Values & Duplicates


df.isnull().sum() # count missing values per column
df.duplicated().sum() # count duplicate rows
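Counting missing values and duplicates is usually followed by deciding what to do with them. The sketch below is one possible way, using the dropna / fillna / drop_duplicates calls from the syntax list at the end of this guide. The data is made up, and filling with the median is an assumption chosen for illustration, not a recommendation from the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary and one repeated row.
df = pd.DataFrame({
    "age": [25, 32, 47, 32],
    "salary": [40000.0, 52000.0, np.nan, 52000.0],
})

missing_per_column = df.isnull().sum()  # Series: column name -> number of NaN values
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row

# Both calls return new DataFrames; df itself is left unchanged.
deduped = df.drop_duplicates()
filled = deduped.fillna({"salary": deduped["salary"].median()})

print(missing_per_column)
print(duplicate_rows)
print(filled)
```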

6. Univariate Analysis

# Numerical Column
sns.histplot(df['age'], bins=20, kde=True)
plt.show()

# Categorical Column
sns.countplot(x='gender', data=df)
plt.show()

7. Bivariate Analysis

# Numerical vs Numerical
sns.scatterplot(x='age', y='salary', data=df)
plt.show()

# Numerical vs Categorical
sns.boxplot(x='gender', y='salary', data=df)
plt.show()

# Categorical vs Categorical
pd.crosstab(df['gender'], df['purchased']) # contingency table of counts
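Raw counts can be hard to compare when the groups differ in size. As a sketch with made-up data, the crosstab can also be expressed as row shares through the normalize argument:

```python
import pandas as pd

# Hypothetical data; column names follow the examples above.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "purchased": ["yes", "no", "yes", "yes", "no"],
})

counts = pd.crosstab(df["gender"], df["purchased"])
# normalize="index" divides each row by its total, so every row sums to 1.
row_shares = pd.crosstab(df["gender"], df["purchased"], normalize="index")

print(counts)
print(row_shares)
```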

8. Correlation

corr = df.corr(numeric_only=True) # correlation matrix (computed once, reused below)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
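The heatmap is a view of an ordinary DataFrame, so individual coefficients can also be read off directly. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data; "gender" is included to show that text columns are skipped.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "salary": [40000, 52000, 75000, 81000, 60000],
    "gender": ["F", "M", "F", "M", "F"],
})

corr = df.corr(numeric_only=True)       # Pearson correlation by default
age_salary = corr.loc["age", "salary"]  # single coefficient between -1 and 1

print(corr)
print(age_salary)
```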

9. Outlier Detection

sns.boxplot(x=df['salary']) # points drawn beyond the whiskers are potential outliers
plt.show()
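The box plot flags outliers visually. To get the flagged rows as data, the same 1.5 x IQR rule that box plot whiskers use by default can be applied by hand. This is a sketch with made-up salaries; the 1.5 multiplier is the conventional default, not something the original guide specifies.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
salary = pd.Series(
    [40000, 42000, 45000, 47000, 50000, 52000, 250000], name="salary"
)

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1                  # interquartile range

lower = q1 - 1.5 * iqr         # values below this are flagged
upper = q3 + 1.5 * iqr         # values above this are flagged

outliers = salary[(salary < lower) | (salary > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still has to be judged from the data's context.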

Most Commonly Used Pandas Syntax


df.head() -> Show first 5 rows
df.tail() -> Show last 5 rows
df.shape -> Get (rows, cols)
df.info() -> Data types + non-null counts
df.describe() -> Stats summary
df['col'].value_counts() -> Frequency of values
df.isnull().sum() -> Missing values
df.dropna() -> Remove rows with missing values (returns a new DataFrame)
df.fillna(value) -> Fill missing values with the given value
df.drop_duplicates() -> Remove duplicate rows
df.sort_values(by='col') -> Sort rows by column
df.groupby('col').mean(numeric_only=True) -> Group by column and average the numeric columns
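The last two entries are often combined with a few options. The sketch below, on made-up data, sorts in descending order and uses named aggregation to compute a different statistic per column. The output column names mean_age and max_salary are hypothetical labels chosen for this example.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [25, 32, 47, 51],
    "salary": [40000, 52000, 75000, 81000],
})

# Highest salary first; sort_values returns a new DataFrame.
by_salary = df.sort_values(by="salary", ascending=False)

# Named aggregation: new_column=(source_column, function).
summary = df.groupby("gender").agg(
    mean_age=("age", "mean"),
    max_salary=("salary", "max"),
)

print(by_salary)
print(summary)
```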
