Python & EDA Learning Guide
This guide is a structured learning and reference resource for Python, data science
basics, and Exploratory Data Analysis (EDA). It starts with the basics of Python,
explains why Python is often preferred over Excel for data analysis, introduces the key
libraries, and then walks through a step-by-step EDA process with code examples. A
ready-to-run EDA template is also included.
1. Python Basics
Python is a versatile, high-level programming language used in multiple domains:
• Data Science / Machine Learning / Artificial Intelligence
• Web Development (Django, Flask)
• App Development
• Automation / Scripting
Python is popular because it is simple, has a huge community, and comes with many
powerful libraries.
2. Why Python for Data Analysis (vs Excel)
While Excel is great for small datasets, Python offers significant advantages:
• Handles large datasets (millions of rows); an Excel worksheet is capped at 1,048,576 rows.
• Reproducible workflows (write code once, re-run anytime).
• Automates repetitive tasks.
• Rich visualization and statistical libraries.
• Works directly with Machine Learning and AI libraries.
For these reasons, Python is generally preferred for Data Science work.
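To make the reproducibility point concrete, here is a minimal sketch that uses a small in-memory table in place of a real file. The function name `summarize_sales` and the column names `region` and `amount` are hypothetical, chosen only for illustration.

```python
import pandas as pd


def summarize_sales(df):
    """Drop rows with a missing amount, then total the amount per region."""
    cleaned = df.dropna(subset=["amount"])
    return cleaned.groupby("region")["amount"].sum()


# Hypothetical sample data standing in for a real dataset
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [100.0, 250.0, None, 50.0],
})

totals = summarize_sales(sales)
print(totals)
```

Because the cleaning and aggregation steps live in a function, the same analysis can be re-run unchanged whenever the data is refreshed.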
3. Key Python Libraries for Data Science
3.1 Pandas
Pandas is a Python library for handling structured data (like tables). Think of it as Excel
inside Python, but much more powerful.
Key objects:
• DataFrame → Table (rows & columns).
• Series → Single column.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
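The relationship between the two key objects can be sketched without a CSV file. The names and ages below are made-up sample data, not part of any real dataset.

```python
import pandas as pd

# Hypothetical data; in practice this would usually come from pd.read_csv
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [28, 35, 42],
})

ages = df["age"]             # selecting one column gives a Series

print(type(df).__name__)     # DataFrame
print(type(ages).__name__)   # Series
print(df.shape)              # (3, 2): 3 rows, 2 columns
print(ages.mean())           # 35.0
```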
3.2 NumPy
NumPy (Numerical Python) is used for numerical operations. It provides arrays and
mathematical functions used heavily in data processing and machine learning.
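A short sketch of what array-based operations look like in practice; the values are arbitrary sample numbers.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

doubled = values * 2                       # element-wise, no explicit loop
mean_value = values.mean()                 # 2.5
above_mean = values[values > mean_value]   # boolean mask keeps 3.0 and 4.0

print(doubled)
print(mean_value)
print(above_mean)
```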
3.3 Scikit-learn
Scikit-learn is a Machine Learning library used for classification, regression,
clustering, model evaluation, and more.
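As a rough illustration of a classification workflow, the sketch below trains a model on the iris dataset that ships with scikit-learn. The choice of logistic regression, the 25% test split, and `random_state=0` are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Built-in sample dataset: 150 flowers, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```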
3.4 TensorFlow / PyTorch
These are deep learning frameworks used for building and training Artificial Neural
Networks (ANNs), including computer vision and natural language processing models.
4. Step-by-Step EDA Process
Below are the commonly used steps, code, and syntax in EDA:
1. Import Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
2. Load Data:
df = pd.read_csv('file.csv')
3. Basic Info:
df.head() # First 5 rows
df.tail() # Last 5 rows
df.info() # Column names, data types, nulls
df.describe() # Summary stats
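The overview calls above can be tried on a small in-memory table. This is a sketch with made-up data; it also shows two optional arguments, `head(n)` and `describe(include="all")`, that the list above does not cover.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "score": [10, 20, 30, 40],
})

first_two = df.head(2)                       # head() takes an optional row count
numeric_summary = df.describe()              # numeric columns only by default
full_summary = df.describe(include="all")    # also summarises text columns

print(first_two)
print(numeric_summary.loc["mean", "score"])  # 25.0
print(full_summary.loc["unique", "city"])    # 3
```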
4. Data Cleaning:
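df.isnull().sum() # Count of missing values per column
df.dropna() # Returns a copy without rows that contain missing values; assign it to keep the result
df.fillna(value) # Returns a copy with missing values replaced by value (e.g. 0 or a column mean)
df.duplicated().sum() # Count of duplicate rows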
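The cleaning calls return new objects rather than changing `df` in place, which is easy to miss. A minimal sketch with made-up data follows; filling with the median and calling `drop_duplicates()` are illustrative choices, not the only options.

```python
import pandas as pd

# Hypothetical sample data with one missing age and one repeated row
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 31.0],
    "city": ["Pune", "Delhi", "Mumbai", "Mumbai"],
})

missing_per_column = df.isnull().sum()            # age has 1 missing value
dropped = df.dropna()                             # new DataFrame; df is unchanged
filled = df.fillna({"age": df["age"].median()})   # fill age with its median (31.0)
deduplicated = df.drop_duplicates()               # removes the repeated Mumbai row

print(missing_per_column)
print(len(df), len(dropped), len(deduplicated))   # 4 3 3
```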
5. Data Exploration:
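df['column'].value_counts() # Frequency of each distinct value
df['column'].unique() # Array of distinct values
df.corr(numeric_only=True) # Pairwise correlation of numeric columns (argument needs pandas 1.5+)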
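A small sketch of the exploration calls on made-up data. It assumes pandas 1.5 or later for the `numeric_only` argument; in pandas 2.0 and later, calling `corr()` without it on a table that contains text columns raises an error.

```python
import pandas as pd

# Hypothetical sample data: one text column, two numeric columns
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
})

counts = df["city"].value_counts()    # Pune appears twice
cities = df["city"].unique()          # distinct values in order of appearance
corr = df.corr(numeric_only=True)     # skips the text column "city"

print(counts)
print(cities)
print(corr)
```

Here `weight` is exactly twice `height`, so their correlation comes out as 1 (up to floating-point rounding).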
6. Visualization:
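sns.histplot(df['col']) # Distribution of one numeric column
sns.boxplot(x='col', data=df) # Spread and outliers
sns.heatmap(df.corr(numeric_only=True), annot=True) # Correlations between numeric columns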
plt.show()
5. Ready-to-Run Python EDA Template
Here is a template you can use directly; replace the file path with the path to your own dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df = pd.read_csv('your_file.csv')
# Basic overview
print(df.shape)
df.info() # info() prints its report itself and returns None, so no print() is needed
print(df.describe())
# Missing values
print(df.isnull().sum())
# Duplicates
print(df.duplicated().sum())
# Correlation heatmap (numeric columns only; numeric_only needs pandas 1.5+)
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
# Distribution of each numeric column
for col in df.select_dtypes(include='number').columns:
    sns.histplot(df[col], kde=True)
    plt.title(col)
    plt.show()
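If the same checks are needed across several datasets, the non-graphical part of the template could be wrapped in a function. The sketch below is one possible shape; the function name `eda_overview` and the sample data are hypothetical.

```python
import pandas as pd


def eda_overview(df):
    """Collect the template's non-graphical checks into one dictionary."""
    return {
        "shape": df.shape,
        "missing": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
    }


# Hypothetical sample data: one repeated row and one missing value
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "label": ["a", "b", "b", "c"],
    "value": [0.5, 1.5, 1.5, None],
})

overview = eda_overview(sample)
print(overview)
```

Returning a dictionary rather than printing makes the results easy to log or to compare between runs.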