What is the Pandas Library?
Pandas is an open-source Python library widely used for data manipulation, analysis, and
preprocessing tasks. It is a fundamental library for data science and analytics and is built on top
of NumPy, providing high-level data structures and methods to work with structured data
efficiently.
Pandas primarily offers two data structures for handling data:-
Series: A one-dimensional labeled array capable of holding any data type.
DataFrame: A two-dimensional labeled data structure, similar to a table in relational databases
or an Excel spreadsheet, consisting of rows and columns.
Features of Pandas:-
• Fast and Efficient Data Manipulation: Pandas provides a variety of functions to
manipulate, clean, and analyze data efficiently.
• Handling Missing Data: Pandas can detect, fill, or remove missing values, making it easier
to preprocess datasets.
• Data Alignment and Merging: It supports database-like operations, such as merging,
joining, and reshaping data.
• Label-Based Slicing and Indexing: Pandas allows access to data using row/column labels
as well as positional indexing.
• Group By Functionality: It provides powerful group-by capabilities, allowing you to split
data, apply functions, and combine results.
• Data Cleaning: Pandas simplifies tasks like renaming columns, handling missing values,
and removing duplicates.
• Support for Time-Series Data: Pandas provides specialized tools for handling time-series
data, including date-based indexing, resampling, and rolling-window calculations.
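The time-series tools named in the last bullet can be sketched as follows. This is a minimal illustration only: the dates and readings are invented for the example, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical daily readings indexed by date (values invented for illustration)
dates = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10, 12, 11, 15, 14, 18], index=dates)

# Date-based indexing: select a range of days by label (both ends included)
first_three = readings.loc["2024-01-01":"2024-01-03"]

# Resampling: total of the readings in each 2-day bin
two_day_totals = readings.resample("2D").sum()

# Rolling window: 3-day moving average (the first two entries are NaN)
rolling_mean = readings.rolling(window=3).mean()

print(first_three)
print(two_day_totals)
print(rolling_mean)
```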
1.Installing Pandas:-
Before using Pandas, make sure it is installed. If it is not, you can install it with pip:
pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (2.1.4)
Requirement already satisfied: numpy<2,>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
2.Importing Pandas:-
To start working with Pandas, first import it into your Python script. By convention it is
imported under the alias pd:
import pandas as pd
3.Pandas Data Structures:-
Series: A Series is essentially a one-dimensional array, similar to a column in a spreadsheet or a
list in Python, but with a label for each element (the labels together form the index).
import pandas as pd
# Creating a Series
data = [1, 3, 5, 7, 9]
series = pd.Series(data)
print(series)
• Indexing: You can access the elements of a Series using its index.
print(series[2]) # Outputs: 5
• Custom Index: You can also define custom labels for the Series index.
series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(series['c']) # Outputs: 5
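To make the difference between labels and positions explicit, the sketch below accesses the same Series through .loc (labels) and .iloc (positions), and shows that arithmetic between two Series lines values up by label. The second Series, named other, is a made-up example.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=["a", "b", "c", "d", "e"])

by_label = s.loc["c"]         # label-based access
by_position = s.iloc[2]       # position-based access (same element here)
label_slice = s.loc["b":"d"]  # label slices include both endpoints

# Arithmetic aligns on labels, not positions; labels missing from
# `other` are treated as 0 because of fill_value
other = pd.Series([10, 20], index=["a", "e"])
total = s.add(other, fill_value=0)

print(by_label, by_position)
print(label_slice)
print(total)
```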
DataFrame: A DataFrame is a two-dimensional data structure, similar to a table in a relational
database or an Excel spreadsheet. It consists of rows and columns.
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35],
'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
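A DataFrame can also be built from a list of row records, and each of its columns is itself a Series. The following is a small sketch using the same sample data; the names records, people and by_name are chosen for this example only.

```python
import pandas as pd

# One dictionary per row instead of one list per column
records = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Paris"},
    {"Name": "Charlie", "Age": 35, "City": "London"},
]
people = pd.DataFrame(records)

# Each column is a Series that shares the DataFrame's row index
ages = people["Age"]
print(type(ages).__name__)  # Series
print(people.dtypes)

# Use a column as the row labels instead of the default 0, 1, 2, ...
by_name = people.set_index("Name")
print(by_name.loc["Bob", "City"])  # Paris
```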
4.Reading Data into Pandas:-
Pandas makes it easy to load data from various file formats, such as CSV and Excel, as well as
from SQL databases.
• Reading CSV Files
# Reading data from a CSV file
df = pd.read_csv('data.csv')
print(df.head()) # Prints the first 5 rows of the DataFrame
• Reading Excel Files
# Reading data from an Excel file
# (reading .xlsx files requires an Excel engine such as openpyxl to be installed)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
• Reading from SQL Databases
import sqlite3
# Connecting to a SQL database and reading data into a DataFrame
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM tablename', conn)
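The snippets above depend on files such as data.csv and database.db being present. As a self-contained sketch, the same calls can be tried against in-memory data: the CSV text and the table named people below are invented stand-ins, not part of any real dataset.

```python
import io
import sqlite3

import pandas as pd

# CSV text held in memory stands in for a file on disk
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Paris\n"
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)

# In-memory SQLite database with a hypothetical table called "people"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Alice", 25), ("Bob", 30)])
conn.commit()

# Pass values through params rather than formatting them into the SQL string
sql_df = pd.read_sql_query("SELECT * FROM people WHERE age > ?", conn, params=(26,))
conn.close()
print(sql_df)
```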
5.Basic Operations with DataFrames:-
• Viewing Data:
.head(): Displays the first rows of the DataFrame (five by default).
.tail(): Displays the last rows of the DataFrame (five by default).
print(df.head()) # View the first 5 rows
print(df.tail()) # View the last 5 rows
• Inspecting Data:
.shape: Returns the dimensions of the DataFrame (rows, columns).
.columns: Returns the column names.
.info(): Provides a concise summary of the DataFrame, including the data types and
non-null counts.
.describe(): Provides descriptive statistics for numeric columns.
print(df.shape) # Get the shape (rows, columns)
print(df.columns) # Get the column names
df.info() # Get information about the DataFrame
print(df.describe()) # Get summary statistics
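Applied to the small sample DataFrame from earlier, the inspection tools behave as sketched below. Note that describe() summarizes only the numeric column unless told otherwise; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

first_two = df.head(2)      # head and tail accept the number of rows
n_rows, n_cols = df.shape   # shape is an attribute, not a method

stats = df.describe()       # numeric columns only: just "Age" here
mean_age = stats.loc["mean", "Age"]

all_stats = df.describe(include="all")  # also summarizes text columns

print(first_two)
print(n_rows, n_cols)
print(stats)
```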
6.Selecting Data from a DataFrame:-
You can select columns with square brackets, and rows (or rows and columns together) with loc
and iloc.
• Selecting Columns:
# Selecting a single column
age_column = df['Age']
# Selecting multiple columns
subset = df[['Name', 'City']]
• Selecting Rows:
– loc: Select rows and columns by label.
– iloc: Select rows and columns by position (index).
# Selecting a row by label using loc
row = df.loc[1] # Selects the row whose index label is 1
# Selecting a row by position using iloc
row = df.iloc[1] # Selects the second row (position 1)
# Selecting a range of rows
subset = df.iloc[0:2] # Selects the first two rows
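With the default index, labels and positions coincide, so loc and iloc appear to do the same thing. The sketch below uses a made-up index of 10, 20, 30 to show where they differ, including the different slicing rules.

```python
import pandas as pd

# Hypothetical row labels 10, 20, 30 so that labels differ from positions
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

row_by_label = df.loc[20]      # the row labelled 20 (Bob)
row_by_position = df.iloc[1]   # the second row (also Bob)

label_slice = df.loc[10:20]    # label slices include both endpoints: 2 rows
position_slice = df.iloc[0:1]  # position slices exclude the end: 1 row

cell = df.loc[30, "Age"]                  # one value by row and column label
name_only = df.loc[[10, 30], ["Name"]]    # selected rows and columns together

print(label_slice)
print(position_slice)
print(cell)
```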
7.Filtering and Querying Data:-
You can filter the rows of a DataFrame by applying conditions on the data.
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
# Filtering rows with multiple conditions
filtered_df = df[(df['Age'] > 25) & (df['City'] == 'New York')]
You can also use the .query() method for filtering:
# Using query method
filtered_df = df.query('Age > 25 & City == "New York"')
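A few other common filtering patterns are sketched here: membership tests with isin, combining conditions with | (or), and referring to a Python variable inside query() with the @ prefix. The threshold min_age and the list cities are example values, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

# Keep rows whose City is one of several values
cities = ["Paris", "London"]
in_cities = df[df["City"].isin(cities)]

# Combine conditions with | (or); each condition needs its own parentheses
young_or_london = df[(df["Age"] < 30) | (df["City"] == "London")]

# Inside query(), @name refers to a Python variable in the calling scope
min_age = 28
older = df.query("Age > @min_age and City != 'New York'")

print(in_cities)
print(young_or_london)
print(older)
```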
8.Modifying Data:-
• Adding New Columns: You can add new columns to the DataFrame by assigning values
to a new column name.
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
• Modifying Existing Columns: You can modify columns by applying operations on them.
# Updating an existing column
df['Age'] = df['Age'] + 1 # Increase each age by 1
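Beyond simple assignment, columns can be derived from other columns, updated only where a condition holds, or renamed. The following sketch assumes a 10% bonus rate and a raise of 5000, both invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# Derived column computed from an existing one (vectorized, no loop needed)
df["Bonus"] = df["Salary"] * 0.10

# Conditional update: change Salary only in rows where Age > 30
df.loc[df["Age"] > 30, "Salary"] += 5000

# rename returns a new DataFrame, so assign the result back
df = df.rename(columns={"Salary": "AnnualSalary"})

# assign returns a copy with the extra column and leaves df unchanged
with_flag = df.assign(Senior=df["Age"] >= 30)

print(with_flag)
```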
9.Handling Missing Data:-
Pandas makes it easy to identify and handle missing data (NaN values).
• Checking for Missing Data:
# Check for missing values in the DataFrame
print(df.isnull())
print(df.isnull().sum()) # Count missing values in each column
• Filling Missing Data: You can fill missing values using the .fillna() method.
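# Fill missing values with a default value
# (assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in newer pandas versions and may not update df)
df['Salary'] = df['Salary'].fillna(0)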
• Dropping Missing Data: You can drop rows or columns with missing values
using .dropna().
# Drop rows with missing data
df.dropna(inplace=True)
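The sample DataFrame used so far has no missing values, so the calls above have nothing to act on. The sketch below builds a small DataFrame with deliberate gaps (the rows and the name Dana are invented) and shows what each step returns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries marked as np.nan
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, np.nan, 35, 40],
    "Salary": [50000, 60000, np.nan, np.nan],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Fill one column with a constant; a dict maps column name to fill value
filled = df.fillna({"Salary": 0})

# Fill a numeric column with its own mean (NaN is ignored when averaging)
age_filled = df["Age"].fillna(df["Age"].mean())

# Drop rows with any missing value, or only rows missing a specific column
complete_rows = df.dropna()
age_known = df.dropna(subset=["Age"])

print(missing_per_column)
print(filled)
print(complete_rows)
```

None of these calls modifies df itself; each returns a new object, which is why the results are assigned to new names.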
10.Grouping and Aggregating Data:-
You can group data based on specific columns and perform aggregation operations like sum,
mean, count, etc.
# Grouping by a column and calculating the mean of another column
grouped_df = df.groupby('City')['Age'].mean()
print(grouped_df)
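Several aggregations can be computed in one pass with agg(). The sketch below uses invented data with repeated cities so that the groups contain more than one row; the output column names mean_age, max_salary and people are chosen for this example.

```python
import pandas as pd

# Hypothetical data with more than one person per city
df = pd.DataFrame({
    "City": ["Paris", "Paris", "London", "London", "London"],
    "Age": [30, 40, 20, 25, 45],
    "Salary": [60000, 80000, 50000, 55000, 75000],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("City").agg(
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
    people=("Age", "count"),
)

print(summary)
```

The group keys (here the city names) become the index of the result, sorted alphabetically by default.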
11.Merging and Joining DataFrames:-
Pandas supports combining DataFrames with the pd.merge() function (also available as the
.merge() method), which works like a SQL join.
# Merging two DataFrames (df1 and df2 must both contain an 'ID' column)
# Inner join on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
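As a runnable sketch of the merge above, the two DataFrames below are invented so that their ID columns only partly overlap, which makes the effect of the how argument visible.

```python
import pandas as pd

# Hypothetical inputs: IDs 2 and 3 appear in both, 1 and 4 in only one
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Salary": [60000, 70000, 80000]})

# Inner join: only IDs present in both DataFrames
inner = pd.merge(df1, df2, on="ID", how="inner")

# Left join: every row of df1; Salary is NaN where df2 has no match
left = pd.merge(df1, df2, on="ID", how="left")

# Outer join: every ID from either side; indicator adds a "_merge" column
# telling which side each row came from
outer = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```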
12.Exporting Data:-
Pandas allows you to export DataFrames to various file formats.
• Exporting to CSV:
# Save DataFrame to a CSV file
df.to_csv('output.csv', index=False)
• Exporting to Excel:
# Save DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
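To check what an export actually contains without writing to disk, to_csv() can write into an in-memory text buffer, and the result can be read straight back. This is a sketch only; the buffer stands in for a real file such as output.csv.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Paris", "London"],
})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the round trip preserves the data
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```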