Pandas

Pandas is a Python library for data analysis and manipulation, created by Wes McKinney in 2008. It provides data structures like Series and DataFrame for handling one-dimensional and two-dimensional data, respectively, and includes functions for data cleaning, transformation, and visualization. The document covers various functionalities of Pandas, including reading CSV and JSON files, handling missing values, and performing data integration and manipulation.

Pandas

• Pandas is a Python library for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

04/04/2025 Pandas. Prof Himanshu Bhusan Mohapatra 2


What is a Series?
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Example
• Create a simple Pandas Series from a list:
• import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)
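• Each value in a Series has a label. If none is given, the values are labeled 0, 1, 2, and so on. A minimal sketch, assuming the made-up labels "x", "y" and "z", shows access by label and by position:

```python
import pandas as pd

a = [1, 7, 2]

# Same values as above, but with explicit (hypothetical) labels
myvar = pd.Series(a, index=["x", "y", "z"])

print(myvar["y"])     # access one value by its label
print(myvar.iloc[0])  # access one value by its position
```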
What is a DataFrame?
• A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
• Example
• Create a simple Pandas DataFrame:
• import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:


df = pd.DataFrame(data)

print(df)
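• Since a DataFrame is a table, single columns and single rows can be pulled out of it. A short sketch using the same data as above:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

print(df["calories"])  # one column, returned as a Series
print(df.loc[0])       # one row (label 0), also returned as a Series
print(df.shape)        # (number of rows, number of columns)
```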
Read CSV Files
• A simple way to store big data sets is to use CSV files (comma-separated values files).
• CSV files contain plain text in a well-known format that almost any tool can read, including Pandas.
• In our examples we will be using a CSV file called 'data.csv'.
• Example
• Load the CSV into a DataFrame:
• import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())
• Tip: use to_string() to print the entire DataFrame.
• Example
• Print the DataFrame without the to_string() method:
• import pandas as pd

df = pd.read_csv('data.csv')

print(df)
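• The difference between the two examples only shows up on larger data sets: print(df) shortens the output once the number of rows exceeds the pd.options.display.max_rows setting (60 by default), while to_string() always renders every row. The sketch below assumes a made-up 100-row CSV held in memory in place of 'data.csv':

```python
import io
import pandas as pd

# Hypothetical stand-in for 'data.csv': 100 rows built in memory
rows = "\n".join(f"{30 + i},{200 + i}" for i in range(100))
csv_text = "Duration,Calories\n" + rows
df = pd.read_csv(io.StringIO(csv_text))

short = str(df)        # the text that print(df) would show (shortened)
full = df.to_string()  # the text with every row

print(pd.options.display.max_rows)
print(len(short.splitlines()), len(full.splitlines()))
```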

Read JSON
• Big data sets are often stored or exchanged as JSON.
• JSON is plain text in an object-like format. It is widely used in programming, and Pandas can read it directly.
• In our examples we will be using a JSON file called 'data.json'.
• Example
• Load the JSON file into a DataFrame:
• import pandas as pd

df = pd.read_json('data.json')

print(df.to_string())
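• To try read_json() without a file on disk, the JSON text can be wrapped in io.StringIO. A minimal sketch, assuming a made-up JSON object with one entry per column:

```python
import io
import pandas as pd

# Hypothetical JSON content: one object per column, keyed by row label
json_text = '{"Duration": {"0": 60, "1": 45}, "Calories": {"0": 409.1, "1": 282.4}}'

df = pd.read_json(io.StringIO(json_text))

print(df.to_string())
```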
Analyzing DataFrames
• Viewing the Data
• One of the most used methods for getting a quick overview of a DataFrame is the head() method.
• The head() method returns the column headers and a specified number of rows, starting from the top. If no number is given, it returns 5 rows.
• Example
• Get a quick overview by printing the first 10 rows of the DataFrame:
• import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))
• Print the first 5 rows of the DataFrame:
• import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())
• Example
• Print the last 5 rows of the DataFrame:
• print(df.tail())
Cleaning Data
• Data cleaning means fixing bad data in your data set.
• Bad data could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
• Our Data Set
• The data set contains some empty cells ("Date" in row 22, and
"Calories" in row 18 and 28).
• The data set contains wrong format ("Date" in row 26).
• The data set contains wrong data ("Duration" in row 7).
• The data set contains duplicates (row 11 and 12).
Cleaning Empty Cells
• Example
• Return a new DataFrame with no empty cells (the original DataFrame is left unchanged):
• import pandas as pd

df = pd.read_csv('data3.csv')

new_df = df.dropna()

print(new_df.to_string())
• Example
• Remove all rows with NULL values:
• import pandas as pd

df = pd.read_csv('data3.csv')

df.dropna(inplace = True)

print(df.to_string())
• Example
• Replace NULL values with the number 130:
• import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)


• Example
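• Replace NULL values in the "Calories" column with the number 130:
• import pandas as pd

df = pd.read_csv('data.csv')

# Assign the result back to the column. Calling fillna(..., inplace=True)
# on a selected column may not update the DataFrame in recent Pandas versions.
df["Calories"] = df["Calories"].fillna(130)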
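• When several columns need different replacement values, fillna() also accepts a dictionary that maps column names to values. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data with one empty cell in each column
df = pd.DataFrame({
    "Duration": [60, None, 45],
    "Calories": [409.1, 300.0, None],
})

# One replacement value per column; fillna() returns a new DataFrame
filled = df.fillna({"Duration": 0, "Calories": 130})

print(filled)
```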


Replace Using Mean, Median, or Mode
• A common way to replace empty cells is to calculate the mean, median, or mode value of the column.
• Pandas uses the mean(), median(), and mode() methods to calculate the respective values for a specified column:
• Example
• Calculate the MEAN, and replace any empty values with it:
• import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x)


• Example
• Calculate the MEDIAN, and replace any empty values with it:
• import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x)


• Example
• Calculate the MODE, and replace any empty values with it:
• import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x)
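• The three statistics can give quite different replacement values when the column contains an outlier. The sketch below uses made-up calorie values, one of them unusually large, to compare them:

```python
import pandas as pd

# Hypothetical values: 1000 is an outlier, None is the empty cell
calories = pd.Series([300, 300, 320, 340, 1000, None])

mean = calories.mean()      # average of the non-empty values, pulled up by the outlier
median = calories.median()  # middle value when the non-empty values are sorted
mode = calories.mode()[0]   # most frequent value (mode() returns a Series, so take the first)

print(mean, median, mode)
```

• Here the mean is 452, the median is 320, and the mode is 300, so the median and mode are less affected by the single large value.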


Cleaning Data of Wrong Format
• Data of Wrong Format
• Cells with data of wrong format can make it difficult, or even impossible, to analyze
data.
• To fix it, you have two options: remove the rows, or convert all cells in the columns into
the same format.
• Convert Into a Correct Format
• In our DataFrame, two cells in the 'Date' column are problematic: row 22 is empty and row 26 uses a different format. Every value in the 'Date' column should be a string that represents a date:
• Example
• Convert to date:
• import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
• Example
• Remove rows with a NULL value in the "Date" column:
• df.dropna(subset=['Date'], inplace = True)
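• The two steps can be combined: convert what can be converted, and drop the rest. A minimal sketch with made-up 'Date' values, assuming that unparseable entries should be discarded rather than repaired:

```python
import pandas as pd

# Hypothetical 'Date' values: a valid date, an empty cell, and free text
df = pd.DataFrame({
    "Date": ["2020/12/01", None, "not a date"],
    "Duration": [60, 45, 30],
})

# errors="coerce" turns anything that cannot be parsed into NaT (a missing date)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Remove the rows whose date is missing after the conversion
df = df.dropna(subset=["Date"])

print(df)
```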
Fixing Wrong Data
• Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format"; it can simply be wrong, for example if someone registered "199" instead of "1.99".
• Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
• In our data set, the duration in row 7 is 450, while in all the other rows the duration is between 30 and 60.
• It doesn't have to be wrong, but since this is a data set of someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
• Replacing Values
• One way to fix wrong values is to replace them with something else.
• In the example, it is most likely a typo, and the value should be "45" instead of "450", and we could just
insert "45" in row 7:
• Example
• Set "Duration" = 45 in row 7:
• df.loc[7, 'Duration'] = 45
• Example
• Loop through all values in the "Duration" column.
• If the value is higher than 120, set it to 120:
• for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
• Removing Rows
• Another way of handling wrong data is to remove the rows that contain wrong data.
• This way you do not have to find out what to replace them with, and
there is a good chance you do not need them to do your analyses.
• Example
• Delete rows where "Duration" is higher than 120:
• for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
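• Both loops above can also be written without an explicit loop. This is an alternative to the loop-based examples, sketched here with made-up data: clip() caps the values, and a Boolean condition selects the rows to keep.

```python
import pandas as pd

# Hypothetical data: rows 1 and 3 have a Duration above 120
df = pd.DataFrame({
    "Duration": [60, 450, 45, 130],
    "Calories": [409.1, 479.0, 282.4, 300.0],
})

# Replacing: cap every value above 120 at 120
capped = df.copy()
capped["Duration"] = capped["Duration"].clip(upper=120)

# Removing: keep only the rows where Duration is at most 120
kept = df[df["Duration"] <= 120]

print(capped)
print(kept)
```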
Removing Duplicates
• By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row:
• Example
• Returns True for every row that is a duplicate, otherwise False:
• print(df.duplicated())
• Removing Duplicates
• To remove duplicates, use the drop_duplicates() method.
• Example
• Remove all duplicates:
• df.drop_duplicates(inplace = True)
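• By default, duplicated() does not flag the first occurrence of a repeated row, only the later copies. A small sketch with made-up data, where rows 0 and 1 are identical:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical
df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [300, 300, 250],
})

print(df.duplicated().tolist())            # only the later copy is flagged
print(df.duplicated(keep=False).tolist())  # every copy is flagged

# drop_duplicates() without inplace returns a new DataFrame
deduped = df.drop_duplicates()
print(deduped)
```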
Imputation Techniques (Handling Missing Data)
• In data preprocessing, missing values can cause problems in analysis
and modeling. Imputation is the process of filling in missing values
with estimated ones.
Checking for Missing Values
• Before imputing, we check for missing values in a DataFrame:

• import pandas as pd

# Sample dataset with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 30, 22],              # Missing age for Bob
        'Salary': [50000, 60000, None, 55000]}  # Missing salary for Charlie

df = pd.DataFrame(data)

# Check for missing values (one count per column)
print(df.isnull().sum())
Imputation Techniques in Pandas

• Mean, Median, and Mode Imputation (for Numerical Data)

• # Fill missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values in 'Salary' with the median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
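• Put together with the sample dataset, the imputation can be checked end to end. In this sketch the result of fillna() is assigned back to the column, and the missing-value count is printed again afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],              # Missing age for Bob
    'Salary': [50000, 60000, None, 55000],  # Missing salary for Charlie
})

# Mean of the known ages: (25 + 30 + 22) / 3
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median of the known salaries: middle of 50000, 55000, 60000
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

print(df)
print(df.isnull().sum())  # every count should now be 0
```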

Data Transformation
• Data transformation is a crucial step in data preprocessing, where
data is converted, modified, or reshaped to improve model
performance or analysis.
Scaling and Normalization
• Used to bring numerical values into a specific range for better model
performance.
Min-Max Scaling (Normalization)
• Brings values between 0 and 1.

• import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample dataset
df = pd.DataFrame({'Salary': [40000, 45000, 60000, 100000, 200000]})

# Apply Min-Max Scaling
scaler = MinMaxScaler()
df['Salary_Normalized'] = scaler.fit_transform(df[['Salary']])
print(df)
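• Min-max scaling maps each value x to (x - min) / (max - min). As a sketch of what the scaler computes, the same column can be scaled with plain Pandas arithmetic, without scikit-learn:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [40000, 45000, 60000, 100000, 200000]})

s = df['Salary']

# (x - min) / (max - min): the smallest value becomes 0, the largest becomes 1
df['Salary_Normalized'] = (s - s.min()) / (s.max() - s.min())

print(df)
```

• For this data the normalized values are 0, 0.03125, 0.125, 0.375 and 1.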
Standardization (Z-Score Scaling)

• Converts data to have mean = 0 and standard deviation = 1.

• from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['Salary_Standardized'] = scaler.fit_transform(df[['Salary']])
print(df)
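• The z-score of a value x is (x - mean) / standard deviation. A sketch of the same calculation in plain Pandas follows. Note that StandardScaler divides by the population standard deviation, while the Pandas std() method defaults to the sample standard deviation, so ddof=0 is passed to match:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [40000, 45000, 60000, 100000, 200000]})

s = df['Salary']

# ddof=0 selects the population standard deviation (the Pandas default is ddof=1)
df['Salary_Standardized'] = (s - s.mean()) / s.std(ddof=0)

print(df)
print(df['Salary_Standardized'].mean())       # approximately 0
print(df['Salary_Standardized'].std(ddof=0))  # approximately 1
```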

Data Integration and Manipulation
• Data integration and manipulation are essential for combining
multiple datasets, transforming data, and preparing it for analysis or
machine learning.
• Data Integration (Merging & Joining DataFrames)
• Combining multiple datasets is common in real-world scenarios.
Pandas provides functions like merge(), concat() and join() for this.

• import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})

# Inner Join (only matching rows)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
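• The how argument decides which rows survive the merge. The sketch below reuses the two sample DataFrames and compares an inner, a left, and an outer join. The indicator=True option adds a "_merge" column that records where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})

# Inner: only IDs present in both DataFrames
inner = pd.merge(df1, df2, on='ID', how='inner')

# Left: every row of df1; Salary is missing where df2 has no match
left = pd.merge(df1, df2, on='ID', how='left')

# Outer: every ID from either DataFrame, with the source of each row
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```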
Concatenating DataFrames
• Row-wise concatenation
• df3 = pd.DataFrame({'ID': [4, 5], 'Name': ['David', 'Eve'], 'Age': [40, 28]})
df_concat = pd.concat([df1, df3], ignore_index=True)
print(df_concat)
• Column-wise concatenation
• df4 = pd.DataFrame({'City': ['NY', 'LA', 'SF']})
df_col_concat = pd.concat([df1, df4], axis=1)
print(df_col_concat)
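• Column-wise concatenation lines rows up by their index labels, not by their position. A sketch with two made-up DataFrames whose indexes only partly overlap shows the effect:

```python
import pandas as pd

# Hypothetical frames: 'left' has labels 0, 1, 2 and 'right' has labels 1, 2, 3
left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'City': ['NY', 'LA', 'SF']}, index=[1, 2, 3])

# Rows are matched on the index label; labels found in only one frame get NaN
combined = pd.concat([left, right], axis=1)

print(combined)
```

• If the rows are meant to be matched by position instead, resetting the index of both DataFrames before concatenating is one option.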
Joining DataFrames (Index-based Merge)
• df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df_joined = df1.join(df2, how='inner')  # Similar to merge, but matches on the index
print(df_joined)
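• Because set_index(..., inplace=True) changes df1 and df2 permanently, later code that still expects an 'ID' column would fail. A sketch of a variant that leaves the original DataFrames untouched, using the same sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})

# set_index() without inplace returns new DataFrames, so df1 and df2 keep their 'ID' column
joined = df1.set_index('ID').join(df2.set_index('ID'), how='inner')

print(joined)
print('ID' in df1.columns)
```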

Data Manipulation
• Handling Missing Values
• df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill missing numeric values with the column mean
df.dropna(inplace=True)  # Remove rows with missing values
• Changing Data Types
• df['Age'] = df['Age'].astype(float)  # Convert Age column to float
• Renaming Columns
• df.rename(columns={'Name': 'Full_Name'}, inplace=True)

• Filtering & Selecting Data
• df_filtered = df[df['Age'] > 30]  # Select rows where Age > 30
• Grouping & Aggregation
• df.groupby('City')['Salary'].mean()  # Get average salary by city
df.pivot_table(values='Salary', index='City', aggfunc='sum')  # Pivot table
• Adding New Columns
• df['Bonus'] = df['Salary'] * 0.10  # Adding a 10% bonus column
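• The one-line snippets above assume a DataFrame with 'Age', 'City' and 'Salary' columns. A runnable sketch with made-up data (all names and values are hypothetical) ties them together:

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 35, 40, 22],
    'City': ['NY', 'LA', 'NY', 'LA'],
    'Salary': [50000, 60000, 70000, 55000],
})

# Filtering: rows where Age > 30
older = df[df['Age'] > 30]

# Grouping: average salary per city
avg_by_city = df.groupby('City')['Salary'].mean()

# Pivot table: total salary per city
total_by_city = df.pivot_table(values='Salary', index='City', aggfunc='sum')

# New column: 10% of the salary
df['Bonus'] = df['Salary'] * 0.10

print(older)
print(avg_by_city)
print(total_by_city)
print(df)
```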

Plotting
• Plotting
• Pandas uses the plot() method to create diagrams.
• We can use Pyplot, a submodule of the Matplotlib library to visualize the diagram on the screen.
• Example
• Import pyplot from Matplotlib and visualize our DataFrame:
• import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()

plt.show()
• Scatter Plot
• Specify that you want a scatter plot with the kind argument:
• kind = 'scatter'
• A scatter plot needs an x- and a y-axis.
• In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis.
• Include the x and y arguments like this:
• x = 'Duration', y = 'Calories'

• Example
• import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()
• Histogram
• Use the kind argument to specify that you want a histogram:
• kind = 'hist'
• A histogram needs only one column.
• A histogram shows us the frequency of each interval, e.g. how many workouts
lasted between 50 and 60 minutes?
• In the example below we will use the "Duration" column to create the histogram:
• Example
• df["Duration"].plot(kind = 'hist')
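• The numbers behind a histogram can also be computed without drawing anything. The sketch below uses numpy.histogram on made-up durations with hand-picked 10-minute intervals. The interval edges are an assumption for illustration; plot(kind = 'hist') chooses its own equal-width intervals unless told otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical workout durations in minutes
durations = pd.Series([30, 45, 45, 50, 55, 60, 60, 60])

# Intervals: 30 up to 40, 40 up to 50, and 50 up to 60 (the last one includes 60)
edges = [30, 40, 50, 60]
counts, _ = np.histogram(durations, bins=edges)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low}-{high} minutes: {n} workouts")
```

• With this data, 5 workouts fall into the 50 to 60 minute interval, which answers the question asked above.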
