Pandas
Pandas lets us analyze large data sets and draw conclusions based on statistical methods.
Pandas can also clean messy data sets and make them readable and relevant.
import pandas as pd
d = {'fruits': ["apple", "banana", "orange"], 'vegetables': ["tomato", "onion", "carrot"]}
pd.DataFrame(d)
fruits vegetables
0 apple tomato
1 banana onion
2 orange carrot
import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)
0 3
1 6
2 7
dtype: int64
Labels
import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)
print(s[0])
0 3
1 6
2 7
dtype: int64
3
s=pd.Series(a,index=["x","y","z"])
print(s)
x 3
y 6
z 7
dtype: int64
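With named labels in place, items can be looked up by label as well as by position. A quick sketch using the same Series as above:

```python
import pandas as pd

a = [3, 6, 7]
s = pd.Series(a, index=["x", "y", "z"])

# Label-based lookup instead of the integer position
print(s["y"])  # 6
```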
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(calories)
print(s)
day1 420
day2 380
day3 390
dtype: int64
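A dict-backed Series can also be filtered at creation time: passing `index` keeps only the listed keys. A small sketch reusing the calories data above:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys named in index are kept
s = pd.Series(calories, index=["day1", "day2"])
print(s)
```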
import pandas as pd
Family={"father":"Chand Basha","Mother":"Fathima","D1":"Farasha","D2":"Sana","D3":"Firoz"}
s=pd.Series(Family)
print(s)
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
import pandas as pd
data={
"Exam": ["Python","Java","Data Science"],
"Marks":[80,90,100]
}
mydata=pd.DataFrame(data)
print(mydata)
Exam Marks
0 Python 80
1 Java 90
2 Data Science 100
import pandas as pd
data={
"Sisters": ["Farasha","Sana"],
"Parents": ["Chand","Fathima"]
}
mydta=pd.DataFrame(data)
print(mydta)
Sisters Parents
0 Farasha Chand
1 Sana Fathima
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas use the loc attribute to return one or more specified row(s)
print(mydta.loc[0])
Sisters Farasha
Parents Chand
Name: 0, dtype: object
print(mydta.loc[1])
Sisters Sana
Parents Fathima
Name: 1, dtype: object
mydata["Rollno"] = [1, 2, 3]   # add the roll-number column shown in the output below
print(mydata.loc[0])
Exam Python
Marks 80
Rollno 1
Name: 0, dtype: object
print(mydata.loc[1])
Exam Java
Marks 90
Rollno 2
Name: 1, dtype: object
Example
import pandas as pd
data = {
"Rollno": [5801, 5802, 5803, 5804],
"Students": ["Pavan", "Kavya", "Firoz", "Manoj"]
}
df = pd.DataFrame(data)
print(df)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Firoz
3 5804 Manoj
print(df.loc[0])
Rollno 5801
Students Pavan
Name: 0, dtype: object
print(df.loc[[1,2]])
Rollno Students
1 5802 Kavya
2 5803 Firoz
Named Indexes
import pandas as pd
data = {
"Emcet marks": [420, 380, 390],
"Rank": [50, 40, 45]
}
# give each row a named index ("Shazi" appears below; the other two labels are placeholders)
df = pd.DataFrame(data, index=["Shazi", "Ravi", "Teja"])
print(df)
Use the named index in the loc attribute to return the specified row(s).
print(df.loc["Shazi"])
A simple way to store big data sets is to use CSV (comma separated values) files.
CSV files contain plain text in a well-known format that can be read by everyone, including Pandas.
When a DataFrame exceeds the max_rows display limit, printing it shows only the first 5 and last 5 rows:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
max_rows
You can check Pandas' display limit with the pd.options.display.max_rows option.
print(pd.options.display.max_rows)
60
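The limit can also be changed. A minimal sketch (10 is an arbitrary value; the default is restored afterwards):

```python
import pandas as pd

pd.options.display.max_rows = 10  # DataFrames longer than this are truncated when printed
print(pd.options.display.max_rows)  # 10

pd.options.display.max_rows = 60  # restore the default
```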
Missing Values
import pandas as pd
data = {
"Rollno": [5801, 5802, 5803, 5804],
"Students": ["Pavan", "Kavya", None, "Manoj"]
}
df = pd.DataFrame(data)
print(df)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj
print(df.isnull())
print(df.isnull().sum())
Rollno Students
0 False False
1 False False
2 False True
3 False False
Rollno 0
Students 1
dtype: int64
df_dropped = df.dropna()
print(df_dropped)
Rollno Students
0 5801 Pavan
1 5802 Kavya
3 5804 Manoj
Note: writing df.dropped = df.dropna() raises a UserWarning, because attribute assignment does not create a DataFrame column; use a plain variable instead.
df_filled = df.fillna(0)
print(df_filled)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 0
3 5804 Manoj
df_bfill = df.bfill()   # fillna(method="bfill") is deprecated; bfill() fills from the next valid row
print(df_bfill)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Manoj
3 5804 Manoj
df_filled_mean = df.fillna(df.mean(numeric_only=True))
print(df_filled_mean)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj
The only missing value here is in the non-numeric Students column, so the column means have nothing to fill and the output is unchanged.
MultiIndex
- Organized Data: It helps in organizing and structuring data in a more intuitive way.
- Efficient Data Slicing: You can slice and dice data across multiple dimensions easily.
- Enhanced Grouping: Grouping operations become more powerful and flexible.
- Clearer Analysis: Complex data analysis becomes more manageable and understandable.
Creating a MultiIndex
Let’s start by creating a MultiIndex. Assume we have yearly figures for cities in different countries.
Here’s how we can create a MultiIndex DataFrame:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples(
[('India', 'Delhi'), ('India', 'Mumbai'), ('USA', 'New York'), ('USA', 'LA')],
names=['Country', 'City']
)
s = pd.Series([100, 150, 200, 180], index=index)
print(s)
Country City
India Delhi 100
Mumbai 150
USA New York 200
LA 180
dtype: int64
The same index can also be built from parallel arrays with pd.MultiIndex.from_arrays:
arrays = [
['India', 'India', 'USA', 'USA'],
['Delhi', 'Mumbai', 'New York', 'LA']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Country', 'City'])
df = pd.DataFrame({
'2022': [100, 120, 200, 180],
'2023': [130, 140, 220, 190]
}, index=index)
print(df)
2022 2023
Country City
India Delhi 100 130
Mumbai 120 140
USA New York 200 220
LA 180 190
import pandas as pd
data = {
'State': ['Karnataka', 'Karnataka', 'Maharashtra', 'Maharashtra'],
'City': ['Bangalore', 'Mysore', 'Mumbai', 'Pune'],
'2022': [100, 120, 140, 160],
'2023': [110, 130, 150, 170]
}
df = pd.DataFrame(data)
df = df.set_index(['State', 'City'])
print(df)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
stacked = df.stack()
print(stacked)
State City
Karnataka Bangalore 2022 100
2023 110
Mysore 2022 120
2023 130
Maharashtra Mumbai 2022 140
2023 150
Pune 2022 160
2023 170
dtype: int64
unstacked = stacked.unstack()
print(unstacked)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
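unstack() moves the innermost index level into the columns by default, but any level can be chosen by name. A sketch on the same State/City data, unstacking the outer level instead:

```python
import pandas as pd

data = {
    'State': ['Karnataka', 'Karnataka', 'Maharashtra', 'Maharashtra'],
    'City': ['Bangalore', 'Mysore', 'Mumbai', 'Pune'],
    '2022': [100, 120, 140, 160],
    '2023': [110, 130, 150, 170]
}
df = pd.DataFrame(data).set_index(['State', 'City'])

# States become columns, Cities stay in the index;
# combinations that do not exist show up as NaN
print(df['2022'].unstack(level='State'))
```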
print(df)
swapped = df.swaplevel()
print(swapped)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
2022 2023
City State
Bangalore Karnataka 100 110
Mysore Karnataka 120 130
Mumbai Maharashtra 140 150
Pune Maharashtra 160 170
sorted_df = df.sort_index(level=0)
sorted_df2 = df.sort_index(level=1)
sorted_df3 = df.sort_index(level=[0, 1])
print(sorted_df3)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
print(df.xs('Karnataka'))
print(df.xs('Mumbai', level='City'))
print(df.loc[('Maharashtra', 'Pune'), '2022'])
2022 2023
City
Bangalore 100 110
Mysore 120 130
2022 2023
State
Maharashtra 140 150
160
df_reset = df.reset_index()
print(df_reset)
State City 2022 2023
0 Karnataka Bangalore 100 110
1 Karnataka Mysore 120 130
2 Maharashtra Mumbai 140 150
3 Maharashtra Pune 160 170
Concat()
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})
result = pd.concat([df1, df2])
print(result)
A B
0 1 x
1 2 y
0 3 z
1 4 w
result = pd.concat([df1, df2], axis=1)   # concatenate side by side
print(result)
A B A B
0 1 x 3 z
1 2 y 4 w
Merge
left = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
right = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [85, 90, 95]
})
merged = pd.merge(left, right, on='ID', how='inner')
print(merged)
ID Name Score
0 2 Bob 85
1 3 Charlie 90
merged_left = pd.merge(left, right, on='ID', how='left')
print(merged_left)
ID Name Score
0 1 Alice NaN
1 2 Bob 85.0
2 3 Charlie 90.0
merged_outer = pd.merge(left, right, on='ID', how='outer')
print(merged_outer)
ID Name Score
0 1 Alice NaN
1 2 Bob 85.0
2 3 Charlie 90.0
3 4 NaN 95.0
JOIN()
join() is a convenient method for combining columns of two DataFrames based on the index (by default).
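A minimal join() sketch (these two frames are illustrative, not from the cells below): join() aligns rows on the index, doing a left join by default.

```python
import pandas as pd

left = pd.DataFrame({'val1': ['a', 'b', 'c']}, index=[1, 2, 10])
right = pd.DataFrame({'val2': ['e', 'f', 'g']}, index=[1, 2, 9])

# Left join on the index by default: unmatched rows get NaN
print(left.join(right))

# how='inner' keeps only index values present in both frames
print(left.join(right, how='inner'))
```

Note the contrast with merge(): merge() matches on columns unless told otherwise, while join() matches on the index, and overlapping column names require lsuffix/rsuffix.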
import pandas as pd
d = {'id': [1, 2, 10, 12],
'val1': ['a', 'b', 'c', 'd']}
a = pd.DataFrame(d)
a
id val1
0 1 a
1 2 b
2 10 c
3 12 d
import pandas as pd
d = {'id': [1, 2, 9, 8],
'val2': ['e', 'f', 'g', 'h']}
b = pd.DataFrame(d)
b
id val2
0 1 e
1 2 f
2 9 g
3 8 h
An inner join is the most common type of join you’ll be working with. It returns a DataFrame with only those rows that have common keys, similar to the intersection of two sets.
df = pd.merge(a, b, on='id', how='inner')
df
id val1 val2
0 1 a e
1 2 b f
A full outer join returns all the rows from the left DataFrame and all the rows from the right DataFrame, matching rows where possible and filling the rest with NaNs. If every key appears in both DataFrames, the outer join gives the same output as an inner join.
df = pd.merge(a,b, on='id',how='outer')
df
id val1 val2
0 1 a e
1 2 b f
2 8 NaN h
3 9 NaN g
4 10 c NaN
5 12 d NaN
With a left outer join, all the records from the first DataFrame are kept, whether or not their keys appear in the second DataFrame. From the second DataFrame, only the records whose keys appear in the first DataFrame are included.
df = pd.merge(a,b, on='id',how='left')
df
id val1 val2
0 1 a e
1 2 b f
2 10 c NaN
3 12 d NaN
For a right join, all the records from the second DataFrame are kept. From the first DataFrame, only the records whose keys appear in the second DataFrame are included.
df = pd.merge(a,b, on='id',how='right')
df
id val1 val2
0 1 a e
1 2 b f
2 9 NaN g
3 8 NaN h
To merge the DataFrames on their indices, pass left_index=True and right_index=True; the DataFrames are then merged on the index using the default inner join. Since both frames have an id column, Pandas adds _x/_y suffixes.
df = pd.merge(a, b, left_index=True, right_index=True)
df
id_x val1 id_y val2
0 1 a 1 e
1 2 b 2 f
2 10 c 9 g
3 12 d 8 h
groupby()
The groupby() method allows you to group your data and execute functions on these groups.
import pandas as pd
data={
'Department':['HR','HR','IT','IT','Finance','Finance'],
'Employee':['A','B','C','D','E','F'],
'Salary':[1000,2000,3000,4000,5000,6000]
}
df=pd.DataFrame(data)
grouped=df.groupby('Department')
print(grouped['Salary'].sum())
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
Aggregate()
Aggregation is the process of combining multiple values into a single summary value. In Pandas, aggregation happens after grouping the data with groupby(). It is used to compute summary statistics such as sum, mean, and max:
result=grouped['Salary'].aggregate('sum')
print(result)
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
print(df.groupby('Department')['Salary'].sum())
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
result = grouped['Salary'].aggregate(['sum', 'mean', 'max'])
print(result)
sum mean max
Department
Finance 11000 5500.0 6000
HR 3000 1500.0 2000
IT 7000 3500.0 4000
df.groupby('Department').agg({'Salary':'sum','Employee':'count'})
Salary Employee
Department
Finance 11000 2
HR 3000 2
IT 7000 2
df.groupby('Department').agg({'Salary':['mean','min'],'Employee':['sum','max']})
Salary Employee
mean min sum max
Department
Finance 5500.0 5000 EF F
HR 1500.0 1000 AB B
IT 3500.0 3000 CD D
import pandas as pd
data = {
'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging',
'Radial Velocity', 'Microlensing', 'Transit', 'Imaging', 'Transit'],
'number': [1, 1, 1, 2, 1, 1, 1, 3, 2, 1],
'orbital_period': [269.3, 874.8, 1.5, 2.2, 4100.0, 763.0, 1000.5, 3.5, 2000.0, 1.0],
'mass': [7.10, 2.21, 0.02, 0.03, 5.00, 2.60, 3.40, 0.01, 6.50, 0.02],
'distance': [77.4, 56.95, 300.0, 150.5, 25.0, 19.84, 4000.0, 80.0, 32.0, 75.0],
'year': [2006, 2008, 2012, 2014, 2005, 2011, 2013, 2015, 2010, 2011]
}
df = pd.DataFrame(data)
print(df)
df.groupby('method')['mass'].mean()
mass
method
Imaging 5.75
Microlensing 3.40
Radial Velocity 3.97
Transit 0.02
dtype: float64
df.groupby('year')['number'].sum()
number
year
2005 1
2006 1
2008 1
2010 2
2011 2
2012 1
2013 1
2014 2
2015 3
dtype: int64
df.groupby(['method','year']).size().unstack(fill_value=0)
year 2005 2006 2008 2010 2011 2012 2013 2014 2015
method
Imaging 1 0 0 1 0 0 0 0 0
Microlensing 0 0 0 0 0 0 1 0 0
Radial Velocity 0 1 1 0 1 0 0 0 0
Transit 0 0 0 0 1 1 0 1 1
df.groupby('method')['distance'].mean()
distance
method
Imaging 28.500000
Microlensing 4000.000000
Radial Velocity 51.396667
Transit 151.375000
dtype: float64
df.groupby('method')['distance'].agg(lambda x: x.max() - x.min())   # spread (max - min) per method
distance
method
Imaging 7.00
Microlensing 0.00
Radial Velocity 57.56
Transit 225.00
dtype: float64
df.groupby('method').filter(lambda x: len(x) > 2)
This keeps only the Radial Velocity and Transit rows, the only methods with more than two records.
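Alongside filter(), groupby() also offers transform(), which returns a result aligned with the original rows rather than one row per group. A sketch on the salary data from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Salary': [1000, 2000, 3000, 4000, 5000, 6000]
})

# Broadcast each department's mean salary back onto every row
df['Dept_mean'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
```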
Pivot table
import pandas as pd
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
df
table = pd.pivot_table(df, index=['A', 'B'])   # default aggfunc is 'mean' on the numeric column C
table
C
A B
Boby Graduate 23.0
John Masters 27.0
Mina Graduate 21.0
Nicky Graduate 24.0
Peter Masters 23.0
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='A', index='C', columns='B', aggfunc='sum')   # 'sum' on strings concatenates the names
print(table)
B Graduate Masters
C
21 Mina NaN
23 Boby Peter
24 Nicky NaN
27 NaN John
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='C', index=['A', 'B'], aggfunc='mean', margins=True)
table
C
A B
Boby Graduate 23.0
John Masters 27.0
Mina Graduate 21.0
Nicky Graduate 24.0
Peter Masters 23.0
All 23.6
import pandas as pd
# Assume a sales DataFrame df with columns Product, Category, Quantity and Amount
# (the cell that creates it is missing); its last rows looked like:
# 6 Broccoli Vegetable 11 62
# 7 Banana Fruit 8 90
pivot = df.pivot_table(index=['Product'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Product
Banana 1091
Beans 626
Broccoli 301
Carrots 270
Orange 610
pivot = df.pivot_table(index=['Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Category
Fruit 1701
Vegetable 1197
pivot = df.pivot_table(index=['Product', 'Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Product Category
Banana Fruit 1091
Beans Vegetable 626
Broccoli Vegetable 301
Carrots Vegetable 270
Orange Fruit 610
pivot = df.pivot_table(index=['Category'],
aggfunc={'Amount': ['mean', 'median', 'min']})
print(pivot)
Amount
mean median min
Category
Fruit 425.25 497.0 90
Vegetable 299.25 254.5 62
In Pandas, you can access string methods using the .str accessor on a Series. Here's a clear overview with examples:
import pandas as pd
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'city': ['New York', 'los angeles', 'Chicago', 'Houston', 'PHOENIX']
}
df = pd.DataFrame(data)
1. Case Conversion
df['name'].str.lower()
name
0 alice
1 bob
2 charlie
3 david
4 eva
dtype: object
df['city'].str.upper()
city
0 NEW YORK
1 LOS ANGELES
2 CHICAGO
3 HOUSTON
4 PHOENIX
dtype: object
df['name'].str.title()
name
0 Alice
1 Bob
2 Charlie
3 David
4 Eva
dtype: object
2. String Matching
contains()
df['name'].str.contains('o')
name
0 False
1 True
2 False
3 False
4 False
dtype: bool
startswith()
df['name'].str.startswith('A')
name
0 True
1 False
2 False
3 False
4 False
dtype: bool
endswith()
df['city'].str.endswith('a')
city
0 False
1 False
2 False
3 False
4 False
dtype: bool
df['name'].str.match('A.*')
name
0 True
1 False
2 False
3 False
4 False
dtype: bool
3. String Replacement
df['name'].str.replace('a','A')
name
0 Alice
1 Bob
2 ChArlie
3 DAvid
4 EvA
dtype: object
4. String Slicing
df['name'].str[0:4]
name
0 Alic
1 Bob
2 Char
3 Davi
4 Eva
dtype: object
df['name'].str.slice(0, 3)
name
0 Ali
1 Bob
2 Cha
3 Dav
4 Eva
dtype: object
5. String Length
df['city'].str.len()
city
0 8
1 11
2 7
3 7
4 7
dtype: int64
6. String Splitting
df['name'].str.split()
name
0 [Alice]
1 [Bob]
2 [Charlie]
3 [David]
4 [Eva]
dtype: object
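With expand=True, split() returns a DataFrame with one column per piece instead of a list per row; a sketch on a small city Series (splitting on the space character):

```python
import pandas as pd

city = pd.Series(['New York', 'los angeles', 'Chicago'])

# One column per part; rows with fewer parts are padded with None
print(city.str.split(' ', expand=True))
```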