
keyboard_arrow_down Pandas

Pandas allows us to analyze big data and make conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

import pandas as pd

d={'fruits':["apple","banana","orange"],'vegetables':["tomato","onion","carrot"]}
df=pd.DataFrame(d)
print(df)

fruits vegetables
0 apple tomato
1 banana onion
2 orange carrot

keyboard_arrow_down Pandas Series


A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)

0 3
1 6
2 7
dtype: int64

keyboard_arrow_down Labels
import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)
print(s[0])

0 3
1 6
2 7
dtype: int64
3

s=pd.Series(a,index=["x","y","z"])
print(s)

x 3
y 6
z 7
dtype: int64
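Once labels are set, values can be retrieved by label as well as by position. A small sketch (the variable names mirror the example above):

```python
import pandas as pd

a = [3, 6, 7]
s = pd.Series(a, index=["x", "y", "z"])

# Retrieve values by their labels
print(s["y"])      # 6
print(s.loc["z"])  # 7
# Position-based access still works through iloc
print(s.iloc[0])   # 3
```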

Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when creating a Series.

import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(calories)
print(s)

day1 420
day2 380
day3 390
dtype: int64
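To include only part of the dictionary in the Series, pass an index argument listing the keys you want (a sketch based on the calories example above):

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Pass an index list to keep only some of the dictionary keys
s = pd.Series(calories, index=["day1", "day2"])
print(s)
```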

import pandas as pd
Family={"father":"Chand Basha","Mother":"Fathima","D1":"Farasha","D2":"Sana","D3":"Firoz"}
s=pd.Series(Family)
print(s)

father Chand Basha
Mother Fathima
D1 Farasha
D2 Sana
D3 Firoz
dtype: object

keyboard_arrow_down DataFrames
Data sets in Pandas are usually two-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

import pandas as pd
data={
"Exam": ["Python","Java","Data Science"],
"Marks":[80,90,100]
}
mydata=pd.DataFrame(data)
print(mydata)

Exam Marks
0 Python 80
1 Java 90
2 Data Science 100

import pandas as pd
data={
"Sisters": ["Farasha","Sana"],
"Parents": ["Chand","Fathima"]
}
mydta=pd.DataFrame(data)
print(mydta)

Sisters Parents
0 Farasha Chand
1 Sana Fathima

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

calories duration
0 420 50
1 380 40
2 390 45

Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

print(mydta.loc[0])

Sisters Farasha
Parents Chand
Name: 0, dtype: object

print(mydta.loc[1])

Sisters Sana
Parents Fathima
Name: 1, dtype: object

print(mydata.loc[0])

Exam Python
Marks 80
Name: 0, dtype: object

print(mydata.loc[1])

Exam Java
Marks 90
Name: 1, dtype: object

Example

import pandas as pd
data={
"Rollno":[5801, 5802, 5803, 5804],
"Students":["Pavan", "Kavya", "Firoz", "Manoj"]
}
df=pd.DataFrame(data)
print(df)

Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Firoz
3 5804 Manoj

print(df.loc[0])

Rollno 5801
Students Pavan
Name: 0, dtype: object

print(df.loc[[1,2]])

Rollno Students
1 5802 Kavya
2 5803 Firoz
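Alongside label-based loc, pandas also offers iloc for position-based selection. A brief sketch using the same DataFrame:

```python
import pandas as pd

data = {
    "Rollno": [5801, 5802, 5803, 5804],
    "Students": ["Pavan", "Kavya", "Firoz", "Manoj"]
}
df = pd.DataFrame(data)

# iloc selects by integer position rather than by label
print(df.iloc[0])    # first row
print(df.iloc[1:3])  # rows at positions 1 and 2
```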
Named Indexes

import pandas as pd

data = {
"Emcet marks": [420, 380, 390],
"Rank": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["Fahad", "Shazi", "Sania"])

print(df)

Emcet marks Rank
Fahad 420 50
Shazi 380 40
Sania 390 45

Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

print(df.loc["Shazi"])

Emcet marks 380
Rank 40
Name: Shazi, dtype: int64

keyboard_arrow_down Load Files Into a DataFrame


Read CSV Files

A simple way to store big data sets is to use CSV files (comma separated values).

CSV files contain plain text in a well-known format that can be read by everyone, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

from google.colab import files


uploaded = files.upload()

Saving data.csv to data.csv

If you have a large DataFrame with many rows, Pandas will only print the first 5 rows and the last 5 rows:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.0
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]
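To inspect just the beginning or end of a DataFrame explicitly, head() and tail() can be used. A small sketch with a synthetic frame standing in for data.csv (the values here are illustrative):

```python
import pandas as pd

# Synthetic stand-in for data.csv
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 75],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0, 300.0, 330.4]
})

print(df.head())   # first 5 rows by default
print(df.tail(2))  # last 2 rows
```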


Use to_string() to print the entire DataFrame.

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())

[rows 0-110 omitted in this excerpt]
111 45 107 124 275.0
112 15 124 139 124.2
113 45 100 120 225.3
114 60 108 131 367.6
115 60 108 151 351.7
116 60 116 141 443.0
117 60 97 122 277.4
118 60 105 125 NaN
119 60 103 124 332.7
120 30 112 137 193.9
121 45 100 120 100.7
122 60 119 169 336.7
123 60 107 127 344.9
124 60 111 151 368.5
125 60 98 122 271.0
126 60 97 124 275.3
127 60 109 127 382.0
128 90 99 125 466.4
129 60 114 151 384.0
130 60 104 134 342.5
131 60 107 138 357.5
132 60 103 133 335.0
133 60 106 132 327.5
134 60 103 136 339.0
135 20 136 156 189.0
136 45 117 143 317.7
137 45 115 137 318.0
138 45 113 138 308.0
139 20 141 162 222.4
140 60 108 135 390.0
141 60 97 127 NaN
142 45 100 120 250.4
143 45 122 149 335.4
144 60 136 170 470.2
145 45 106 126 270.8
146 60 107 136 400.0
147 60 112 146 361.9
148 30 103 127 185.0
149 60 110 150 409.4
150 60 106 134 343.0
151 60 109 129 353.2
152 60 109 138 374.0
153 30 150 167 275.8
154 60 105 128 328.0
155 60 111 151 368.5
156 60 97 131 270.4
157 60 100 120 270.4
158 60 114 150 382.8
159 30 80 120 240.9
160 30 85 120 250.4
161 45 90 130 260.4
162 45 95 130 270.0
163 45 100 140 280.9
164 60 105 140 290.8
165 60 110 145 300.0
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

max_rows

The number of rows returned is defined in Pandas option settings.

You can check your system's maximum rows with pd.options.display.max_rows.

print(pd.options.display.max_rows)
60
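The setting can also be changed, which controls when pandas truncates printed output (a sketch; the value 10 is arbitrary):

```python
import pandas as pd

print(pd.options.display.max_rows)  # typically 60 by default

# Lower the truncation threshold for printed DataFrames
pd.options.display.max_rows = 10
print(pd.options.display.max_rows)
```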

keyboard_arrow_down Handling Missing Data


import pandas as pd
data={
"Rollno":[5801, 5802, 5803, 5804],
"Students":["Pavan", "Kavya", None, "Manoj"]
}
df=pd.DataFrame(data)
print(df)

Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj

1. Check for Missing Values

print(df.isnull())
print(df.isnull().sum())

Rollno Students
0 False False
1 False False
2 False True
3 False False
Rollno 0
Students 1
dtype: int64

Attribute-style assignment (df.dropped = ...) does not create a column or a copy and triggers a UserWarning; use an ordinary variable instead.

df_dropped=df.dropna()
print(df_dropped)

Rollno Students
0 5801 Pavan
1 5802 Kavya
3 5804 Manoj

df_filled=df.fillna(0)
print(df_filled)

Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 0
3 5804 Manoj

fillna(method="bfill") is deprecated in recent pandas; bfill() is the current equivalent.

df_bfill=df.bfill()
print(df_bfill)

Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Manoj
3 5804 Manoj

df_filled_mean=df.fillna(df.mean(numeric_only=True))
print(df_filled_mean)

Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj
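Note that df.mean(numeric_only=True) produces replacement values only for numeric columns, which is why Students stays None above. Object columns can be filled separately, e.g. with a placeholder string (a sketch; the placeholder "Unknown" is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Rollno": [5801, 5802, 5803, 5804],
    "Students": ["Pavan", "Kavya", None, "Manoj"]
})

# Numeric columns: fill with the column mean; text columns: fill with a placeholder
df_filled = df.fillna(df.mean(numeric_only=True))
df_filled["Students"] = df_filled["Students"].fillna("Unknown")
print(df_filled)
```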

keyboard_arrow_down Hierarchical Indexes


Hierarchical indexing, also known as multi-indexing, means setting more than one column as the index.

Why Use Hierarchical Indexing? Hierarchical indexing offers several advantages:

Organized Data: it helps in organizing and structuring data in a more intuitive way.
Efficient Data Slicing: you can slice and dice data across multiple dimensions easily.
Enhanced Grouping: grouping operations become more powerful and flexible.
Clearer Analysis: complex data analysis becomes more manageable and understandable.

Creating a MultiIndex

Let’s start by creating a MultiIndex. Assume we have data on cities in different countries.
Here’s how we can create a MultiIndex Series:

import pandas as pd
import numpy as np

index = pd.MultiIndex.from_tuples(
[('India', 'Delhi'), ('India', 'Mumbai'), ('USA', 'New York'), ('USA', 'LA')],
names=['Country', 'City']
)

data = pd.Series([100, 150, 200, 180], index=index)


print(data)

Country City
India Delhi 100
Mumbai 150
USA New York 200
LA 180
dtype: int64
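With a MultiIndex in place, selecting the outer level returns all inner entries for it (a sketch using the Series above):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('India', 'Delhi'), ('India', 'Mumbai'), ('USA', 'New York'), ('USA', 'LA')],
    names=['Country', 'City']
)
data = pd.Series([100, 150, 200, 180], index=index)

# Partial indexing on the outer level returns the inner Series
print(data['India'])              # Delhi and Mumbai rows
# A full tuple selects a single value
print(data[('USA', 'New York')])  # 200
```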

arrays = [
['India', 'India', 'USA', 'USA'],
['Delhi', 'Mumbai', 'New York', 'LA']
]

index = pd.MultiIndex.from_arrays(arrays, names=('Country', 'City'))

df = pd.DataFrame({
'2022': [100, 120, 200, 180],
'2023': [130, 140, 220, 190]
}, index=index)

print(df)

2022 2023
Country City
India Delhi 100 130
Mumbai 120 140
USA New York 200 220
LA 180 190

Stacking and Unstacking

import pandas as pd

data = {
'State': ['Karnataka', 'Karnataka', 'Maharashtra', 'Maharashtra'],
'City': ['Bangalore', 'Mysore', 'Mumbai', 'Pune'],
'2022': [100, 120, 140, 160],
'2023': [110, 130, 150, 170]
}

df = pd.DataFrame(data)
df = df.set_index(['State', 'City'])
print(df)

2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170

stacked = df.stack()
print(stacked)

State City
Karnataka Bangalore 2022 100
2023 110
Mysore 2022 120
2023 130
Maharashtra Mumbai 2022 140
2023 150
Pune 2022 160
2023 170
dtype: int64

unstacked = stacked.unstack()
print(unstacked)

2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170

Swapping index levels

print(df)
swapped = df.swaplevel()
print(swapped)

2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
2022 2023
City State
Bangalore Karnataka 100 110
Mysore Karnataka 120 130
Mumbai Maharashtra 140 150
Pune Maharashtra 160 170

Sorting Index Levels

sorted_df = df.sort_index(level=0)
sorted_df2 = df.sort_index(level=1)
sorted_df3 = df.sort_index(level=[0, 1])
print(sorted_df3)

2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170

Indexing with .xs() (cross-section)

print(df.xs('Karnataka'))
print(df.xs('Mumbai', level='City'))
print(df.loc[('Maharashtra', 'Pune'), '2022'])

2022 2023
City
Bangalore 100 110
Mysore 120 130
2022 2023
State
Maharashtra 140 150
160

Set Index with Multiple Columns

df_reset = df.reset_index()
print(df_reset)

State City 2022 2023


0 Karnataka Bangalore 100 110
1 Karnataka Mysore 120 130
2 Maharashtra Mumbai 140 150
3 Maharashtra Pune 160 170

df_multi = df_reset.set_index(['State', 'City'])


print(df_multi)

2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170

Concat()

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})
result = pd.concat([df1, df2])
print(result)

A B
0 1 x
1 2 y
0 3 z
1 4 w
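The duplicated 0/1 index in the result above can be avoided with ignore_index=True (a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})

# ignore_index=True renumbers rows 0..n-1 in the result
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```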

pd.concat([df1, df2], axis=1)

A B A B

0 1 x 3 z

1 2 y 4 w

Merge

left = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

right = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [85, 90, 95]
})
merged = pd.merge(left, right, on='ID', how='inner')
print(merged)

ID Name Score
0 2 Bob 85
1 3 Charlie 90

pd.merge(left, right, on='ID', how='left')

ID Name Score

0 1 Alice NaN

1 2 Bob 85.0

2 3 Charlie 90.0

pd.merge(left, right, on='ID', how='outer')

ID Name Score

0 1 Alice NaN

1 2 Bob 85.0

2 3 Charlie 90.0

3 4 NaN 95.0
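To see which side each row came from in an outer merge, merge() accepts indicator=True (a sketch using the same left/right frames):

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})

# indicator=True adds a _merge column: left_only / right_only / both
merged = pd.merge(left, right, on='ID', how='outer', indicator=True)
print(merged)
```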

keyboard_arrow_down JOIN()
join() is a convenient method for combining columns of two DataFrames based on the index (by default).

It works similarly to SQL joins (left, right, outer, inner).

It’s a shortcut to merge() when joining on the index.
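A quick sketch makes the index-based behavior of join() concrete (the frames and column names here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'val1': ['a', 'b', 'c']}, index=[1, 2, 3])
right = pd.DataFrame({'val2': ['e', 'f', 'g']}, index=[2, 3, 4])

# join() aligns on the index; the default is a left join
joined = left.join(right)
print(joined)

# how='inner' keeps only indices present in both frames
print(left.join(right, how='inner'))
```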

import pandas as pd
a = pd.DataFrame()
d = {'id': [1, 2, 10, 12],
'val1': ['a', 'b', 'c', 'd']}
a = pd.DataFrame(d)
a

id val1

0 1 a

1 2 b

2 10 c

3 12 d

import pandas as pd
b=pd.DataFrame()
d = {'id' : [1,2,9,8],
'val2': ['e','f','g','h']}
b=pd.DataFrame(d)
b

id val2

0 1 e

1 2 f

2 9 g

3 8 h

keyboard_arrow_down Types of Joins in Pandas


We will use these two Dataframes to understand the different types of joins.

Pandas Inner Join

Inner join is the most common type of join you’ll be working with. It returns a Dataframe with only those rows that have common
characteristics. This is similar to the intersection of two sets.

df = pd.merge(a, b, on='id', how='inner')


df

id val1 val2

0 1 a e

1 2 b f


Pandas Full Outer Join

A full outer join returns all the rows from the left Dataframe, and all the rows from the right Dataframe, and matches up rows where
possible, with NaNs elsewhere. But if the Dataframe is complete, then we get the same output.

df = pd.merge(a,b, on='id',how='outer')
df

id val1 val2

0 1 a e

1 2 b f

2 8 NaN h

3 9 NaN g

4 10 c NaN

5 12 d NaN

Pandas Left Join

With a left outer join, all the records from the first Dataframe will be displayed, irrespective of whether the keys in the first Dataframe
can be found in the second Dataframe. Whereas, for the second Dataframe, only the records with the keys in the second Dataframe
that can be found in the first Dataframe will be displayed.

df = pd.merge(a,b, on='id',how='left')
df
id val1 val2

0 1 a e

1 2 b f

2 10 c NaN

3 12 d NaN

Pandas Right Outer Join

For a right join, all the records from the second Dataframe will be displayed. However, only the records with the keys in the first
Dataframe that can be found in the second Dataframe will be displayed.

df = pd.merge(a,b, on='id',how='right')
df

id val1 val2

0 1 a e

1 2 b f

2 9 NaN g

3 8 NaN h

Pandas Index Join

To merge the DataFrames on their indices, pass left_index=True and right_index=True; both DataFrames are then merged on the index using the default inner join.

df = pd.merge(a,b, right_index=True, left_index=True)


df

id_x val1 id_y val2

0 1 a 1 e

1 2 b 2 f

2 10 c 9 g

3 12 d 8 h

keyboard_arrow_down groupby()
The groupby() method allows you to group your data and execute functions on these groups.

import pandas as pd
data={
'Department':['HR','HR','IT','IT','Finance','Finance'],
'Employee':['A','B','C','D','E','F'],
'Salary':[1000,2000,3000,4000,5000,6000]
}
df=pd.DataFrame(data)
grouped=df.groupby('Department')
print(grouped['Salary'].sum())

Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64

Aggregate()
Aggregation is the process of combining multiple values into a single summary value. In Pandas, aggregation happens after
grouping the data using groupby(). It is used to compute summary statistics such as sum, mean, min, max, and count.

result=grouped['Salary'].aggregate('sum')
print(result)

Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64

print(df.groupby('Department')['Salary'].sum())

Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64

Multiple Aggregate Functions

result=grouped['Salary'].aggregate(['sum','mean','max'])
print(result)

sum mean max
Department
Finance 11000 5500.0 6000
HR 3000 1500.0 2000
IT 7000 3500.0 4000

Aggregation on Multiple Columns

df.groupby('Department').agg({'Salary':'sum','Employee':'count'})

Salary Employee

Department

Finance 11000 2

HR 3000 2

IT 7000 2

df.groupby('Department').agg({'Salary':['mean','min'],'Employee':['sum','max']})

Salary Employee

mean min sum max

Department

Finance 5500.0 5000 EF F

HR 1500.0 1000 AB B

IT 3500.0 3000 CD D
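Named aggregation gives the output columns readable names directly (a sketch with the same Department data; the output column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Salary': [1000, 2000, 3000, 4000, 5000, 6000]
})

# Each keyword becomes an output column: (source column, function)
result = df.groupby('Department').agg(
    total_salary=('Salary', 'sum'),
    avg_salary=('Salary', 'mean'),
    headcount=('Employee', 'count')
)
print(result)
```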

Custom Aggregation Functions

Create a planet Dataset

import pandas as pd

data = {
'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging',
'Radial Velocity', 'Microlensing', 'Transit', 'Imaging', 'Transit'],
'number': [1, 1, 1, 2, 1, 1, 1, 3, 2, 1],
'orbital_period': [269.3, 874.8, 1.5, 2.2, 4100.0, 763.0, 1000.5, 3.5, 2000.0, 1.0],
'mass': [7.10, 2.21, 0.02, 0.03, 5.00, 2.60, 3.40, 0.01, 6.50, 0.02],
'distance': [77.4, 56.95, 300.0, 150.5, 25.0, 19.84, 4000.0, 80.0, 32.0, 75.0],
'year': [2006, 2008, 2012, 2014, 2005, 2011, 2013, 2015, 2010, 2011]
}

df = pd.DataFrame(data)
print(df)

method number orbital_period mass distance year
0 Radial Velocity 1 269.3 7.10 77.40 2006
1 Radial Velocity 1 874.8 2.21 56.95 2008
2 Transit 1 1.5 0.02 300.00 2012
3 Transit 2 2.2 0.03 150.50 2014
4 Imaging 1 4100.0 5.00 25.00 2005
5 Radial Velocity 1 763.0 2.60 19.84 2011
6 Microlensing 1 1000.5 3.40 4000.00 2013
7 Transit 3 3.5 0.01 80.00 2015
8 Imaging 2 2000.0 6.50 32.00 2010
9 Transit 1 1.0 0.02 75.00 2011

df.groupby('method')['mass'].mean()

mass

method

Imaging 5.75

Microlensing 3.40

Radial Velocity 3.97

Transit 0.02

dtype: float64

df.groupby('method')['mass'].aggregate(['count','mean', 'min', 'max'])

count mean min max

method

Imaging 2 5.75 5.00 6.50

Microlensing 1 3.40 3.40 3.40

Radial Velocity 3 3.97 2.21 7.10

Transit 4 0.02 0.01 0.03

df.groupby('year')['number'].sum()

number

year

2005 1

2006 1

2008 1

2010 2

2011 2

2012 1

2013 1

2014 2

2015 3

dtype: int64
df.groupby(['method','year']).size().unstack(fill_value=0)

year 2005 2006 2008 2010 2011 2012 2013 2014 2015

method

Imaging 1 0 0 1 0 0 0 0 0

Microlensing 0 0 0 0 0 0 1 0 0

Radial Velocity 0 1 1 0 1 0 0 0 0

Transit 0 0 0 0 1 1 0 1 1

df.groupby('method')['distance'].mean()

distance

method

Imaging 28.500000

Microlensing 4000.000000

Radial Velocity 51.396667

Transit 151.375000

dtype: float64

df.groupby('method')['distance'].aggregate(lambda x: x.max() - x.min())

distance

method

Imaging 7.00

Microlensing 0.00

Radial Velocity 57.56

Transit 225.00

dtype: float64

df.groupby('method').filter(lambda x: len(x)>2)

method number orbital_period mass distance year

0 Radial Velocity 1 269.3 7.10 77.40 2006

1 Radial Velocity 1 874.8 2.21 56.95 2008

2 Transit 1 1.5 0.02 300.00 2012

3 Transit 2 2.2 0.03 150.50 2014

5 Radial Velocity 1 763.0 2.60 19.84 2011

7 Transit 3 3.5 0.01 80.00 2015

9 Transit 1 1.0 0.02 75.00 2011

Pivot table
import pandas as pd
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})

df
table = pd.pivot_table(df, index=['A', 'B'])
table

                 C
A     B
Boby  Graduate  23.0
John  Masters   27.0
Mina  Graduate  21.0
Nicky Graduate  24.0
Peter Masters   23.0

import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='A', index='C', columns='B', aggfunc='sum')
print(table)

B Graduate Masters
C
21 Mina NaN
23 Boby Peter
24 Nicky NaN
27 NaN John

import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='C', index=['A', 'B'], aggfunc='mean', margins=True)
table

                 C
A     B
Boby  Graduate  23.0
John  Masters   27.0
Mina  Graduate  21.0
Nicky Graduate  24.0
Peter Masters   23.0
All             23.6

import pandas as pd

df = pd.DataFrame({'Product': ['Carrots', 'Broccoli', 'Banana', 'Banana',
                               'Beans', 'Orange', 'Broccoli', 'Banana'],
                   'Category': ['Vegetable', 'Vegetable', 'Fruit', 'Fruit',
                                'Vegetable', 'Fruit', 'Vegetable', 'Fruit'],
                   'Quantity': [8, 5, 3, 4, 5, 9, 11, 8],
                   'Amount': [270, 239, 617, 384, 626, 610, 62, 90]})
df
df

Product Category Quantity Amount

0 Carrots Vegetable 8 270

1 Broccoli Vegetable 5 239

2 Banana Fruit 3 617

3 Banana Fruit 4 384

4 Beans Vegetable 5 626

5 Orange Fruit 9 610

6 Broccoli Vegetable 11 62

7 Banana Fruit 8 90

pivot = df.pivot_table(index=['Product'],
values=['Amount'],
aggfunc='sum')
print(pivot)

Amount
Product
Banana 1091
Beans 626
Broccoli 301
Carrots 270
Orange 610

pivot = df.pivot_table(index=['Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)

Amount
Category
Fruit 1701
Vegetable 1197

pivot = df.pivot_table(index=['Product', 'Category'],
                       values=['Amount'], aggfunc='sum')
print(pivot)
print(pivot)

Amount
Product Category
Banana Fruit 1091
Beans Vegetable 626
Broccoli Vegetable 301
Carrots Vegetable 270
Orange Fruit 610

pivot = df.pivot_table(index=['Category'], values=['Amount'],
                       aggfunc={'median', 'mean', 'min'})
print(pivot)

Amount
mean median min
Category
Fruit 425.25 497.0 90
Vegetable 299.25 254.5 62


keyboard_arrow_down Vectorized String operations


Vectorized string operations in Pandas are powerful and efficient because they are optimized for performance and operate element-wise on entire columns (i.e., Series) of string values without using loops.

In Pandas, you can access string methods using the .str accessor on a Series. Here's a clear overview with examples:

import pandas as pd

data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'city': ['New York', 'los angeles', 'Chicago', 'Houston', 'PHOENIX']
}

df = pd.DataFrame(data)

1. Case Conversion

df['name'].str.lower()

name

0 alice

1 bob

2 charlie

3 david

4 eva

dtype: object

df['city'].str.upper()

city

0 NEW YORK

1 LOS ANGELES

2 CHICAGO

3 HOUSTON

4 PHOENIX

dtype: object

df['name'].str.title()

name

0 Alice

1 Bob

2 Charlie

3 David

4 Eva

dtype: object

2. String Matching and Searching

contains()

df['name'].str.contains('o')
name

0 False

1 True

2 False

3 False

4 False

dtype: bool

startswith()

df['name'].str.startswith('A')

name

0 True

1 False

2 False

3 False

4 False

dtype: bool

endswith()

df['city'].str.endswith('a')

city

0 False

1 False

2 False

3 False

4 False

dtype: bool

df['name'].str.match('A.*')

name

0 True

1 False

2 False

3 False

4 False

dtype: bool

3. String Replacement

df['name'].str.replace('a','A')

name
0 Alice
1 Bob
2 ChArlie
3 DAvid
4 EvA
dtype: object

4. Slicing and Length

df['name'].str[0:4]

name

0 Alic

1 Bob

2 Char

3 Davi

4 Eva

dtype: object

df['name'].str.slice(0, 3)

name

0 Ali

1 Bob

2 Cha

3 Dav

4 Eva

dtype: object

df['city'].str.len()

city

0 8

1 11

2 7

3 7

4 7

dtype: int64

df['name'].str.split()

name

0 [Alice]

1 [Bob]

2 [Charlie]

3 [David]

4 [Eva]

dtype: object
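split() with expand=True returns a DataFrame of the pieces, which is handy for multi-word values like the city column (a sketch; single-word cities get a missing value in the second column):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'los angeles', 'Chicago', 'Houston', 'PHOENIX']
})

# expand=True spreads the split parts across columns
parts = df['city'].str.split(expand=True)
print(parts)
```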
