Pandas
Pandas lets us analyze large data sets and draw conclusions based on statistical methods.
Pandas can also clean messy data sets and make them readable and relevant.
import pandas as pd
d = {'fruits': ["apple", "banana", "orange"], 'vegetables': ["tomato", "onion", "carrot"]}
pd.DataFrame(d)
fruits vegetables
0 apple tomato
1 banana onion
2 orange carrot
import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)
0 3
1 6
2 7
dtype: int64
Labels
import pandas as pd
a=[3,6,7]
s=pd.Series(a)
print(s)
print(s[0])
0 3
1 6
2 7
dtype: int64
3
s=pd.Series(a,index=["x","y","z"])
print(s)
x 3
y 6
z 7
dtype: int64
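With named labels in place, items can be looked up by label as well as by position. A quick sketch using the same Series as above:

```python
import pandas as pd

a = [3, 6, 7]
s = pd.Series(a, index=["x", "y", "z"])

# Label-based lookup instead of the integer position
print(s["y"])  # 6
```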
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(calories)
print(s)
day1 420
day2 380
day3 390
dtype: int64
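A dict-backed Series can also be filtered at creation time: passing `index` keeps only the listed keys. A small sketch reusing the calories data above:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys named in index are kept
s = pd.Series(calories, index=["day1", "day2"])
print(s)
```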
import pandas as pd
Family={"father":"Chand Basha","Mother":"Fathima","D1":"Farasha","D2":"Sana","D3":"Firoz"}
s=pd.Series(Family)
print(s)
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
import pandas as pd
data={
"Exam": ["Python","Java","Data Science"],
"Marks":[80,90,100]
}
mydata=pd.DataFrame(data)
print(mydata)
Exam Marks
0 Python 80
1 Java 90
2 Data Science 100
import pandas as pd
data={
"Sisters": ["Farasha","Sana"],
"Parents": ["Chand","Fathima"]
}
mydta=pd.DataFrame(data)
print(mydta)
Sisters Parents
0 Farasha Chand
1 Sana Fathima
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas use the loc attribute to return one or more specified row(s)
print(mydta.loc[0])
Sisters Farasha
Parents Chand
Name: 0, dtype: object
print(mydta.loc[1])
Sisters Sana
Parents Fathima
Name: 1, dtype: object
mydata["Rollno"] = [1, 2, 3]   # add the roll-number column shown in the output below
print(mydata.loc[0])
Exam Python
Marks 80
Rollno 1
Name: 0, dtype: object
print(mydata.loc[1])
Exam Java
Marks 90
Rollno 2
Name: 1, dtype: object
Example
import pandas as pd
data = {
"Rollno": [5801, 5802, 5803, 5804],
"Students": ["Pavan", "Kavya", "Firoz", "Manoj"]
}
df = pd.DataFrame(data)
print(df)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Firoz
3 5804 Manoj
print(df.loc[0])
Rollno 5801
Students Pavan
Name: 0, dtype: object
print(df.loc[[1,2]])
Rollno Students
1 5802 Kavya
2 5803 Firoz
Named Indexes
import pandas as pd
data = {
"Emcet marks": [420, 380, 390],
"Rank": [50, 40, 45]
}
# give each row a named index ("Shazi" appears below; the other two labels are placeholders)
df = pd.DataFrame(data, index=["Shazi", "Ravi", "Teja"])
print(df)
Use the named index in the loc attribute to return the specified row(s).
print(df.loc["Shazi"])
A simple way to store big data sets is to use CSV (comma separated values) files.
CSV files contain plain text in a well-known format that can be read by everyone, including Pandas.
When a DataFrame exceeds the max_rows display limit, printing it shows only the first 5 and last 5 rows:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
max_rows
You can check Pandas' display limit with the pd.options.display.max_rows option.
print(pd.options.display.max_rows)
60
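The limit can also be changed. A minimal sketch (10 is an arbitrary value; the default is restored afterwards):

```python
import pandas as pd

pd.options.display.max_rows = 10  # DataFrames longer than this are truncated when printed
print(pd.options.display.max_rows)  # 10

pd.options.display.max_rows = 60  # restore the default
```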
Missing Values
import pandas as pd
data = {
"Rollno": [5801, 5802, 5803, 5804],
"Students": ["Pavan", "Kavya", None, "Manoj"]
}
df = pd.DataFrame(data)
print(df)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj
print(df.isnull())
print(df.isnull().sum())
Rollno Students
0 False False
1 False False
2 False True
3 False False
Rollno 0
Students 1
dtype: int64
df_dropped = df.dropna()
print(df_dropped)
Rollno Students
0 5801 Pavan
1 5802 Kavya
3 5804 Manoj
Note: writing df.dropped = df.dropna() raises a UserWarning, because attribute assignment does not create a DataFrame column; use a plain variable instead.
df_filled = df.fillna(0)
print(df_filled)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 0
3 5804 Manoj
df_bfill = df.bfill()   # fillna(method="bfill") is deprecated; bfill() fills from the next valid row
print(df_bfill)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 Manoj
3 5804 Manoj
df_filled_mean = df.fillna(df.mean(numeric_only=True))
print(df_filled_mean)
Rollno Students
0 5801 Pavan
1 5802 Kavya
2 5803 None
3 5804 Manoj
The only missing value here is in the non-numeric Students column, so the column means have nothing to fill and the output is unchanged.
MultiIndex
- Organized Data: It helps in organizing and structuring data in a more intuitive way.
- Efficient Data Slicing: You can slice and dice data across multiple dimensions easily.
- Enhanced Grouping: Grouping operations become more powerful and flexible.
- Clearer Analysis: Complex data analysis becomes more manageable and understandable.
Creating a MultiIndex
Let’s start by creating a MultiIndex. Assume we have yearly figures for cities in different countries.
Here’s how we can create a MultiIndex DataFrame:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples(
[('India', 'Delhi'), ('India', 'Mumbai'), ('USA', 'New York'), ('USA', 'LA')],
names=['Country', 'City']
)
s = pd.Series([100, 150, 200, 180], index=index)
print(s)
Country City
India Delhi 100
Mumbai 150
USA New York 200
LA 180
dtype: int64
The same index can also be built from parallel arrays with pd.MultiIndex.from_arrays:
arrays = [
['India', 'India', 'USA', 'USA'],
['Delhi', 'Mumbai', 'New York', 'LA']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Country', 'City'])
df = pd.DataFrame({
'2022': [100, 120, 200, 180],
'2023': [130, 140, 220, 190]
}, index=index)
print(df)
2022 2023
Country City
India Delhi 100 130
Mumbai 120 140
USA New York 200 220
LA 180 190
import pandas as pd
data = {
'State': ['Karnataka', 'Karnataka', 'Maharashtra', 'Maharashtra'],
'City': ['Bangalore', 'Mysore', 'Mumbai', 'Pune'],
'2022': [100, 120, 140, 160],
'2023': [110, 130, 150, 170]
}
df = pd.DataFrame(data)
df = df.set_index(['State', 'City'])
print(df)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
stacked = df.stack()
print(stacked)
State City
Karnataka Bangalore 2022 100
2023 110
Mysore 2022 120
2023 130
Maharashtra Mumbai 2022 140
2023 150
Pune 2022 160
2023 170
dtype: int64
unstacked = stacked.unstack()
print(unstacked)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
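unstack() moves the innermost index level into the columns by default, but any level can be chosen by name. A sketch on the same State/City data, unstacking the outer level instead:

```python
import pandas as pd

data = {
    'State': ['Karnataka', 'Karnataka', 'Maharashtra', 'Maharashtra'],
    'City': ['Bangalore', 'Mysore', 'Mumbai', 'Pune'],
    '2022': [100, 120, 140, 160],
    '2023': [110, 130, 150, 170]
}
df = pd.DataFrame(data).set_index(['State', 'City'])

# States become columns, Cities stay in the index;
# combinations that do not exist show up as NaN
print(df['2022'].unstack(level='State'))
```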
print(df)
swapped = df.swaplevel()
print(swapped)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
2022 2023
City State
Bangalore Karnataka 100 110
Mysore Karnataka 120 130
Mumbai Maharashtra 140 150
Pune Maharashtra 160 170
sorted_df = df.sort_index(level=0)
sorted_df2 = df.sort_index(level=1)
sorted_df3 = df.sort_index(level=[0, 1])
print(sorted_df3)
2022 2023
State City
Karnataka Bangalore 100 110
Mysore 120 130
Maharashtra Mumbai 140 150
Pune 160 170
print(df.xs('Karnataka'))
print(df.xs('Mumbai', level='City'))
print(df.loc[('Maharashtra', 'Pune'), '2022'])
2022 2023
City
Bangalore 100 110
Mysore 120 130
2022 2023
State
Maharashtra 140 150
160
df_reset = df.reset_index()
print(df_reset)
State City 2022 2023
0 Karnataka Bangalore 100 110
1 Karnataka Mysore 120 130
2 Maharashtra Mumbai 140 150
3 Maharashtra Pune 160 170
Concat()
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3, 4], 'B': ['z', 'w']})
result = pd.concat([df1, df2])
print(result)
A B
0 1 x
1 2 y
0 3 z
1 4 w
result = pd.concat([df1, df2], axis=1)   # concatenate side by side
print(result)
A B A B
0 1 x 3 z
1 2 y 4 w
Merge
left = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
right = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [85, 90, 95]
})
merged = pd.merge(left, right, on='ID', how='inner')
print(merged)
ID Name Score
0 2 Bob 85
1 3 Charlie 90
merged_left = pd.merge(left, right, on='ID', how='left')
print(merged_left)
ID Name Score
0 1 Alice NaN
1 2 Bob 85.0
2 3 Charlie 90.0
merged_outer = pd.merge(left, right, on='ID', how='outer')
print(merged_outer)
ID Name Score
0 1 Alice NaN
1 2 Bob 85.0
2 3 Charlie 90.0
3 4 NaN 95.0
JOIN()
join() is a convenient method for combining columns of two DataFrames based on the index (by default).
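A minimal join() sketch (these two frames are illustrative, not from the cells below): join() aligns rows on the index, doing a left join by default.

```python
import pandas as pd

left = pd.DataFrame({'val1': ['a', 'b', 'c']}, index=[1, 2, 10])
right = pd.DataFrame({'val2': ['e', 'f', 'g']}, index=[1, 2, 9])

# Left join on the index by default: unmatched rows get NaN
print(left.join(right))

# how='inner' keeps only index values present in both frames
print(left.join(right, how='inner'))
```

Note the contrast with merge(): merge() matches on columns unless told otherwise, while join() matches on the index, and overlapping column names require lsuffix/rsuffix.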
import pandas as pd
d = {'id': [1, 2, 10, 12],
'val1': ['a', 'b', 'c', 'd']}
a = pd.DataFrame(d)
a
id val1
0 1 a
1 2 b
2 10 c
3 12 d
import pandas as pd
d = {'id': [1, 2, 9, 8],
'val2': ['e', 'f', 'g', 'h']}
b = pd.DataFrame(d)
b
id val2
0 1 e
1 2 f
2 9 g
3 8 h
An inner join is the most common type of join you’ll be working with. It returns a DataFrame with only those rows that have common keys, similar to the intersection of two sets.
df = pd.merge(a, b, on='id', how='inner')
df
id val1 val2
0 1 a e
1 2 b f
A full outer join returns all the rows from the left DataFrame and all the rows from the right DataFrame, matching rows where possible and filling the rest with NaNs. If every key appears in both DataFrames, the outer join gives the same output as an inner join.
df = pd.merge(a,b, on='id',how='outer')
df
id val1 val2
0 1 a e
1 2 b f
2 8 NaN h
3 9 NaN g
4 10 c NaN
5 12 d NaN
With a left outer join, all the records from the first DataFrame are kept, whether or not their keys appear in the second DataFrame. From the second DataFrame, only the records whose keys appear in the first DataFrame are included.
df = pd.merge(a,b, on='id',how='left')
df
id val1 val2
0 1 a e
1 2 b f
2 10 c NaN
3 12 d NaN
For a right join, all the records from the second DataFrame are kept. From the first DataFrame, only the records whose keys appear in the second DataFrame are included.
df = pd.merge(a,b, on='id',how='right')
df
id val1 val2
0 1 a e
1 2 b f
2 9 NaN g
3 8 NaN h
To merge the DataFrames on their indices, pass left_index=True and right_index=True; the DataFrames are then merged on the index using the default inner join. Since both frames have an id column, Pandas adds _x/_y suffixes.
df = pd.merge(a, b, left_index=True, right_index=True)
df
id_x val1 id_y val2
0 1 a 1 e
1 2 b 2 f
2 10 c 9 g
3 12 d 8 h
groupby()
The groupby() method allows you to group your data and execute functions on these groups.
import pandas as pd
data={
'Department':['HR','HR','IT','IT','Finance','Finance'],
'Employee':['A','B','C','D','E','F'],
'Salary':[1000,2000,3000,4000,5000,6000]
}
df=pd.DataFrame(data)
grouped=df.groupby('Department')
print(grouped['Salary'].sum())
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
Aggregate()
Aggregation is the process of combining multiple values into a single summary value. In Pandas, aggregation happens after grouping the data with groupby(). It is used to compute summary statistics such as sum, mean, and max:
result=grouped['Salary'].aggregate('sum')
print(result)
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
print(df.groupby('Department')['Salary'].sum())
Department
Finance 11000
HR 3000
IT 7000
Name: Salary, dtype: int64
result = grouped['Salary'].aggregate(['sum', 'mean', 'max'])
print(result)
sum mean max
Department
Finance 11000 5500.0 6000
HR 3000 1500.0 2000
IT 7000 3500.0 4000
df.groupby('Department').agg({'Salary':'sum','Employee':'count'})
Salary Employee
Department
Finance 11000 2
HR 3000 2
IT 7000 2
df.groupby('Department').agg({'Salary':['mean','min'],'Employee':['sum','max']})
Salary Employee
mean min sum max
Department
Finance 5500.0 5000 EF F
HR 1500.0 1000 AB B
IT 3500.0 3000 CD D
import pandas as pd
data = {
'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging',
'Radial Velocity', 'Microlensing', 'Transit', 'Imaging', 'Transit'],
'number': [1, 1, 1, 2, 1, 1, 1, 3, 2, 1],
'orbital_period': [269.3, 874.8, 1.5, 2.2, 4100.0, 763.0, 1000.5, 3.5, 2000.0, 1.0],
'mass': [7.10, 2.21, 0.02, 0.03, 5.00, 2.60, 3.40, 0.01, 6.50, 0.02],
'distance': [77.4, 56.95, 300.0, 150.5, 25.0, 19.84, 4000.0, 80.0, 32.0, 75.0],
'year': [2006, 2008, 2012, 2014, 2005, 2011, 2013, 2015, 2010, 2011]
}
df = pd.DataFrame(data)
print(df)
df.groupby('method')['mass'].mean()
mass
method
Imaging 5.75
Microlensing 3.40
Radial Velocity 3.97
Transit 0.02
dtype: float64
df.groupby('year')['number'].sum()
number
year
2005 1
2006 1
2008 1
2010 2
2011 2
2012 1
2013 1
2014 2
2015 3
dtype: int64
df.groupby(['method','year']).size().unstack(fill_value=0)
year 2005 2006 2008 2010 2011 2012 2013 2014 2015
method
Imaging 1 0 0 1 0 0 0 0 0
Microlensing 0 0 0 0 0 0 1 0 0
Radial Velocity 0 1 1 0 1 0 0 0 0
Transit 0 0 0 0 1 1 0 1 1
df.groupby('method')['distance'].mean()
distance
method
Imaging 28.500000
Microlensing 4000.000000
Radial Velocity 51.396667
Transit 151.375000
dtype: float64
df.groupby('method')['distance'].agg(lambda x: x.max() - x.min())   # spread (max - min) per method
distance
method
Imaging 7.00
Microlensing 0.00
Radial Velocity 57.56
Transit 225.00
dtype: float64
df.groupby('method').filter(lambda x: len(x) > 2)
This keeps only the Radial Velocity and Transit rows, the only methods with more than two records.
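Alongside filter(), groupby() also offers transform(), which returns a result aligned with the original rows rather than one row per group. A sketch on the salary data from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Salary': [1000, 2000, 3000, 4000, 5000, 6000]
})

# Broadcast each department's mean salary back onto every row
df['Dept_mean'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
```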
Pivot table
import pandas as pd
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
df
table = pd.pivot_table(df, index=['A', 'B'])   # default aggfunc is 'mean' on the numeric column C
table
C
A B
Boby Graduate 23.0
John Masters 27.0
Mina Graduate 21.0
Nicky Graduate 24.0
Peter Masters 23.0
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='A', index='C', columns='B', aggfunc='sum')   # 'sum' on strings concatenates the names
print(table)
B Graduate Masters
C
21 Mina NaN
23 Boby Peter
24 Nicky NaN
27 NaN John
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]
})
table = pd.pivot_table(df, values='C', index=['A', 'B'], aggfunc='mean', margins=True)
table
C
A B
Boby Graduate 23.0
John Masters 27.0
Mina Graduate 21.0
Nicky Graduate 24.0
Peter Masters 23.0
All 23.6
import pandas as pd
# Assume a sales DataFrame df with columns Product, Category, Quantity and Amount
# (the cell that creates it is missing); its last rows looked like:
# 6 Broccoli Vegetable 11 62
# 7 Banana Fruit 8 90
pivot = df.pivot_table(index=['Product'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Product
Banana 1091
Beans 626
Broccoli 301
Carrots 270
Orange 610
pivot = df.pivot_table(index=['Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Category
Fruit 1701
Vegetable 1197
pivot = df.pivot_table(index=['Product', 'Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)
Amount
Product Category
Banana Fruit 1091
Beans Vegetable 626
Broccoli Vegetable 301
Carrots Vegetable 270
Orange Fruit 610
pivot = df.pivot_table(index=['Category'],
aggfunc={'Amount': ['mean', 'median', 'min']})
print(pivot)
Amount
mean median min
Category
Fruit 425.25 497.0 90
Vegetable 299.25 254.5 62
In Pandas, you can access string methods using the .str accessor on a Series. Here's a clear overview with examples:
import pandas as pd
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'city': ['New York', 'los angeles', 'Chicago', 'Houston', 'PHOENIX']
}
df = pd.DataFrame(data)
1. Case Conversion
df['name'].str.lower()
name
0 alice
1 bob
2 charlie
3 david
4 eva
dtype: object
df['city'].str.upper()
city
0 NEW YORK
1 LOS ANGELES
2 CHICAGO
3 HOUSTON
4 PHOENIX
dtype: object
df['name'].str.title()
name
0 Alice
1 Bob
2 Charlie
3 David
4 Eva
dtype: object
2. String Matching
contains()
df['name'].str.contains('o')
name
0 False
1 True
2 False
3 False
4 False
dtype: bool
startswith()
df['name'].str.startswith('A')
name
0 True
1 False
2 False
3 False
4 False
dtype: bool
endswith()
df['city'].str.endswith('a')
city
0 False
1 False
2 False
3 False
4 False
dtype: bool
df['name'].str.match('A.*')
name
0 True
1 False
2 False
3 False
4 False
dtype: bool
3. String Replacement
df['name'].str.replace('a','A')
name
0 Alice
1 Bob
2 ChArlie
3 DAvid
4 EvA
dtype: object
4. String Slicing
df['name'].str[0:4]
name
0 Alic
1 Bob
2 Char
3 Davi
4 Eva
dtype: object
df['name'].str.slice(0, 3)
name
0 Ali
1 Bob
2 Cha
3 Dav
4 Eva
dtype: object
5. String Length
df['city'].str.len()
city
0 8
1 11
2 7
3 7
4 7
dtype: int64
6. String Splitting
df['name'].str.split()
name
0 [Alice]
1 [Bob]
2 [Charlie]
3 [David]
4 [Eva]
dtype: object
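With expand=True, split() returns a DataFrame with one column per piece instead of a list per row; a sketch on a small city Series (splitting on the space character):

```python
import pandas as pd

city = pd.Series(['New York', 'los angeles', 'Chicago'])

# One column per part; rows with fewer parts are padded with None
print(city.str.split(' ', expand=True))
```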