The goal is to clean and standardize text values in a DataFrame by replacing patterns with regular expressions (regex). Instead of fixing each string manually, regex lets us detect and update every value that follows a specific pattern.
First, let’s create a sample DataFrame that we’ll use in all the examples:
import pandas as pd
df = pd.DataFrame({
    'City': ['New York (City)', 'Parague', 'New Delhi (Delhi)', 'Venice', 'new Orleans'],
    'Event': ['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
    'Cost': [10000, 5000, 15000, 2000, 12000],
})
df.index = [pd.Period('02-2018'), pd.Period('04-2018'),
            pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]
print(df)
Output
                      City        Event   Cost
2018-02    New York (City)        Music  10000
2018-04            Parague       Poetry   5000
2018-06  New Delhi (Delhi)      Theatre  15000
2018-10             Venice       Comedy   2000
2018-12        new Orleans  Tech_Summit  12000
Now, let’s explore different methods to replace values using regex in Pandas.
Using DataFrame.replace() with regex
The replace() function in Pandas can directly handle regex patterns. With regex=True, it scans every string value in the DataFrame for matches and replaces them in a single call; numeric columns such as Cost are left unchanged.
In this example, every occurrence of "New" or "new" is replaced with "New_".
df_updated = df.replace(to_replace='[nN]ew', value='New_', regex=True)
print(df_updated)
Output
                       City        Event   Cost
2018-02    New_ York (City)        Music  10000
2018-04             Parague       Poetry   5000
2018-06  New_ Delhi (Delhi)      Theatre  15000
2018-10              Venice       Comedy   2000
2018-12        New_ Orleans  Tech_Summit  12000
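Explanation: The regex [nN]ew matches both "New" and "new" and replaces each match with "New_". Because replace() is called on the whole DataFrame and the pattern is unanchored, it applies to every string column and to matches anywhere in a value, not only at the start of a city name.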
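The scope of the match matters once other text columns contain the same letters. The sketch below uses a small made-up DataFrame (the `demo` data and the `loose`/`strict` names are illustrative, not part of the example above) to contrast the unanchored pattern with one that is anchored with `^` and `\b` and limited to the City column through the dict form of replace():

```python
import pandas as pd

# Hypothetical sample data, chosen to show where an unanchored pattern misfires.
demo = pd.DataFrame({
    'City': ['new Orleans', 'Venice'],
    'Event': ['Renewal Expo', 'News Night'],
})

# Unanchored pattern: matches "new"/"New" anywhere, in every string column.
loose = demo.replace(to_replace=r'[nN]ew', value='New_', regex=True)

# Anchored pattern (start of string, followed by a word boundary),
# applied only to the City column via {column: pattern}, {column: value}.
strict = demo.replace({'City': r'^[nN]ew\b'}, {'City': 'New_'}, regex=True)

print(loose)
print(strict)
```

In `loose`, the Event values are altered as well ("Renewal Expo" becomes "ReNew_al Expo"), while `strict` changes only "new Orleans" in the City column.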
Using apply() with a custom regex function
The apply() function lets you define a custom function that uses Python’s re module for pattern matching and string replacement. This is useful when the cleanup logic is more complex than simple substitution.
In this example, city names containing additional details inside brackets (e.g., "New York (City)") are cleaned by removing the bracketed part.
import re
def clean_city(name):
    # Remove the parenthesized part, then trim leftover whitespace.
    return re.sub(r"\(.*\)", "", name).strip()
df['City'] = df['City'].apply(clean_city)
print(df)
Output
                City        Event   Cost
2018-02     New York        Music  10000
2018-04      Parague       Poetry   5000
2018-06    New Delhi      Theatre  15000
2018-10       Venice       Comedy   2000
2018-12  new Orleans  Tech_Summit  12000
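Explanation: The regex \(.*\) matches everything from the first opening parenthesis to the last closing one, re.sub() replaces that span with an empty string, and strip() trims the trailing space left behind. apply() runs this function on each element of the City column.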
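For a plain substitution like this one, the same cleanup can also be written with the vectorized Series.str.replace(). The sketch below is one possible variant, not the method used above: it assumes a made-up `cities` Series containing a missing value and an entry with two bracketed parts, and it swaps the greedy `.*` for the non-greedy `.*?` so that each bracketed group is removed separately.

```python
import pandas as pd

# Hypothetical data: one missing value and one entry with two bracketed parts.
cities = pd.Series(['New York (City)', None, 'Rio (RJ) de Janeiro (Brazil)'])

# Non-greedy \(.*?\) stops at the nearest closing parenthesis, so each
# bracketed group is removed on its own; the leading \s* also consumes
# the whitespace in front of it.
cleaned = cities.str.replace(r'\s*\(.*?\)', '', regex=True).str.strip()

print(cleaned.tolist())
```

The .str accessor passes missing values through as missing, whereas calling re.sub() on None inside apply() raises a TypeError. With the greedy pattern, the last entry would collapse to "Rio", because the match would run from the first "(" to the last ")".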