Replace Values in Pandas Dataframe using Regex

The goal is to clean and standardize text values in a DataFrame by replacing patterns with regular expressions (regex). Instead of fixing each string manually, regex lets us detect and update every value that follows a specific pattern.

First, let’s create a sample DataFrame that we’ll use in all the examples:

Python

import pandas as pd
df = pd.DataFrame({ 'City': ['New York (City)', 'Parague', 'New Delhi (Delhi)', 'Venice', 'new Orleans'],
                    'Event': ['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                    'Cost': [10000, 5000, 15000, 2000, 12000] })
                    
df.index = [pd.Period('02-2018'), pd.Period('04-2018'), 
            pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]
print(df)

Output

                      City        Event   Cost
2018-02    New York (City)        Music  10000
2018-04            Parague       Poetry   5000
2018-06  New Delhi (Delhi)      Theatre  15000
2018-10             Venice       Comedy   2000
2018-12        new Orleans  Tech_Summit  12000

Now, let’s explore different methods to replace values using regex in Pandas.

Using DataFrame.replace() with regex

The replace() method in Pandas accepts regex patterns when regex=True is passed. Called on a DataFrame, it checks the string values in every column and substitutes each match in a single call, which keeps the code concise.

In this example, every occurrence of "New" or "new" is replaced with "New_".

Python

df_updated = df.replace(to_replace='[nN]ew', value='New_', regex=True)
print(df_updated)

Output

                       City        Event   Cost
2018-02    New_ York (City)        Music  10000
2018-04             Parague       Poetry   5000
2018-06  New_ Delhi (Delhi)      Theatre  15000
2018-10              Venice       Comedy   2000
2018-12        New_ Orleans  Tech_Summit  12000

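Explanation: The regex [nN]ew matches both "New" and "new". The pattern is not anchored and replace() is called on the whole DataFrame, so every match in any string column is replaced with "New_". Here only the City column contains a match.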
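Because the pattern above can match in the middle of a word and in any column, it may help to limit the replacement to one column and anchor it to the start of the string. The following is a minimal sketch of that idea using Series.str.replace(). The small DataFrame and the 'Renewal' value are made up for illustration and are not part of the sample data above.

```python
import pandas as pd

# Hypothetical data: 'Renewal' is included only to show an unintended match
demo = pd.DataFrame({
    'City': ['New York (City)', 'new Orleans', 'Venice'],
    'Event': ['Music', 'Tech_Summit', 'Renewal'],
})

# DataFrame-wide and unanchored: also rewrites the 'new' inside 'Renewal'
wide = demo.replace(to_replace='[nN]ew', value='New_', regex=True)

# Column-scoped and anchored: ^ requires the match at the start of the string,
# \b requires a word boundary right after "New"/"new"
scoped = demo.copy()
scoped['City'] = scoped['City'].str.replace(r'^[nN]ew\b', 'New_', regex=True)

print(wide)
print(scoped)
```

In this sketch, wide changes the Event value 'Renewal' as a side effect, while scoped leaves the Event column untouched.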

Using apply() with a custom regex function

The apply() function lets you define a custom function that uses Python’s re module for pattern matching and string replacement. This is useful when the cleanup logic is more complex than simple substitution.

In this example, city names that carry extra detail in parentheses (e.g., "New York (City)") are cleaned by removing the parenthesized part.

Python

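import re

def clean_city(name):
    # Drop the parenthesized part, then trim the leftover whitespace
    return re.sub(r"\(.*\)", "", name).strip()

# Run clean_city on every value in the City column
df['City'] = df['City'].apply(clean_city)
print(df)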

Output

                City        Event   Cost
2018-02     New York        Music  10000
2018-04      Parague       Poetry   5000
2018-06    New Delhi      Theatre  15000
2018-10       Venice       Comedy   2000
2018-12  new Orleans  Tech_Summit  12000

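Explanation: The regex \(.*\) matches an opening parenthesis, everything that follows it, and the last closing parenthesis in the string (the .* is greedy). re.sub() replaces that match with an empty string, and strip() removes the leftover trailing space. apply() runs the function on each element of the City column.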
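The greedy .* matters when a value contains more than one parenthesized group, because it removes everything between the first "(" and the last ")". The sketch below compares it with a pattern that stops at the first ")" of each group, and shows the same cleanup done without apply() through Series.str.replace(). The sample string "Paris (City) Centre (Old)" is a made-up value used only for illustration.

```python
import re
import pandas as pd

# Hypothetical value with two parenthesized groups
sample = "Paris (City) Centre (Old)"

# Greedy: removes from the first "(" to the last ")"
greedy = re.sub(r"\(.*\)", "", sample).strip()

# [^)]* cannot cross a ")", so each group is removed separately;
# \s* also removes the whitespace in front of the group
per_group = re.sub(r"\s*\([^)]*\)", "", sample).strip()

print(greedy)      # Paris
print(per_group)   # Paris Centre

# Vectorized alternative to apply(): the same pattern on a whole column
cities = pd.Series(['New York (City)', 'Parague', 'New Delhi (Delhi)'])
cleaned = cities.str.replace(r"\s*\([^)]*\)", "", regex=True).str.strip()
print(cleaned.tolist())   # ['New York', 'Parague', 'New Delhi']
```

For the sample data in this article, both patterns give the same result, since each city name has at most one parenthesized group.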
