You work as a data analyst for an e-commerce company and have received a customer database that contains duplicate entries. Your task is to clean the data by removing duplicate rows based on email addresses.
Given a DataFrame customers with columns:
- customer_id (int) - Unique identifier for each customer record
- name (object) - Customer's name
- email (object) - Customer's email address
Goal: Remove all duplicate rows where the same email appears multiple times, keeping only the first occurrence of each unique email.
This is a common data preprocessing task in machine learning pipelines and business analytics where data quality is crucial for accurate insights.
Input & Output
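As a minimal sketch of the task, assuming pandas is available, the deduplication could be expressed with `DataFrame.drop_duplicates`. The function name `drop_duplicate_emails` and the sample rows are made up for illustration and are not taken from the original problem.

```python
import pandas as pd


def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # subset="email" compares rows on the email column only;
    # keep="first" retains the earliest row for each distinct email.
    # String comparison is exact, so matching is case-sensitive.
    return customers.drop_duplicates(subset="email", keep="first")


# Illustrative sample data (hypothetical, invented for this sketch).
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "name": ["Ella", "David", "Zachary", "Alice"],
        "email": [
            "emily@example.com",
            "michael@example.com",
            "emily@example.com",   # exact duplicate of row 1, dropped
            "Emily@example.com",   # differs by case, so it is kept
        ],
    }
)

result = drop_duplicate_emails(customers)
print(result)
```

The surviving rows keep their original index labels; calling `reset_index(drop=True)` on the result would renumber them if a contiguous index is wanted.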
Time & Space Complexity
Time: O(n) - a single pass through the n rows, with one hash table lookup per row
Space: O(k) - where k is the number of unique emails (k ≤ n, and smaller whenever duplicates are present)
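The single-pass, hash-table idea behind these bounds can be sketched in plain Python. This is an illustration of the technique only, not a description of how pandas implements it internally. Rows are modeled as dictionaries, and `first_occurrence_rows` is a hypothetical helper name.

```python
def first_occurrence_rows(rows):
    """Return the rows whose email has not appeared earlier in the list."""
    seen = set()   # holds at most k distinct emails
    kept = []
    for row in rows:              # one pass over the n rows
        email = row["email"]
        if email not in seen:     # average O(1) hash lookup
            seen.add(email)
            kept.append(row)
    return kept


# Hypothetical sample rows for illustration.
rows = [
    {"customer_id": 1, "name": "Ella", "email": "a@example.com"},
    {"customer_id": 2, "name": "David", "email": "b@example.com"},
    {"customer_id": 3, "name": "Zachary", "email": "a@example.com"},
]
deduped = first_occurrence_rows(rows)
```

Because each row is visited once and the set stores only distinct emails, the extra memory for `seen` grows with k rather than n (the output list itself is also k rows long).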
Constraints
- 1 ≤ customers.length ≤ 10^4
- customer_id is a positive integer
- name and email are non-empty strings
- Email addresses are case-sensitive
- The first occurrence of each unique email should be preserved