
Drop Duplicate Rows - Problem

Database Easy

You work as a data analyst for an e-commerce company and have received a customer database that contains duplicate entries. Your task is to clean the data by removing duplicate rows based on email addresses.

Given a DataFrame customers with columns:

customer_id (int) - Unique identifier for each customer record
name (object) - Customer's name
email (object) - Customer's email address

Goal: Remove all duplicate rows where the same email appears multiple times, keeping only the first occurrence of each unique email.

This is a common data preprocessing task in machine learning pipelines and business analytics where data quality is crucial for accurate insights.

Input & Output

example_1.py — Basic duplicate removal

Input:
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Alice'],
    'email': ['[email protected]', '[email protected]', '[email protected]']
})

Output:
   customer_id   name  email
0            1  Alice  [email protected]
1            2    Bob  [email protected]

💡 Note: The third row is removed because '[email protected]' already appeared in the first row. We keep the first occurrence and remove subsequent duplicates.

example_2.py — Multiple duplicates

Input:
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
})

Output:
   customer_id     name  email
0            1    Alice  [email protected]
1            2      Bob  [email protected]
2            4  Charlie  [email protected]

💡 Note: Rows 3 and 5 are removed as duplicates. Row 3 has the same email as row 1, and row 5 has the same email as row 2. Only the first occurrence of each unique email is retained.

example_3.py — No duplicates edge case

Input:
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'email': ['[email protected]', '[email protected]', '[email protected]']
})

Output:
   customer_id     name  email
0            1    Alice  [email protected]
1            2      Bob  [email protected]
2            3  Charlie  [email protected]

💡 Note: All emails are unique, so no rows are removed. The result contains the same rows, in the same order, as the input DataFrame.

Visualization


Understanding the Visualization

Start with customer database

Begin with a DataFrame containing customer records with potential duplicates

Track seen emails

Use a hash set to remember which email addresses we've already encountered

Process each row

For each customer record, check if their email is already in our seen set

Keep unique records

If email is new, add it to seen set and include record in result

Skip duplicates

If email already exists, skip this record to eliminate the duplicate

Key Takeaway

🎯 Key Insight: Hash-based deduplication runs in O(n) expected time because checking whether an email was seen before takes constant time on average, so it scales well to large datasets.
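The steps above can be sketched in plain Python, without pandas. This is a minimal illustration, assuming each customer record is a dict with an "email" key; the function name drop_duplicate_emails and the example.com addresses are placeholders chosen for this sketch, not part of the problem statement.

```python
def drop_duplicate_emails(records):
    """Return the records whose email has not appeared earlier, in input order."""
    seen = set()    # emails encountered so far
    unique = []     # first occurrence of each email
    for record in records:
        email = record["email"]
        if email in seen:      # average O(1) membership test
            continue           # duplicate: skip this record
        seen.add(email)
        unique.append(record)
    return unique


# Placeholder data (hypothetical addresses)
customers = [
    {"customer_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com"},
    {"customer_id": 3, "name": "Alice", "email": "alice@example.com"},
]
result = drop_duplicate_emails(customers)
print([r["customer_id"] for r in result])  # [1, 2]
```

Because the comparison is plain string equality, addresses that differ only in letter case are treated as distinct, which matches the case-sensitivity constraint.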

Time & Space Complexity

Time Complexity

O(n)

Single pass through the data with hash table lookups (constant time per lookup on average)

✓ Linear Growth

Space Complexity

O(k)

Where k is the number of unique emails (k ≤ n, so O(n) in the worst case when every email is unique)

✓ Linear Space

Constraints

1 ≤ customers.length ≤ 10⁴
customer_id is a positive integer
name and email are non-empty strings
Email addresses are case-sensitive
The first occurrence of each unique email should be preserved

Asked in

Meta 45, Google 38, Amazon 32, Microsoft 28, Netflix 22

The optimal solution uses pandas' drop_duplicates() method with the subset=['email'] argument to remove duplicate rows based on email address. It keeps the first occurrence of each unique email (keep='first', the default) and relies on internal hash tables for O(n) expected time. For a custom implementation, track seen emails in a hash set during a single pass through the data.

Common Approaches

Approach	Time	Space	Notes
✓ Built-in Method (Optimal)	O(n)	O(k)	Use pandas drop_duplicates() method for optimal performance
Brute Force (Manual Comparison)	O(n²)	O(n)	Compare each row with all other rows to identify duplicates
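For contrast with the built-in method, the brute-force row of the table can be sketched as follows. This is an illustrative sketch only, assuming records are dicts with an "email" key; the function name and the example.com addresses are hypothetical. Each row is compared against every row kept so far, which is where the O(n²) time comes from.

```python
def drop_duplicates_brute_force(records):
    """Keep the first record for each email by scanning the kept list every time."""
    kept = []
    for record in records:
        # Linear scan over everything kept so far: O(n) per row, O(n^2) overall
        is_duplicate = any(k["email"] == record["email"] for k in kept)
        if not is_duplicate:
            kept.append(record)
    return kept


# Placeholder data (hypothetical addresses)
rows = [
    {"customer_id": 1, "email": "alice@example.com"},
    {"customer_id": 2, "email": "bob@example.com"},
    {"customer_id": 3, "email": "alice@example.com"},
    {"customer_id": 4, "email": "charlie@example.com"},
    {"customer_id": 5, "email": "bob@example.com"},
]
print([r["customer_id"] for r in drop_duplicates_brute_force(rows)])  # [1, 2, 4]
```

The result is the same as with a hash set; only the cost of the "have I seen this email?" check differs.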

Built-in Method (Optimal) — Algorithm Steps

Call the drop_duplicates() method on the DataFrame
Pass subset=['email'] so that only the email column is compared
Pass keep='first' (the default) to retain the first occurrence
Reset the index if needed for clean output
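The four steps above might look like this in pandas. This is a sketch under stated assumptions: the wrapper name drop_duplicate_emails and the example.com addresses are placeholders, and resetting the index is optional (it is included here so the row labels run 0, 1, 2 as in the example outputs).

```python
import pandas as pd


def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with the first row kept for each email."""
    return (
        customers
        .drop_duplicates(subset=["email"], keep="first")  # compare the email column only
        .reset_index(drop=True)                           # renumber rows from 0
    )


# Placeholder data (hypothetical addresses)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "email": [
        "alice@example.com", "bob@example.com", "alice@example.com",
        "charlie@example.com", "bob@example.com",
    ],
})

deduped = drop_duplicate_emails(customers)
print(deduped)
```

drop_duplicates() returns a new DataFrame and leaves customers untouched. Without reset_index(drop=True), the surviving rows would keep their original labels (0, 1, 3).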

Visualization


Step-by-Step Walkthrough

Initialize empty hash set

Create a set to track seen email addresses

Process each row

Check if email exists in set - O(1) lookup

Add unique emails

If new email, add to set and include row in result

Skip duplicates

If email already seen, skip the row entirely

Code

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CUSTOMERS 1000
#define MAX_EMAIL_LEN 100
#define HASH_SIZE 1000

typedef struct {
    int customer_id;
    char name[100];
    char email[MAX_EMAIL_LEN];
} Customer;

// Simple hash function for strings
unsigned int hash(const char* str) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash % HASH_SIZE;
}

// Hash table entry
typedef struct HashEntry {
    char email[MAX_EMAIL_LEN];
    int exists;
    struct HashEntry* next;
} HashEntry;

HashEntry* hashTable[HASH_SIZE];

void initHashTable() {
    for (int i = 0; i < HASH_SIZE; i++) {
        hashTable[i] = NULL;
    }
}

int insertAndCheck(const char* email) {
    unsigned int index = hash(email);
    HashEntry* entry = hashTable[index];
    
    // Check if email already exists
    while (entry != NULL) {
        if (strcmp(entry->email, email) == 0) {
            return 0; // Already exists
        }
        entry = entry->next;
    }
    
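    // Insert new email at the head of this bucket's chain
    HashEntry* newEntry = (HashEntry*)malloc(sizeof(HashEntry));
    if (newEntry == NULL) {
        fprintf(stderr, "Out of memory\n");
        exit(EXIT_FAILURE);
    }
    // Bounded copy; snprintf always null-terminates the destination
    snprintf(newEntry->email, MAX_EMAIL_LEN, "%s", email);
    newEntry->exists = 1;
    newEntry->next = hashTable[index];
    hashTable[index] = newEntry;
    
    return 1; // New email added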
}

int dropDuplicateEmails(Customer* customers, int size, Customer* result) {
    initHashTable();
    int uniqueCount = 0;
    
    for (int i = 0; i < size; i++) {
        if (insertAndCheck(customers[i].email)) {
            result[uniqueCount] = customers[i];
            uniqueCount++;
        }
    }
    
    return uniqueCount;
}

void freeHashTable() {
    for (int i = 0; i < HASH_SIZE; i++) {
        HashEntry* entry = hashTable[i];
        while (entry != NULL) {
            HashEntry* temp = entry;
            entry = entry->next;
            free(temp);
        }
    }
}

int main() {
    Customer customers[] = {
        {1, "Alice", "[email protected]"},
        {2, "Bob", "[email protected]"},
        {3, "Alice", "[email protected]"},
        {4, "Charlie", "[email protected]"},
        {5, "Bob", "[email protected]"}
    };
    
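    // Derive the element count from the array instead of hard-coding it
    int size = sizeof(customers) / sizeof(customers[0]);
    Customer result[MAX_CUSTOMERS];
    int uniqueCount = dropDuplicateEmails(customers, size, result);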
    
    printf("Unique customers:\n");
    for (int i = 0; i < uniqueCount; i++) {
        printf("ID: %d, Name: %s, Email: %s\n", 
               result[i].customer_id, result[i].name, result[i].email);
    }
    
    freeHashTable();
    return 0;
}
