Match companies between two datasets based on company names and locations, and produce a merged output.
- Python
- pandas, numpy, matplotlib
- Match companies between Dataset 1 and Dataset 2 based on:
  - company name
  - location information
- Create a merged dataset that:
  - contains all unique companies from Dataset 1
  - includes the corresponding company matches from Dataset 2 where they exist
  - contains a column listing the company's locations from Dataset 1
  - contains a column listing the matched company's locations from Dataset 2
  - contains a column with the locations that appear in both lists
  - if no locations overlap, keeps the company name match and leaves the overlapping locations column empty
- Calculate the following metrics:
  - match rate: % of Dataset 1 companies that have a match in Dataset 2
  - unmatched records: % of companies, across both datasets, that have no match in the other dataset
  - one-to-many matches: % of companies with multiple matched entries
  - any other metrics you consider useful
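
The matching, merging and metric requirements above can be sketched roughly as follows. This is a minimal illustration, not a reference solution. It assumes each dataset is in long format with hypothetical columns `company_name` and `location` (one row per company and location), that companies are matched on an exact normalized name, and that the suffix list in `LEGAL_SUFFIXES` is only a placeholder. The metric definitions, in particular the denominator for unmatched records, are one possible reading of the brief.

```python
import re
import pandas as pd

# Placeholder suffix list (assumption); a real one should come from inspecting the data.
LEGAL_SUFFIXES = r"\b(inc|llc|ltd|corp|co|gmbh)\b"

def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop legal suffixes, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", str(name).lower())
    cleaned = re.sub(LEGAL_SUFFIXES, " ", cleaned)
    return " ".join(cleaned.split())

def collect(df: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """One row per (normalized key, original name) with a sorted list of unique locations."""
    prepared = df.assign(
        key=df["company_name"].map(normalize_name),
        location=df["location"].str.strip().str.lower(),
    )
    grouped = (
        prepared.groupby(["key", "company_name"])["location"]
        .agg(lambda s: sorted(set(s.dropna())))
        .reset_index()
    )
    return grouped.rename(
        columns={"company_name": f"name_{suffix}", "location": f"locations_{suffix}"}
    )

def build_merged(ds1: pd.DataFrame, ds2: pd.DataFrame) -> pd.DataFrame:
    """Left-join Dataset 1 to Dataset 2 on the normalized name and add the location overlap."""
    merged = collect(ds1, "ds1").merge(collect(ds2, "ds2"), on="key", how="left")
    # Rows without a Dataset 2 match hold NaN instead of a list, so the overlap stays empty.
    merged["overlapping_locations"] = [
        sorted(set(a) & set(b)) if isinstance(b, list) else []
        for a, b in zip(merged["locations_ds1"], merged["locations_ds2"])
    ]
    return merged

def compute_metrics(merged: pd.DataFrame, ds2: pd.DataFrame) -> dict:
    """Percentages per the brief; denominators are assumptions stated in the comments."""
    # Number of non-null Dataset 2 matches per Dataset 1 company.
    matches = merged.groupby("name_ds1")["name_ds2"].count()
    ds2_companies = collect(ds2, "ds2")
    ds2_unmatched = (~ds2_companies["key"].isin(merged["key"])).sum()
    # Assumed denominator for unmatched records: all companies in both datasets.
    total = len(matches) + len(ds2_companies)
    return {
        "match_rate_pct": round(float((matches > 0).mean() * 100), 1),
        "one_to_many_pct": round(float((matches > 1).mean() * 100), 1),
        "unmatched_pct": round(float(((matches == 0).sum() + ds2_unmatched) / total * 100), 1),
    }

# Tiny made-up example, for illustration only.
ds1 = pd.DataFrame({
    "company_name": ["Acme Inc.", "Acme Inc.", "Globex LLC", "Initech"],
    "location": ["Berlin", "Paris", "London", "Austin"],
})
ds2 = pd.DataFrame({
    "company_name": ["ACME", "Acme Corp", "Globex", "Umbrella Ltd"],
    "location": ["berlin", "Munich", "Madrid", "Tokyo"],
})

merged = build_merged(ds1, ds2)
metrics = compute_metrics(merged, ds2)
print(merged)
print(metrics)
```

Exact matching on a normalized key is the simplest option; fuzzy matching could be swapped in if the data shows many near-duplicate spellings. Before writing the CSV with `merged.to_csv(...)`, the list columns may be easier to read if joined into delimited strings.
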
- Merged dataset (CSV)
- Code scripts
- Documentation:
  - matching approach
  - data quality issues found
  - normalization / transformations applied
  - calculated metrics