A powerful, flexible framework for entity resolution and record linkage that uses DuckDB as its database engine. It builds on the work of Who Owns Chicago by the Mansueto Institute for Urban Innovation, including contributions from Kevin Bryson, Ana (Anita) Restrepo Lachman, Caitlin P., Joaquin Pinto, and Divij Sinha.
This package enables you to load data from various sources, clean and standardize entity names and addresses, and create links between entities based on exact and fuzzy matching techniques.
Source: https://github.com/mansueto-institute/mi-chainlink
Documentation: https://mansueto-institute.github.io/mi-chainlink/
Issues: https://github.com/mansueto-institute/mi-chainlink/issues
This framework helps you solve the entity resolution problem by:
- Loading data from multiple sources into a DuckDB database
- Cleaning and standardizing entity names and addresses
- Creating exact matches between entities based on names and addresses
- Generating fuzzy matches using TF-IDF similarity
- Exporting the resulting linked data for further analysis
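To make the cleaning, exact-match, and fuzzy-match steps above concrete, here is a minimal standalone sketch in plain Python. It is not the package's implementation: the helper names (`clean_name`, `char_ngrams`, `tfidf_vectors`, `cosine`), the cleaning rules, the use of character trigrams, and the sample names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def clean_name(name: str) -> str:
    """Uppercase, replace punctuation with spaces, collapse whitespace (illustrative rules)."""
    name = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(name.split())


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Split a string into overlapping character n-grams, padded with one space per side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(names: list[str]) -> list[dict[str, float]]:
    """Build L2-normalized TF-IDF vectors over character trigrams of the cleaned names."""
    docs = [Counter(char_ngrams(clean_name(name))) for name in names]
    # Document frequency: in how many names each trigram appears at least once.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF, so trigrams present in every name keep a positive weight.
        weights = {
            gram: count * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, count in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(gram, 0.0) for gram, weight in a.items())


# Hypothetical sample records, not taken from any real dataset.
names = ["Acme Holdings LLC", "ACME HOLDINGS, L.L.C.", "Lakeview Properties Inc"]
vectors = tfidf_vectors(names)

# Exact match: two spellings that agree once cleaned.
exact = clean_name("Acme Holdings LLC") == clean_name("acme holdings, llc.")

# Fuzzy match: the dotted "L.L.C." variant is not an exact match after cleaning,
# but it scores far higher against "Acme Holdings LLC" than an unrelated name does.
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

In a sketch like this, pairs that agree after cleaning would be linked exactly, and remaining pairs whose similarity clears a chosen threshold would become fuzzy-match candidates. The threshold itself is a tuning decision and is not specified here.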
The system is designed to be configurable through YAML files and supports incremental updates to an existing database.
The package is available on PyPI. You can install it using pip or uv:
pip install mi-chainlink
uv add mi-chainlink
# Run interactive session
chainlink
# Run with path to config yaml
chainlink path/to/config.yaml
# Or call it from Python
from chainlink import chainlink
chainlink(
    config: dict,  # dict with config details
    config_path: str | Path = DIR / "configs/config.yaml",  # path to store dict post processing
)