This is the repository for PatCID: an open-access dataset of chemical structures in patent documents. Each molecule in PatCID is linked to the patent documents that display it.
If you find this repository useful, please consider citing:
@article{Morin2024,
title = {{PatCID: an open-access dataset of chemical structures in patent documents}},
author = {Morin, Lucas and Weber, Val{\'e}ry and Meijer, Gerhard Ingmar and Yu, Fisher and Staar, Peter W. J.},
year = 2024,
month = {Aug},
day = {02},
journal = {Nature Communications},
volume = 15,
number = 1,
pages = 6532,
doi = {10.1038/s41467-024-50779-y},
issn = {2041-1723},
url = {https://doi.org/10.1038/s41467-024-50779-y}
}
Create a virtual environment.
conda create -n patcid python=3.11
conda activate patcid
Install poppler.
Linux: apt-get install poppler-utils
Mac: brew install poppler
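To confirm that poppler is reachable from the activated environment, a minimal check can look for one of its command-line tools on the PATH. The sketch below assumes that finding pdftoppm (shipped with both poppler-utils and the Homebrew poppler package) is a good enough signal. The helper name poppler_available is illustrative and not part of this repository.

```python
import shutil


def poppler_available(tool: str = "pdftoppm") -> bool:
    """Return True if the given poppler command-line tool is found on the PATH."""
    return shutil.which(tool) is not None


if __name__ == "__main__":
    if poppler_available():
        print("poppler found")
    else:
        print("poppler not found: install it with apt-get or brew as described above")
```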
Install Python dependencies.
pip install -e .
The PatCID dataset is available on Zenodo.
wget "https://zenodo.org/records/10572870/files/patcid.zip?download=1" -O patcid.zip
unzip patcid.zip -d ./data/patcid/
(Download size: 5.7 GB, file format: .jsonl)
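Since the files are in .jsonl format (one JSON object per line), they can be streamed without loading the whole dataset into memory. The following is a minimal sketch, not the loader used by this repository. It assumes only that the archive was unzipped into ./data/patcid/ as above, and it makes no assumption about the record schema. The function name iter_jsonl_records is illustrative.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(data_dir: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every .jsonl file under data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return  # nothing to read if the directory does not exist
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Inspect the first record to discover the actual field names.
    first = next(iter_jsonl_records("./data/patcid/"), None)
    if isinstance(first, dict):
        print(sorted(first.keys()))
    else:
        print("no .jsonl records found")
```

Printing the keys of the first record is a quick way to see the real schema before writing any query code.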
Run the notebook ./examples/molecule_query.ipynb to use PatCID to retrieve documents referencing a molecule of interest.
Run the notebook ./examples/patent_query.ipynb to use PatCID to retrieve molecules displayed in a given patent document.
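Both queries amount to looking up the same molecule-document links in opposite directions. The sketch below illustrates that idea only; it is not the notebooks' implementation. The field names "smiles" and "patent_id" are hypothetical placeholders, so check the actual PatCID records and adjust them. Matching is by exact string, which will miss molecules written with a different but equivalent identifier.

```python
from collections import defaultdict
from typing import Iterable

# Hypothetical field names: inspect the real PatCID records and adjust.
MOLECULE_KEY = "smiles"
DOCUMENT_KEY = "patent_id"


def build_indexes(
    records: Iterable[dict],
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Return (molecule -> documents, document -> molecules) lookup tables."""
    docs_by_molecule: defaultdict[str, set[str]] = defaultdict(set)
    molecules_by_doc: defaultdict[str, set[str]] = defaultdict(set)
    for record in records:
        molecule = record.get(MOLECULE_KEY)
        document = record.get(DOCUMENT_KEY)
        if molecule is None or document is None:
            continue  # skip records missing either field
        docs_by_molecule[molecule].add(document)
        molecules_by_doc[document].add(molecule)
    return dict(docs_by_molecule), dict(molecules_by_doc)


if __name__ == "__main__":
    # Toy records with made-up identifiers, for illustration only.
    toy = [
        {"smiles": "CCO", "patent_id": "DOC-A"},
        {"smiles": "CCO", "patent_id": "DOC-B"},
        {"smiles": "c1ccccc1", "patent_id": "DOC-A"},
    ]
    docs_by_molecule, molecules_by_doc = build_indexes(toy)
    print(sorted(docs_by_molecule["CCO"]))    # documents showing this molecule
    print(sorted(molecules_by_doc["DOC-A"]))  # molecules shown in this document
```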
user_interface.mp4
To request access to the above user interface, please contact IBM's Deep Search team at [email protected].
The benchmark datasets D2C-UNI and D2C-RND are available on Zenodo.
The code repositories used to build and evaluate PatCID are available:
For segmenting chemical-structure images from documents, we use DECIMER Segmentation from K. Rajan, H. O. Brinkhaus, M. Sorokina, A. Zielesny and C. Steinbeck.
The model weights are available on Hugging Face:
- The classification model
- The recognition model
The training datasets are available on Zenodo and Hugging Face:
To test our processing pipeline outside its main application domain, we processed a scientific publication from ChemRxiv. ./data/extra/scientific_paper_example/ contains the pages of the document (page_*.png), annotated with the segmentation and classification predictions. For pages containing molecules, the predicted molecules are provided in page_*_molecules.txt.
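The predictions for this example can be collected with a few lines of Python. This is a minimal sketch that assumes each page_*_molecules.txt file holds one predicted molecule per line. That layout is an assumption, so open one file to confirm it first. The function name load_predicted_molecules is illustrative.

```python
from pathlib import Path


def load_predicted_molecules(example_dir: str) -> dict[str, list[str]]:
    """Map each page_*_molecules.txt file name to its non-empty, stripped lines."""
    predictions: dict[str, list[str]] = {}
    for path in sorted(Path(example_dir).glob("page_*_molecules.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        predictions[path.name] = [line.strip() for line in lines if line.strip()]
    return predictions


if __name__ == "__main__":
    found = load_predicted_molecules("./data/extra/scientific_paper_example/")
    for file_name, molecules in found.items():
        print(f"{file_name}: {len(molecules)} predicted molecule(s)")
```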