DS4SD/PatCID

PatCID

This is the repository for PatCID, an open-access dataset of chemical structures in patent documents. PatCID links each molecule to the patent documents that display it.

Citation

If you find this repository useful, please consider citing:

@article{Morin2024,
	title        = {{PatCID: an open-access dataset of chemical structures in patent documents}},
	author       = {Morin, Lucas and Weber, Val{\'e}ry and Meijer, Gerhard Ingmar and Yu, Fisher and Staar, Peter W. J.},
	year         = 2024,
	month        = {Aug},
	day          = {02},
	journal      = {Nature Communications},
	volume       = 15,
	number       = 1,
	pages        = 6532,
	doi          = {10.1038/s41467-024-50779-y},
	issn         = {2041-1723},
	url          = {https://doi.org/10.1038/s41467-024-50779-y}
}

Installation

Create a virtual environment.

conda create -n patcid python=3.11
conda activate patcid

Install poppler.

Linux: apt-get install poppler-utils 
Mac: brew install poppler 

Install python dependencies.

pip install -e .

Download PatCID Dataset

The PatCID dataset is available on Zenodo.

wget "https://zenodo.org/records/10572870/files/patcid.zip?download=1" -O patcid.zip
unzip patcid.zip -d ./data/patcid/

(Download size: 5.7 GB; file format: .jsonl)

Document Retrieval

Run the notebook ./examples/molecule_query.ipynb to use PatCID to retrieve documents referencing a molecule of interest.

Molecule Retrieval

Run the notebook ./examples/patent_query.ipynb to use PatCID to retrieve molecules displayed in a given patent document.

User Interface

user_interface.mp4

To request access to the above user interface, please contact IBM's Deep Search team at [email protected].

Benchmark Datasets

The benchmark datasets D2C-UNI and D2C-RND are available on Zenodo.

Code

The code repositories used to build and evaluate PatCID are available:

For segmenting chemical-structure images from documents, we use DECIMER Segmentation from K. Rajan, H. O. Brinkhaus, M. Sorokina, A. Zielesny and C. Steinbeck.

Models

The model weights are available on Hugging Face:

Training Datasets

The training datasets are available on Zenodo and Hugging Face:

Additional Visualization

To test our processing pipeline outside its main application domain, we processed a scientific publication from ChemRxiv. ./data/extra/scientific_paper_example/ contains the pages of the document (page_*.png), annotated with the segmentation and classification predictions. For pages containing molecules, the predicted molecules are provided in page_*_molecules.txt.

About

[Nat. Commun.] PatCID: an open-access dataset of chemical structures in patent documents
