Welcome to the official repository for M3-20M, the first large-scale Multi-Modal Molecular dataset, containing over 20 million molecules! 🎉 If our dataset is useful for your work, please cite our paper or give our GitHub project a star. Your support is very helpful to our work!
The dataset is available for download from multiple sources:
- Google Drive: Download Link
- Hugging Face: Download Link
- Baidu Cloud: Download Link (password: ADMS)
M3-20M (Multi-Modal Molecular dataset) is designed to support AI-driven drug design and discovery. Its unprecedented scale benefits the training and fine-tuning of large models, improving their performance on drug design and discovery tasks.
- Scale: Contains over 20 million molecules, 71 times more than the largest existing dataset.
- Comprehensive Modalities:
  - One-dimensional SMILES strings
  - Two-dimensional molecular graphs
  - Three-dimensional molecular structures
  - Physicochemical properties
  - Text descriptions
- Diverse Applications: Supports various downstream tasks such as molecule generation, molecular property prediction, lead optimization, virtual screening, pharmacokinetics modeling, and drug-target interaction prediction.
M3-20M integrates data from multiple sources to provide a comprehensive view of each molecule. Here’s what you can find in the dataset:
- M^3_Original.csv: Descriptions from PubChem
- M^3_Physicochemical.csv: Physicochemical properties
- M^3_Description_Physicochemical.csv: Descriptions composed of physicochemical properties
- M^3_Multi.csv: Descriptions from PubChem, descriptions composed of physicochemical properties, and descriptions generated by GPT-3.5
- MPP folder: Contains multimodal datasets for molecular property prediction (BBBP-MM, BACE-MM, HIV-MM, ClinTox-MM, Tox21-MM)
- MOSES-Multi folder: Contains MOSES multimodal datasets for molecular generation
- QM9-Multi folder: Contains QM9 multimodal datasets
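After downloading, it can help to confirm that the files and folders listed above are all present. The following is a minimal sketch, assuming everything was extracted into a single directory; that layout and the helper name `missing_entries` are assumptions for illustration, not part of the repository.

```python
from pathlib import Path

# File and folder names as listed above.
EXPECTED_FILES = [
    "M^3_Original.csv",
    "M^3_Physicochemical.csv",
    "M^3_Description_Physicochemical.csv",
    "M^3_Multi.csv",
]
EXPECTED_FOLDERS = ["MPP", "MOSES-Multi", "QM9-Multi"]


def missing_entries(root):
    """Return the expected files and folders that are absent under `root`.

    Hypothetical helper: assumes all entries sit directly inside `root`.
    """
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
    return missing


# Example usage (replace the path with your own download location):
# print(missing_entries("path-to-dataset-folder"))
```

An empty list means every expected entry was found.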
We provide convenient functions that allow you to easily obtain the dataset, as well as the 2D and 3D representations of any molecule outside the dataset. The specific functions can be found in the Function folder.
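For orientation, here is a rough sketch of how 2D and 3D representations can be derived from a SMILES string. It uses RDKit, which is an assumption on our part: it is not necessarily what the functions in the Function folder use, and the helper names `smiles_to_graph` and `smiles_to_3d_coords` are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_graph(smiles):
    """Return (atom symbols, bonds) as a simple 2D graph view of a molecule.

    Each bond is a tuple (begin_atom_index, end_atom_index, bond_order).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    bonds = [
        (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
        for b in mol.GetBonds()
    ]
    return atoms, bonds


def smiles_to_3d_coords(smiles, seed=42):
    """Embed a molecule in 3D and return (mol_with_hydrogens, coordinates).

    Coordinates are an (n_atoms, 3) NumPy array that includes explicit hydrogens.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens give more realistic geometries
    if AllChem.EmbedMolecule(mol, randomSeed=seed) != 0:
        raise RuntimeError(f"3D embedding failed for {smiles!r}")
    AllChem.MMFFOptimizeMolecule(mol)  # quick force-field relaxation
    return mol, mol.GetConformer().GetPositions()


# Example usage:
# atoms, bonds = smiles_to_graph("CCO")
# mol3d, coords = smiles_to_3d_coords("CCO")
```

The fixed random seed makes the embedding reproducible across runs. Force-field geometries of this kind are approximations and may differ from the 3D structures shipped with the dataset.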
Here’s a simple example of how to load and explore the dataset:
import pandas as pd
# Load the dataset
df = pd.read_csv('path-to-dataset.csv')
# Display the first few rows
print(df.head())
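With more than 20 million molecules, loading a full CSV at once may exceed available memory. One option is to stream the file in chunks with the `chunksize` parameter of `pd.read_csv`. The sketch below is illustrative only: the function name `count_rows_with_text` and the column name passed to it are assumptions, so check the actual headers (for example with `df.columns`) before using it.

```python
import pandas as pd


def count_rows_with_text(csv_path, column, chunksize=100_000):
    """Stream a large CSV and count total rows and rows where `column` is non-empty.

    Hypothetical helper; `column` must match a real header in the file.
    """
    total = 0
    non_empty = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        total += len(chunk)
        non_empty += int(chunk[column].notna().sum())
    return total, non_empty


# Example usage (the column name "description" is a placeholder):
# total, described = count_rows_with_text("path-to-dataset.csv", "description")
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size.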
We welcome contributions from the community! Feel free to submit issues or pull requests to help improve the dataset and its applications.
This project is licensed under the MIT License - see the LICENSE file for details.
We gratefully acknowledge the use of data from the PubChem, ZINC, and QM9 databases in this study. The SMILES data utilized in our work were sourced from these essential resources, which provide invaluable chemical information. We appreciate their efforts in compiling and maintaining comprehensive datasets. If our dataset proves helpful in your research endeavors, please remember to cite PubChem, ZINC, and QM9 accordingly.
This dataset is a collaborative effort by researchers from Tongji University and Fudan University. We thank Siyuan Guo, Lexuan Wang, Chang Jin, Jinxian Wang, Han Peng, Huayang Shi, Wengen Li, Jihong Guan, and Shuigeng Zhou for their contributions and support.
For any questions or inquiries, please reach out to [email protected].
Enjoy using M3-20M and happy researching! 🚀🔬
If you use the M3-20M dataset in your research, please cite our paper:
BibTeX Citation
@article{doi:10.1142/S0219720025500064,
author = {Guo, Siyuan and Wang, Lexuan and Jin, Chang and Wang, Jinxian and Peng, Han and Shi, Huayang and Li, Wengen and Guan, Jihong and Zhou, Shuigeng},
title = {M3-20M: A large-scale multi-modal molecule dataset for AI-driven drug design and discovery},
journal = {Journal of Bioinformatics and Computational Biology},
volume = {23},
number = {02},
pages = {2550006},
year = {2025},
doi = {10.1142/S0219720025500064},
note = {PMID: 40494666}
}