MolGenBench: CodeBase for "Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench"

🔔 News

[2026-01-09] We have released Version 3 of the dataset. In this release, we added pre-computed evaluation CSV results to each results folder. We also uploaded the aggregated results under the paper_results folder in the repository.

[2025-12-12] We have released Version 2 of the dataset. Compared with Version 1, this update additionally includes precomputed InChI and SMILES for active molecules, which significantly accelerates the HitRediscovery calculation.

🛠️ Environment Setup

conda create --name MolGenBench python=3.11
conda activate MolGenBench
mamba install -c conda-forge numpy pandas seaborn scipy -y
pip install --use-pep517 EFGs
pip install rdkit==2025.9.1 prolif==2.0.3 mdanalysis==2.7.0
pip install posebusters==0.3.1 spyrmsd
pip install tqdm joblib pytest swifter medchem
mamba install -c conda-forge lilly-medchem-rules
mamba install openbabel

# for vina docking
pip install meeko==0.1.dev3 scipy pdb2pqr vina
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3

# for posecheck evaluation
git clone https://github.com/cch1999/posecheck.git
cd posecheck
git checkout 57a1938  # commit used for the strain energy calculation in our paper
pip install -e .
pip install -r requirements.txt
conda install -c mx reduce

📦 Datasets & Benchmark Results

Please download the dataset and benchmark results from Zenodo and unzip the files on your device. The downloaded dataset already follows the required folder structure, so you can use it directly for evaluation without any reorganization.

📁 Required Directory Structure

Note

If you generate molecules with your own model, please ensure that your output is saved following the same directory structure as the official dataset. Below is the expected structure for each UniProt target:

P12345/     # UniProt ID
├─ reference_active_molecules/
├─ Round1/
│  ├─ De_novo_Results/
│  │  └─ <YOUR_MODEL_NAME>/
│  │     ├─ results/    # Evaluation results will be saved here
│  │     ├─ P12345_<YOUR_MODEL_NAME>.sdf
│  │     └─ P12345_<YOUR_MODEL_NAME>_vina_docked.sdf     # Docked pose generated using vina_docking.py
│  │
│  └─ Hit_to_Lead_Results/
│     └─ Sries001/      # Series ID
│        └─ <YOUR_MODEL_NAME>/
│           ├─ results/ 
│           ├─ P12345_Sries001_<YOUR_MODEL_NAME>.sdf
│           └─ P12345_Sries001_<YOUR_MODEL_NAME>_vina_docked.sdf
│
├─ P12345_prep.pdb
├─ P12345_pocket10.pdb
└─ P12345_lig.sdf
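
As a rough illustration of the layout above, the following Python sketch builds the path where a generated (or Vina-docked) SDF file is expected to live. The helper name `expected_sdf_path` and its parameters are hypothetical and not part of the MolGenBench codebase; only the folder and file naming are taken from the tree shown above.

```python
from pathlib import Path
from typing import Optional


def expected_sdf_path(
    data_path: str,
    uniprot_id: str,
    model_name: str,
    round_name: str = "Round1",
    series_id: Optional[str] = None,
    docked: bool = False,
) -> Path:
    """Return the expected location of a generated or docked SDF file.

    Hypothetical helper: pass series_id for Hit_to_Lead_Results and
    leave it as None for De_novo_Results.
    """
    suffix = "_vina_docked.sdf" if docked else ".sdf"
    round_dir = Path(data_path) / uniprot_id / round_name
    if series_id is None:
        # <UniprotID>/<Round>/De_novo_Results/<model>/<UniprotID>_<model>.sdf
        model_dir = round_dir / "De_novo_Results" / model_name
        stem = f"{uniprot_id}_{model_name}"
    else:
        # <UniprotID>/<Round>/Hit_to_Lead_Results/<series>/<model>/...
        model_dir = round_dir / "Hit_to_Lead_Results" / series_id / model_name
        stem = f"{uniprot_id}_{series_id}_{model_name}"
    return model_dir / f"{stem}{suffix}"
```

Calling `expected_sdf_path("/path/to/data", "P12345", "MyModel").parent.mkdir(parents=True, exist_ok=True)` before writing output is one way to make sure the model folder exists.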

🧬 Run Generation with Your Model

This section explains how to use the MolGenBench data structure to run your own molecule generation model. After molecules are generated, they can be evaluated with the unified MolGenBench evaluation pipeline.

MolGenBench supports two generation scenarios:

De novo design
Hit-to-lead optimization

De novo design

Input requirement:

You may use either the full protein (*_prep.pdb) or the pocket file (*_pocket10.pdb) as model input; the ligand file (*_lig.sdf) is used to define the pocket position.

Generation rules:

Set the generation to 1000 samples per UniProt ID.
Do NOT perform any filtering or post-processing beyond what your model naturally produces.
→ If your model produces fewer than 1000 valid molecules, keep all valid molecules it can generate.
Save the generated molecules as <UniprotID>_<YOUR_MODEL_NAME>.sdf; each molecule must have its _Name field set to a unique index (0, 1, 2, …) to distinguish individual samples.
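
The naming rule can be checked before evaluation with a small standard-library sketch. It relies on the fact that the title line (the first line of each SDF record) is what RDKit exposes as the `_Name` property. The function names `sdf_titles` and `names_are_unique_indices` are hypothetical and not part of MolGenBench.

```python
from typing import List


def sdf_titles(sdf_text: str) -> List[str]:
    """Return the title line (first line) of every complete SDF record."""
    titles: List[str] = []
    current: List[str] = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # "$$$$" terminates a record; its first line is the title.
            if current:
                titles.append(current[0].strip())
            current = []
        else:
            current.append(line)
    return titles


def names_are_unique_indices(titles: List[str]) -> bool:
    """True if every title is an ASCII digit string and no title repeats."""
    all_indices = all(t.isascii() and t.isdigit() for t in titles)
    return all_indices and len(set(titles)) == len(titles)
```

For a file on disk, something like `names_are_unique_indices(sdf_titles(open(path).read()))` would report whether the titles satisfy the rule; `len(sdf_titles(...))` gives the number of records, which can be compared against the sample budget.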

Hit-to-lead optimization

Hit-to-lead optimization is performed per chemical series.

Generation rules:

Set the generation to 200 samples per Series ID.
Do NOT perform any filtering or post-processing beyond what your model naturally produces.
→ If your model produces fewer than 200 valid molecules, keep all valid molecules it can generate.
Save the generated molecules as <UniprotID>_<SeriesID>_<YOUR_MODEL_NAME>.sdf; each molecule must have its _Name field set to a unique index (0, 1, 2, …) to distinguish individual samples.

Depending on your model type, use different inputs:

1. Structure-based generation models

Input requirement:

Protein pocket: <UniprotID>_pocket10.pdb or <UniprotID>_prep.pdb
Reference ligand pose for the specific series: <UniprotID>_<SeriesID>_reference_ligand_pose_with_h.sdf
Conserved scaffold extraction for model input: The conserved core scaffold is extracted using the corresponding scaffold SMARTS defined in top5_common_scaffold_info.csv for each series and provided as the input to the model. We provide a concrete example in get_h2l_scaf.py.

2. Ligand-based generation models

Input requirement:

Use ONLY:

<UniprotID>_<SeriesID>_reference_ligand_pose_with_h.sdf

⚙️ Running the Evaluation

After generating molecules with your model, you can evaluate them using the unified evaluation pipeline provided in eval.py.

# --mode is either "De_novo_Results" or "Hit_to_Lead_Results"
# --fixStereoFrom3D: omit this flag if your model directly outputs SMILES
python eval.py \
    --data_path "/path/to/data" \
    --round_name "Round1" \
    --mode "De_novo_Results" \
    --model_name "YOUR_MODEL_NAME" \
    --fixStereoFrom3D
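
To evaluate both modes in one go, the command can be assembled from Python. This is a minimal sketch: the wrapper functions `build_eval_command` and `run_all_modes` are hypothetical, and only the flag names (`--data_path`, `--round_name`, `--mode`, `--model_name`, `--fixStereoFrom3D`) come from the command shown here. It assumes `eval.py` is in the current working directory.

```python
import subprocess
import sys
from typing import List

MODES = ("De_novo_Results", "Hit_to_Lead_Results")


def build_eval_command(
    data_path: str,
    model_name: str,
    mode: str,
    round_name: str = "Round1",
    fix_stereo_from_3d: bool = True,
) -> List[str]:
    """Assemble the eval.py argument list for one mode (hypothetical wrapper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    cmd = [
        sys.executable, "eval.py",
        "--data_path", data_path,
        "--round_name", round_name,
        "--mode", mode,
        "--model_name", model_name,
    ]
    if fix_stereo_from_3d:
        # Leave this flag out when the model outputs SMILES rather than 3D poses.
        cmd.append("--fixStereoFrom3D")
    return cmd


def run_all_modes(data_path: str, model_name: str) -> None:
    """Run eval.py once per mode, stopping at the first failure."""
    for mode in MODES:
        subprocess.run(build_eval_command(data_path, model_name, mode), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.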

You can comment out metrics in eval.py that you do not wish to compute. For example:

evaluator = Evaluator(
    [
        "Validity", 
        "QED",
        "SA",
        "Uniqueness",
        "Diversity",
        "MotifDist",
        "ChemFilter",
        "HitRediscover",

        # ----below are 3D metrics----
        # "PoseBuster", # comment out if 3D checks are not needed
        # "StrainEnergy",
        # "RMSD",
        # "InteractionScore",
        # "ClashScore",
    ]
)

Tip

RMSD metrics require the *_vina_docked.sdf files.
Run vina_docking.py to generate the docked poses before evaluation:

# --mode is either "De_novo_Results" or "Hit_to_Lead_Results"
python vina_docking.py \
   --data_path "/path/to/data" \
   --output_path "/path/to/output" \
   --round_name "Round1" \
   --mode "De_novo_Results" \
   --model_name "YOUR_MODEL_NAME"
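
Before computing RMSD, it can help to check which generated files still lack a docked counterpart. The sketch below is a hypothetical helper, not part of the repository. It assumes the docked file sits next to the generated SDF, as in the directory structure described earlier; if `--output_path` points elsewhere, adjust the lookup accordingly.

```python
from pathlib import Path
from typing import List


def missing_docked_files(
    data_path: str,
    model_name: str,
    round_name: str = "Round1",
    mode: str = "De_novo_Results",
) -> List[Path]:
    """List generated SDF files without a sibling *_vina_docked.sdf file."""
    if mode == "De_novo_Results":
        # <UniprotID>/<Round>/De_novo_Results/<model>/*.sdf
        pattern = f"*/{round_name}/{mode}/{model_name}/*.sdf"
    else:
        # <UniprotID>/<Round>/Hit_to_Lead_Results/<series>/<model>/*.sdf
        pattern = f"*/{round_name}/{mode}/*/{model_name}/*.sdf"

    missing: List[Path] = []
    for sdf in sorted(Path(data_path).glob(pattern)):
        if sdf.name.endswith("_vina_docked.sdf"):
            continue  # this is already a docked file
        docked = sdf.with_name(sdf.stem + "_vina_docked.sdf")
        if not docked.exists():
            missing.append(sdf)
    return missing
```

An empty return value suggests every generated file for that model and mode has a docked pose available.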

📊 Visualizing Final Results (Notebook)

De novo hit rate (replace relative_dir with your own directory):
relative_dir/FigShow/Denovo_hit_recovery/Deonovo_repeats_hit_rate_boxplot.ipynb

De novo hit fraction:
relative_dir/FigShow/Denovo_hit_recovery/Deonovo_repeats_hit_fraction_boxplot.ipynb
