A lightweight experimental generative model for chemistry, built on a mini Qwen2-like backbone with a multi-horizon predictive loss for molecular SELFIES representations.
The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
A custom Qwen2-style language model, adapted for molecular generation:
- Qwen2-like Architecture: modernized backbone with efficient attention
- Multi-Token Prediction (MTP) Head: predicts multiple future tokens (1-3) in parallel
- Horizon Loss: weighted multi-horizon objectives for longer-term sequence coherence (see the sketch after this list)
- SELFIES-native Tokenizer: robust encoding of valid molecular structures with FastChemTokenizer
- Ranger21 Optimizer: adaptive optimizer with warmup/warmdown scheduling
- Gradient Checkpointing: trainable on smaller GPUs
- Streaming Dataset Loader: trainable with limited RAM
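As a rough illustration of the multi-horizon objective (a sketch, not the exact implementation in `ChemQ3MTP.py`), the loss can be written as a weighted sum of cross-entropy terms at offsets 1-3; the weights and head layout below are assumptions:

```python
import torch
import torch.nn.functional as F

def horizon_loss(logits_per_horizon, targets, weights=(1.0, 0.5, 0.25)):
    """Weighted multi-horizon cross-entropy (illustrative sketch).

    logits_per_horizon: list of tensors of shape (B, T, V), where entry k
                        predicts the token k+1 steps ahead.
    targets:            (B, T) ground-truth token ids.
    """
    total = 0.0
    for k, (logits, w) in enumerate(zip(logits_per_horizon, weights), start=1):
        # Align the prediction at position t with the token at position t+k.
        shifted_logits = logits[:, :-k, :]   # (B, T-k, V)
        shifted_targets = targets[:, k:]     # (B, T-k)
        ce = F.cross_entropy(
            shifted_logits.reshape(-1, shifted_logits.size(-1)),
            shifted_targets.reshape(-1),
        )
        total = total + w * ce
    return total / sum(weights)
```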
Experimental RL PPO-KL-ready features:
- Enhanced Reward Functions: validity, Lipinski, charge neutrality, diversity, complexity (illustrated in the sketch after this list)
- Curriculum Learning: gradually increases generation length during training
- Adaptive KL + Entropy Control: stabilizes reinforcement learning fine-tuning (a minimal controller sketch appears below)
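As a hedged sketch of how validity, Lipinski compliance, and charge neutrality could be scored with `selfies` and RDKit; the weights and thresholds here are illustrative, not the values used in `train_ppokl_selfies.py`, and the diversity/complexity terms are omitted:

```python
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdmolops

def composite_reward(selfies_str: str) -> float:
    """Illustrative reward: validity + Lipinski's rule of five + charge neutrality."""
    try:
        smiles = sf.decoder(selfies_str)
        mol = Chem.MolFromSmiles(smiles)
    except Exception:
        return 0.0
    if mol is None:
        return 0.0  # invalid molecule gets no reward

    reward = 1.0  # base reward for a chemically valid molecule

    # Lipinski's rule of five: +0.25 per satisfied criterion (illustrative weights).
    reward += 0.25 * (Descriptors.MolWt(mol) <= 500)
    reward += 0.25 * (Crippen.MolLogP(mol) <= 5)
    reward += 0.25 * (Lipinski.NumHDonors(mol) <= 5)
    reward += 0.25 * (Lipinski.NumHAcceptors(mol) <= 10)

    # Bonus for charge neutrality.
    if rdmolops.GetFormalCharge(mol) == 0:
        reward += 0.5
    return reward
```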
Target domain: chemistry & molecular generation (SELFIES).
The architecture is potentially generalizable to other sequence domains.
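For the adaptive KL control listed in the RL features above, one common pattern (a sketch of a possible scheme, not necessarily the one in `train_ppokl_selfies.py`) scales the KL penalty coefficient up or down depending on how far the measured KL drifts from a target:

```python
class AdaptiveKLController:
    """Adjusts the KL penalty coefficient to keep the policy near the reference model."""

    def __init__(self, init_beta: float = 0.1, target_kl: float = 0.05, adjust_rate: float = 1.5):
        self.beta = init_beta
        self.target_kl = target_kl
        self.adjust_rate = adjust_rate  # multiplicative adjustment factor

    def update(self, measured_kl: float) -> float:
        if measured_kl > 2.0 * self.target_kl:
            self.beta *= self.adjust_rate   # policy drifting too far: penalize more
        elif measured_kl < 0.5 * self.target_kl:
            self.beta /= self.adjust_rate   # policy too conservative: relax the penalty
        return self.beta

# Example: the PPO loss would then include a penalty term of beta * measured_kl.
kl_ctl = AdaptiveKLController()
for measured_kl in (0.02, 0.04, 0.12, 0.15):
    print(f"KL={measured_kl:.2f} -> beta={kl_ctl.update(measured_kl):.3f}")
```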
Description of the pre-trained (non-RL) model:
Model has 9,854,851 trainable parameters.
Input shape: torch.Size([2, 32])
Logits shape: torch.Size([2, 32, 782])
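As a sanity check of the figures above, something like the following sketch could reproduce the parameter count and logits shape; note that the `ChemQ3MTP` class name and constructor call are assumptions based on the repo's file names, not a documented API:

```python
import json
import torch
from ChemQ3MTP import ChemQ3MTP  # hypothetical import; see ChemQ3MTP.py for the actual class name

# Load the shipped configuration (field names may differ from what the constructor expects).
with open("config.json") as f:
    cfg = json.load(f)

model = ChemQ3MTP(**cfg)  # hypothetical constructor signature

# Count trainable parameters; the pretrained checkpoint reports 9,854,851.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")

# Dummy forward pass: 2 sequences of 32 tokens -> logits over the 782-token vocabulary.
input_ids = torch.randint(0, 782, (2, 32))
logits = model(input_ids)  # expected shape: torch.Size([2, 32, 782])
print(logits.shape)
```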
Trained on 14k samples from a combined curated dataset built from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023).
Batch size: 16 (x 4 gradient accumulation -> effective ~64)
Optimizer: Ranger21 (MADGRAD-Lookahead-AdaBelief with gradient centralization, linear warm-up (22%), gradient clipping, and L2 weight decay)
Learning rate: 5e-06 (warmup complete: lr set to 3.9e-06)
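A minimal sketch of wiring up Ranger21 with gradient accumulation as described above; the dummy model and loader are stand-ins so the snippet runs on its own, and the real run would swap in the ChemQ3MTP model, the streaming loader, and ~875 batches per epoch (14k samples / 16 per micro-batch):

```python
import torch
import torch.nn as nn
from ranger21 import Ranger21  # https://github.com/lessw2020/Ranger21

# Dummy stand-ins to keep the sketch self-contained; replace with the real model and loader.
model = nn.Linear(32, 782)
train_loader = [(torch.randn(16, 32), torch.randint(0, 782, (16,))) for _ in range(8)]

grad_accum_steps = 4  # micro-batch of 16 -> effective batch size ~64
optimizer = Ranger21(
    model.parameters(),
    lr=5e-6,
    num_epochs=1,
    num_batches_per_epoch=len(train_loader),
)

criterion = nn.CrossEntropyLoss()
optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = criterion(model(x), y)
    (loss / grad_accum_steps).backward()  # scale so accumulated gradients average out
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```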
Training log for E-1:

Warm-up phase:

| time | step | loss | eval_loss |
|------|------|------|-----------|
| 2025-09-21 11:21:20 | 3 | 26.5189 | |
| 2025-09-21 11:21:26 | 6 | 25.7779 | |

2nd phase with MTP:

| time | step | loss | eval_loss |
|------|------|------|-----------|
| 2025-09-21 11:52:07 | 140 | | |
| 2025-09-21 11:54:26 | 175 | 20.4449 | |
| 2025-09-21 11:54:41 | 175 | | 2.687195301055908 |
| 2025-09-21 12:05:43 | 350 | 10.405 | |
| 2025-09-21 12:05:58 | 350 | | 1.9965996742248535 |
| 2025-09-21 12:17:16 | 525 | 8.9447 | |
| 2025-09-21 12:17:31 | 525 | | 1.8333336114883423 |
| 2025-09-21 12:28:34 | 700 | 8.2911 | |
| 2025-09-21 12:28:49 | 700 | | 1.7291985750198364 |
| 2025-09-21 12:28:51 | 700 | | |
Hardware used for training: a laptop with an NVIDIA GeForce 930M GPU (2 GB VRAM), 12 GB RAM, a 2-core Intel i3, and an SSD.
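Given the constrained hardware above, the streaming dataset loader from the feature list is what keeps RAM usage flat; a rough sketch is below (the file name, tokenizer API, and max length are assumptions, not the exact loader in `train-withmtp.py`):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamingSelfiesDataset(IterableDataset):
    """Yields tokenized SELFIES one line at a time instead of loading the file into RAM."""

    def __init__(self, path: str, tokenizer, max_len: int = 128):
        self.path = path
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                ids = self.tokenizer.encode(line)[: self.max_len]  # assumed tokenizer API
                yield torch.tensor(ids, dtype=torch.long)

# Usage (assumed file name; a collate_fn that pads variable-length sequences is needed for batching):
# dataset = StreamingSelfiesDataset("data/train_selfies.txt", tokenizer)
# loader = DataLoader(dataset, batch_size=16, collate_fn=my_padding_collate)
```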
- Clone this repository
- Make sure you have the requirements installed
- Configure via `config.json`
- Run `python train-withmtp.py`
- A generation demo with rendered molecule images is included in `demo_test_mtpresult.ipynb` (a minimal decode-and-render sketch also follows this list)
- For the demo, please extract the `pretrained.7z` archive
- For testing the prototype PPO-KL RL fine-tuning, try running `train_ppokl_selfies.py` on the pretrained model (please make sure the model location is correct)
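If you just want to turn a generated SELFIES string into a rendered structure outside the notebook, a minimal sketch with `selfies` and RDKit looks like this (the example string is benzene, not a model output):

```python
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw

# Example SELFIES string (benzene); replace with a string sampled from the model.
selfies_str = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"

smiles = sf.decoder(selfies_str)
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    Draw.MolToImage(mol).save("generated_mol.png")
    print(f"Decoded SMILES: {smiles}")
else:
    print("Decoded SELFIES did not yield a valid molecule.")
```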
Tip: feel free to play around with the ChemQ3Model and its training loop/configs! The sample dataset is included so you can experiment with it; if you have better compute than mine, feel free to share your results in the Discussions.
- Adjust the FastChemTokenizer on new data
- Experimenting with early architecture
- Write initial readme
- Upload backbone and MTP train code
- Demo training on 14K data (only 1 epoch, adding another on this data led to slight overfitting)
- Upload the warmup model
- Tidy up and upload JupyterNotebook(s) train/demo along with sample data
- [ongoing] Review, clean, and test code
- Pretraining again after auditing/reviewing the base code
- Test RL code
- Train for 1000 steps for max token length = 80
- Upload RL-trained demo model
- Ablation studies
- Implement HF AutoModel-compatible modules if performance benefit(s) are confirmed
- Complete pretraining on all ~3M dataset (when possible)
- Chunk I
- Chunk II
- Chunk III
- Chunk IV
- Publish complete pretraining on GitHub and HF (if compatible)
- Complete RL fine-tuning on a verified reward system
```
ChemMiniQ3-HoriFIE/
├── ChemQ3MTP.py               # Custom model definition
├── train-withmtp.py           # Main trainer for MTP with curriculum training combining NTP with MTP
├── config.json                # Configuration for model definition and training
├── FastChemTokenizer.py       # FastChemTokenizer module
├── train_ppokl_selfies.py     # Prototype PPO-KL RL training script
├── README.md
├── requirements.txt           # A conda env is recommended; other versions may work, please report any bugs
├── selftok_core               # FastChemTokenizer SELFIES core used for this model (try _wtails to experiment)
├── pretrained/
│   ├── sample-e1/             # Pre-trained weights on the sample 14k dataset, 1st epoch
│   └── sample-RL/
├── demo_test_mtpresult.ipynb  # Demo notebook for generating SELFIES with the pretrained model
├── log_train.txt              # Pre-training console output from the MTP run
└── data/                      # 14k samples from the combined dataset
```
This project is a learning experiment, and all contributions are welcome!
- Have a better way to implement the methods?
- Want to add evaluation metrics?
- Found a bug? Please open an issue!
Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
This is NOT a production model.
- Built during late-night prototyping sessions
- Not thoroughly validated or benchmarked due to compute constraints
- Some components are heuristic and unproven
- May crash, overfit, or generate nonsense (especially outside molecular data)
- I'm still learning PyTorch, attention mechanisms, and transformer internals
Use this code to learn and experiment, not to deploy.
MIT
Based and Inspired by:
- https://github.com/KellerJordan/modded-nanogpt
- https://huggingface.co/docs/transformers/en/model_doc/t5gemma
- https://github.com/aspuru-guzik-group/selfies/
- https://github.com/lessw2020/Ranger21
- https://huggingface.co/gbyuvd/chemfie-gpt-experiment-1
- https://huggingface.co/gbyuvd/bionat-selfies-gen-tokenizer-wordlevel
- An earlier ChemZiRo-GPT experiment adding RoPE, GQA, MTP, and RMSProp to a GPT-2 backbone
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}

@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}

@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChemBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}

@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}

@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}