Joonyoung F. Joung*, Mun Hong Fong*, Nicholas Casetti, Jordan P. Liles, Ne S. Dassanayake, Connor W. Coley
NOW published in Nature!
“Electron flow matching for generative reaction mechanism prediction.” Nature 645, 115–123 (2025).
DOI: 10.1038/s41586-025-09426-9
FlowER uses flow matching to model a chemical reaction as a process of electron redistribution, which aligns conceptually with arrow-pushing formalisms. It aims to capture the probabilistic nature of reactions while conserving mass, where multiple outcomes are reached through branching mechanistic networks that evolve over time.
Ubuntu: >= 16.04
conda: >= 4.0
GPU: at least 25GB Memory with CUDA >= 12.2
$ conda create -n flower python=3.10
$ conda activate flower
$ pip install -r requirements.txt
FlowER is trained on a combination of a subset of USPTO-FULL (Dai et al.), RmechDB and PmechDB (Baldi et al.).
To retrain or reproduce FlowER, download data.zip and checkpoints.zip from this link, unzip them, and place the resulting folders under FlowER/
The data folder is structured as data/{DATASET_NAME}/{train,val,test}.txt, and the checkpoints folder as checkpoints/{DATASET_NAME}/{EXPERIMENT_NAME}/model.{STEP}_{IDX}.pt
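As a quick sanity check on the layout described above, the sketch below builds the expected paths and reports missing data files. The helper names (`data_files`, `checkpoint_file`, `missing_data_files`) are hypothetical and not part of FlowER; only the folder pattern comes from this README.

```python
from pathlib import Path

def data_files(root, dataset_name):
    """Return the expected train/val/test files for one dataset."""
    base = Path(root) / "data" / dataset_name
    return {split: base / f"{split}.txt" for split in ("train", "val", "test")}

def checkpoint_file(root, dataset_name, experiment_name, step, idx):
    """Return the expected path of one checkpoint, model.{STEP}_{IDX}.pt."""
    return (Path(root) / "checkpoints" / dataset_name / experiment_name
            / f"model.{step}_{idx}.pt")

def missing_data_files(root, dataset_name):
    """List the expected data files that are not present on disk."""
    return [p for p in data_files(root, dataset_name).values() if not p.is_file()]

if __name__ == "__main__":
    # Assumption: run from inside FlowER/ with a dataset folder named "flower_dataset".
    for path in missing_data_files(".", "flower_dataset"):
        print(f"missing: {path}")
```

Running it from the FlowER/ directory prints one line per missing file and nothing when the data folder is complete.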
The FlowER workflow revolves mainly around two files: run_FlowER_large_(old|new)Data.sh and settings.py.
The main idea is to comment lines in or out with # to turn configurations on or off when training, validating, or running inference with FlowER.
run_FlowER_large_(old|new)Data.sh lets you specify the data folder name, experiment name, and GPU configuration, and choose which scripts to run.
settings.py lets you specify different configurations for different workflows.
Ensure that the data/ folder is populated accordingly and that run_FlowER_large_(old|new)Data.sh points to the correct files.
export TRAIN_FILE=$PWD/data/$DATA_NAME/train.txt
export VAL_FILE=$PWD/data/$DATA_NAME/val.txt
Check that run_FlowER_large_(old|new)Data.sh has scripts/train.sh uncommented.
$ sh run_FlowER_large_(old|new)Data.sh
You can validate FlowER on the validation set. To do so, ensure these lines are uncommented in settings.py.
# validation #
do_validate = True
steps2validate = ["1050000", "1320000", "1500000", "930000", "1020000"]
steps2validate lists the checkpoint steps to validate, chosen based on the training logs in the /logs folder.
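To illustrate how the entries of steps2validate relate to checkpoint files, here is a minimal sketch that matches step strings against the model.{STEP}_{IDX}.pt naming pattern. The function `select_checkpoints` is a hypothetical helper for illustration, not FlowER's own selection logic.

```python
import re

# Pattern follows the checkpoint naming used in this README: model.{STEP}_{IDX}.pt
CKPT_RE = re.compile(r"^model\.(\d+)_(\d+)\.pt$")

def select_checkpoints(filenames, steps2validate):
    """Return filenames whose STEP is in steps2validate, in the order of that list."""
    by_step = {}
    for name in filenames:
        match = CKPT_RE.match(name)
        if match:
            by_step.setdefault(match.group(1), []).append(name)
    selected = []
    for step in steps2validate:
        selected.extend(sorted(by_step.get(step, [])))
    return selected
```

A step that has no matching file is skipped silently here; a stricter version could raise an error instead.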
Check that run_FlowER_large_(old|new)Data.sh has scripts/eval.sh uncommented.
$ sh run_FlowER_large_(old|new)Data.sh
You can evaluate FlowER on the test set. To do so, specify your checkpoint at MODEL_NAME in settings.py and ensure these lines are uncommented.
# inference #
do_validate = False
Check that run_FlowER_large_(old|new)Data.sh has scripts/eval.sh uncommented.
$ sh run_FlowER_large_(old|new)Data.sh
FlowER takes atom-mapped reactions as input for training, validation and testing. The elementary reaction steps that FlowER is trained on can be grouped together by sequence index during evaluation when running sequence_evaluation.py.
Each elementary reaction step follows the format mapped_reaction|sequence_idx. Examples are as follows:
[Cl:1][S:2]([Cl:3])=[O:4].[Cl:5][C:6]1=[N:7][S:8][C:9]([C:10](=[O:11])[O:12][H:15])=[C:13]1[Cl:14]>>[Cl:1][S:2]([Cl:3])([O-:4])[O:11][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15]|11831
[Cl:1][S:2]([Cl:3])([O-:4])[O:11][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15]>>[Cl-:1].[S:2]([Cl:3])(=[O:4])[O:11][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15]|11831
[Cl-:1].[S:2]([Cl:3])(=[O:4])[O:11][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15]>>[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)([O:11][S:2]([Cl:3])=[O:4])[O:12][H:15]|11831
[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)([O:11][S:2]([Cl:3])=[O:4])[O:12][H:15]>>[Cl-:3].[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15].[S:2](=[O:4])=[O:11]|11831
[Cl-:3].[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O+:12][H:15].[S:2](=[O:4])=[O:11]>>[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O:12].[Cl:3][H:15].[S:2](=[O:4])=[O:11]|11831
[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O:12].[Cl:3][H:15].[S:2](=[O:4])=[O:11]>>[Cl:1][C:10]([C:9]1=[C:13]([Cl:14])[C:6]([Cl:5])=[N:7][S:8]1)=[O:12].[Cl:3][H:15].[S:2](=[O:4])=[O:11]|11831
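The mapped_reaction|sequence_idx format can be read with plain string handling. The sketch below parses each line, groups steps by sequence index, and checks that both sides of a step carry the same atom-map numbers. All function names are hypothetical, and this is not the logic of sequence_evaluation.py; the regular expression assumes every mapped atom is written in brackets as in the examples above.

```python
import re
from collections import defaultdict

MAP_RE = re.compile(r":(\d+)\]")  # atom-map number inside a bracket atom, e.g. [C:10]

def parse_step(line):
    """Split 'reactants>>products|sequence_idx' into its three parts."""
    reaction, seq_idx = line.strip().rsplit("|", 1)
    reactants, products = reaction.split(">>")
    return reactants, products, int(seq_idx)

def atom_maps(smiles):
    """Return the sorted atom-map numbers found in a mapped SMILES string."""
    return sorted(int(n) for n in MAP_RE.findall(smiles))

def maps_conserved(reactants, products):
    """True if both sides carry exactly the same atom-map numbers."""
    return atom_maps(reactants) == atom_maps(products)

def group_by_sequence(lines):
    """Group elementary steps by sequence index, keeping file order."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            reactants, products, seq_idx = parse_step(line)
            groups[seq_idx].append((reactants, products))
    return dict(groups)

if __name__ == "__main__":
    # Toy line written for this sketch; it is not taken from the dataset.
    toy = "[O-:1][H:2].[H+:3]>>[H:3][O:1][H:2]|7"
    reactants, products, idx = parse_step(toy)
    print(idx, maps_conserved(reactants, products))
```

Comparing atom-map numbers is only a cheap proxy for mass conservation; it does not verify element identities or electron counts.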
Train/Valid/Test hyperparameters
emb_dim - Embedding dimension size of atom embeddings
enc_num_layers - Number of transformer layers to be applied
enc_heads - Number of attention heads
enc_filter_size - Dimension of the feed-forward network in the Transformer block
(attn)_dropout - Dropout for the Transformer block (0.0 empirically works well)
sigma - Standard deviation of the Gaussian noise added for reparameterizing the bond-electron (BE) matrix
lr - Learning rate for training (NoamLR)
warmup - Warmup steps before LR decay (NoamLR)
clip_norm - Gradient clipping threshold to prevent exploding gradients
beta1, beta2 - Adam optimizer's momentum terms
eps - Adam optimizer's denominator term for numerical stability
weight_decay - L2 regularization strength to prevent overfitting
rbf_low - Lowest value of the Radial Basis Function (RBF) centers
rbf_high - Highest value of the RBF centers
rbf_gap - Granularity (increment) of the RBF centers
do_validate - True to trigger validation, False to trigger testing
steps2validate - List of checkpoints to run FlowER on for validation
sample_size - Number of samples FlowER generates for evaluation
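To make the three RBF settings concrete, the sketch below shows one common way such parameters define a Gaussian expansion: evenly spaced centers from rbf_low to rbf_high in steps of rbf_gap. This is an assumption for illustration only. Whether FlowER includes the upper endpoint, and what kernel width it uses, is not stated in this README; here the width is simply set to rbf_gap.

```python
import math

def rbf_centers(rbf_low, rbf_high, rbf_gap):
    """Evenly spaced centers from rbf_low to rbf_high (endpoint included by assumption)."""
    count = int(round((rbf_high - rbf_low) / rbf_gap)) + 1
    return [rbf_low + i * rbf_gap for i in range(count)]

def rbf_expand(x, centers, rbf_gap):
    """Gaussian RBF features of a scalar x, one per center (width rbf_gap is an assumption)."""
    return [math.exp(-((x - c) ** 2) / rbf_gap) for c in centers]

if __name__ == "__main__":
    centers = rbf_centers(0.0, 2.0, 0.5)
    print(centers)
    print(rbf_expand(1.0, centers, 0.5))
```

A smaller rbf_gap gives more centers and therefore a finer, higher-dimensional encoding of the same scalar range.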
FlowER mainly uses beam search to search for plausible mechanistic pathways. Users can enter their SMILES in data/flower_dataset/beam.txt.
Ensure that in run_FlowER_large_(old|new)Data.sh, the TEST_FILE variable points to the correct file.
export TEST_FILE=$PWD/data/$DATA_NAME/beam.txt
Ensure that in settings.py, the beam search configuration is uncommented and specified accordingly.
test_path = f"data/{DATA_NAME}/beam.txt"
# beam-search #
beam_size = 5
nbest = 3
max_depth = 15
chunk_size = 50
Check that run_FlowER_large_(old|new)Data.sh has scripts/search.sh or scripts/search_multiGPU.sh uncommented.
$ sh run_FlowER_large_(old|new)Data.sh
Visualize your route at examples/vis_network.ipynb
FlowER takes a non-atom-mapped reaction as input for beam search, which can be specified in beam.txt.
Each line in the file follows the format reactant>>product1|product2|..., where multiple major and minor products can be specified, separated by |.
CC(=O)CC(=O)C(F)(F)F.NNc1cccc(Br)c1>>Cc1cc(C(F)(F)F)n(-c2cccc(Br)c2)n1
CC(=O)CC(=O)C(F)(F)F.NNc1cccc(Br)c1>>Cc1cc(C(F)(F)F)n(-c2cccc(Br)c2)n1|Cc1cc(C(F)(F)F)nn1-c1cccc(Br)c1
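A line in this format can be split into reactants and target products with a few string operations. The sketch below is illustrative only; `parse_beam_line` is a hypothetical helper, not a FlowER function. It assumes reactant molecules are separated by "." as in standard SMILES.

```python
def parse_beam_line(line):
    """Split 'reactants>>product1|product2|...' into reactant and product lists."""
    reactants, _, products = line.strip().partition(">>")
    if not reactants or not products:
        raise ValueError(f"expected 'reactants>>products', got: {line!r}")
    return reactants.split("."), products.split("|")

if __name__ == "__main__":
    example = ("CC(=O)CC(=O)C(F)(F)F.NNc1cccc(Br)c1>>"
               "Cc1cc(C(F)(F)F)n(-c2cccc(Br)c2)n1|Cc1cc(C(F)(F)F)nn1-c1cccc(Br)c1")
    reactants, products = parse_beam_line(example)
    print(len(reactants), "reactants,", len(products), "target products")
```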
Search hyperparameters
beam_size - Size of the top-k selection of candidates (based on cumulative probability) to be expanded further. Increasing this makes the overall search more comprehensive, at the cost of slower runtime.
nbest - Cut-off size of the top-k outcomes generated by FlowER after the expansion. This cutoff filters unlikely outcomes out of the selection.
sample_size - Number of samples FlowER generates for evaluation
max_depth - Maximum depth the beam search should explore.
chunk_size - Number of reactant sets on which beam search is run concurrently.
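The interplay of beam_size, nbest and max_depth can be seen in a generic beam search. The sketch below is a textbook-style version over a toy graph, not FlowER's implementation: the `expand` callback is a stand-in for the model's sampled outcomes, and the toy states and probabilities are made up for illustration.

```python
import math

def beam_search(start, expand, is_goal, beam_size=5, nbest=3, max_depth=15):
    """Generic beam search.

    expand(state) returns (next_state, probability) pairs supplied by the caller.
    Returns finished paths as (cumulative_log_prob, [states]), best first.
    """
    beam = [(0.0, [start])]
    finished = []
    for _ in range(max_depth):
        candidates = []
        for logp, path in beam:
            # Keep only the nbest most probable outcomes of this expansion.
            outcomes = sorted(expand(path[-1]), key=lambda o: o[1], reverse=True)[:nbest]
            for state, prob in outcomes:
                if prob <= 0.0:
                    continue
                entry = (logp + math.log(prob), path + [state])
                (finished if is_goal(state) else candidates).append(entry)
        if not candidates:
            break
        # Keep the beam_size candidates with the highest cumulative probability.
        candidates.sort(key=lambda e: e[0], reverse=True)
        beam = candidates[:beam_size]
    finished.sort(key=lambda e: e[0], reverse=True)
    return finished

# Toy transition table (hypothetical): "P" plays the role of the target product.
TOY = {
    "A": [("B", 0.6), ("C", 0.3), ("D", 0.1)],
    "B": [("P", 0.9), ("C", 0.1)],
    "C": [("P", 0.5), ("D", 0.5)],
    "D": [],
}

if __name__ == "__main__":
    routes = beam_search("A", lambda s: TOY.get(s, []), lambda s: s == "P",
                         beam_size=2, nbest=2, max_depth=3)
    for logp, path in routes:
        print(round(math.exp(logp), 3), " -> ".join(path))
```

Lowering nbest prunes low-probability outcomes at each expansion, while lowering beam_size limits how many partial pathways survive to the next depth; both trade coverage for speed.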
@article{joung2025electron,
title={Electron flow matching for generative reaction mechanism prediction obeying conservation laws},
author={Joung, Joonyoung F and Fong, Mun Hong and Casetti, Nicholas and Liles, Jordan P and Dassanayake, Ne S and Coley, Connor W},
journal={arXiv preprint arXiv:2502.12979},
year={2025}
}