MolADT represents molecules as typed data for Bayesian modelling, feature generation, inverse design, validation, and viewing.
The core object is not just a string and not just a graph. It keeps:
- atoms
- coordinates
- bonding systems
- formal charge
- optional shell/orbital data
- an edge set derived from bonding-system member edges
Those fields can be inspected, mutated, scored, serialized, and shared with the Haskell repo.
| Task | Command | Output |
|---|---|---|
| View examples | `make view` | Browser viewer for built-in MolADT examples |
| FreeSolv benchmark | `make freesolv` | Bayesian GP benchmark for hydration free energy |
| FreeSolv 20 split | `make freesolv-20split` | Repeated-split check for the 30-feature full MolADT GP |
| FreeSolv ablation | `make freesolv-ablation` | Representation ablation for atom bag, SMILES adjacency graph, and full MolADT components |
| FreeSolv inverse design | `make inverse-design TARGET=-5.0` | 1,000 generated candidates, ranked by Bayesian credible score |
| QM9 benchmark | `make qm9long` | QM9 dipole moment mu benchmark using geometry features |
| Timing benchmark | `make timing` | ZINC representation timing comparison |
The default FreeSolv model is moladt_full30_rbf_gp, an exact Gaussian process
over a deliberately small, listable feature set:
- 30 features now come from MolADT-native descriptors: composition and polarity signals, explicit bonding-system counts, effective bond-order summaries, delocalised/multicentre channels, and short-range radial structure
- no Weisfeiler-Lehman tokens, hashed fingerprints, or large sparse feature vocabularies are used
- the empirical-Bayes fit optimizes the RBF kernel signal variance, lengthscale, and observation-noise variance, then gives posterior predictive means and standard deviations in kcal/mol
make freesolv writes the current single-split artifact under
results/freesolv/run_<timestamp>/ and removes older FreeSolv run_*
directories first. make freesolv-20split writes repeated-split uncertainty
outputs under results/freesolv_small_feature_gp/run_<timestamp>/ with the
same cleanup policy for that folder.
The feature manifest for each run is written to feature_manifest.csv.
The model is an exact GP with an RBF kernel over standardized small descriptors. This keeps the model interpretable and avoids a high-dimensional token vocabulary on the 642-molecule FreeSolv dataset.
The fitted kernel, with z(x) denoting the standardized descriptor vector of molecule x, is:

    k(x, x') = signal_variance * exp(-||z(x) - z(x')||^2 / (2 * lengthscale^2))

The signal variance, lengthscale, and observation-noise variance are optimized by empirical Bayes on the training split. The model then uses exact GP conditioning to produce a predictive mean and standard deviation for hydration free energy.
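The kernel and the conditioning step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, the hyperparameters are assumed to be already fitted, and a zero prior mean is assumed (targets centred beforehand).

```python
import numpy as np


def standardize(x_train, x_other):
    """Scale both matrices with the training-split mean and standard deviation."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std = np.where(std > 0.0, std, 1.0)  # leave constant columns unscaled
    return (x_train - mean) / std, (x_other - mean) / std


def rbf_kernel(za, zb, signal_variance, lengthscale):
    """k(x, x') = signal_variance * exp(-||z - z'||^2 / (2 * lengthscale^2))."""
    sq_dist = (
        np.sum(za**2, axis=1)[:, None]
        + np.sum(zb**2, axis=1)[None, :]
        - 2.0 * za @ zb.T
    )
    return signal_variance * np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * lengthscale**2))


def gp_predict(z_train, y_train, z_test, signal_variance, lengthscale, noise_variance):
    """Exact GP conditioning: predictive mean and standard deviation (zero prior mean)."""
    k_tt = rbf_kernel(z_train, z_train, signal_variance, lengthscale)
    k_st = rbf_kernel(z_test, z_train, signal_variance, lengthscale)
    chol = np.linalg.cholesky(k_tt + noise_variance * np.eye(len(z_train)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    # Variance of a noisy observation: prior + noise - part explained by training data.
    var = signal_variance + noise_variance - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

Far from every training point the kernel row vanishes, so the predictive mean falls back to the prior mean and the standard deviation approaches `sqrt(signal_variance + noise_variance)`.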
The earlier strong WL-token GP is also available as an additional model:

    FREESOLV_MODELS=moladt_full30_rbf_gp,gp_wl make freesolv
    FREESOLV_20SPLIT_MODEL=gp_wl make freesolv-20split

`gp_wl` uses parsed MolADT SDF molecules, bonding-system tokens, and Weisfeiler-Lehman atom labels that include formal charge plus shell/orbital occupancy.
The main FreeSolv evidence is the A/B/C small-feature ablation. It asks whether the MolADT multigraph feature panel improves an exact RBF GP beyond an atom bag and a graph-only baseline while keeping the whole feature set listable.
Run it with:

    make freesolv-ablation

The ladder is deliberately compact:
- A uses ten SMILES atom-count features only
- B uses twenty SMILES-decoded graph features: atom counts, bond-order counts, heavy-atom count, component count, cycle rank, and heavy-degree summaries
- C uses thirty MolADT-native descriptors: size, polarity, donor/acceptor counts, explicit bonding-system/effective-order summaries, ring and rotatable-bond structure, and short-range radial descriptors
- no WL tokens, hashed fingerprints, or large sparse feature vocabularies are used
The latest committed 20-split result before the multigraph feature redo was:
| Label | Variant | Meaning | Test RMSE |
|---|---|---|---|
| A | atom bag | 10 atom-count features | 1.971 +/- 0.567 |
| B | SMILES adjacency graph | 20 graph-only features | 1.791 +/- 0.505 |
| C | full MolADT | previous 20 graph features plus 10 MolADT descriptors | 1.308 +/- 0.461 |
Re-run `make freesolv-ablation` before citing a C-row RMSE for the current multigraph-first feature contract.

    make python-setup
    make python-parse
    make view

For the default FreeSolv model and inverse-design loop:

    make freesolv
    make inverse-design TARGET=-5.0

Setup artifacts stay inside this checkout:
- `make python-setup` creates `./.venv`
- `make python-cmdstan-install` creates `./.cmdstan` only when you explicitly run legacy Stan model overrides
MolADT gives Bayesian molecular work a typed support where the same molecule object is used by:
- priors
- proposal kernels
- validators
- feature maps
- posterior predictive scores
- exported candidates
That matters for inverse design because the FreeSolv task can:
- start from a FreeSolv-derived molecule prior
- grow typed candidates
- score each molecule with the unchanged GP posterior predictive distribution
- write the exact generated ADT back to disk
The FreeSolv model is posterior predictive:
- it gives a distribution over hydration free energy for a given molecule
- it is not itself a molecule-generating prior
The core molecule shape mirrors the sibling Haskell repo. In Haskell:

    data Molecule = Molecule
      { atoms :: Map AtomId Atom
      , systems :: [(SystemId, BondingSystem)]
      , smilesStereochemistry :: SmilesStereochemistry
      }

In Python:

    @dataclass(frozen=True, slots=True)
    class Molecule:
        atoms: Mapping[AtomId, Atom]
        systems: tuple[tuple[SystemId, BondingSystem], ...] = ()
        smiles_stereochemistry: SmilesStereochemistry = field(default_factory=SmilesStereochemistry)

The canonical Dietz bonding layer is `systems`:
- every edge is an instance of a `BondingSystem`
- a single covalent bond is a one-edge `2e` system
- a double covalent bond is a one-edge `4e` system
- a triple covalent bond is a one-edge `6e` system
- a quadruple covalent bond is a one-edge `8e` system
- an ionic contact is a one-edge `0e` system tagged `ionic`
- formal charge stays on atoms rather than on the edge
For example, sodium chloride stores:
- `Na#1` as `+1`
- `Cl#2` as `-1`
- the Na-Cl edge as one `ionic` system
Pretty printers and viewers derive display from the bonding systems:
- ordinary `2e`, `4e`, `6e`, and `8e` one-edge systems display as `single covalent`, `double covalent`, `triple covalent`, and `quadruple covalent`
- edge rows show the total electrons shared over each edge
- edge rows also show the effective order
- a benzene C-C edge displays as `shared=3e` and `order=1.50`
- that benzene value is `2e` from the unnamed one-edge system plus `1e/edge` from the six-electron `pi_ring`
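The benzene arithmetic can be reproduced with a small self-contained sketch. `SketchSystem` is a simplified stand-in, not the `moladt` class, and two points are assumptions made for illustration: each system's electrons are split evenly over its member edges, and the effective order is the shared electron count divided by two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchSystem:
    """Simplified stand-in for a bonding system (not the moladt class)."""
    shared_electrons: int
    member_edges: frozenset  # each edge is a frozenset of two atom ids
    tag: Optional[str] = None


def edge(a, b):
    """Undirected edge between two atom ids."""
    return frozenset((a, b))


def shared_on_edge(systems, target):
    """Sum the per-edge share of every system that contains the target edge."""
    return sum(
        s.shared_electrons / len(s.member_edges)
        for s in systems
        if target in s.member_edges
    )


ring = [edge(i, i % 6 + 1) for i in range(1, 7)]           # C1-C2 ... C6-C1
sigma = [SketchSystem(2, frozenset([e])) for e in ring]     # six one-edge 2e systems
pi_ring = SketchSystem(6, frozenset(ring), tag="pi_ring")   # 6e over six edges

shared = shared_on_edge(sigma + [pi_ring], edge(1, 2))
order = shared / 2  # assumption: effective order = shared electrons / 2
print(f"shared={shared:g}e order={order:.2f}")  # shared=3e order=1.50
```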
System identifiers are stable display IDs, not chemistry:
- checked examples and parsers put named or multi-edge systems first
- benzene uses `SystemId(1)` for `pi_ring`
- ordinary one-edge covalent systems are then numbered after it
Use same_molecule when you want equality modulo:
- atom-map ordering
- system-tuple ordering
- member-edge set ordering
- annotation tuple ordering
It keeps atom and system identifiers meaningful.
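One way to picture this kind of equality is to compare order-insensitive canonical forms. The sketch below does not use the real `same_molecule`; the plain-dict molecule shape and both function names are hypothetical. It also assumes that "keeps identifiers meaningful" means molecules that differ only by relabelled IDs compare as different.

```python
def canonical_form(atoms, systems, annotations=()):
    """Order-insensitive key that leaves atom and system ids untouched.

    atoms: dict of atom id -> element symbol
    systems: iterable of (system id, shared electrons, iterable of edges)
    annotations: iterable of hashable annotation records
    """
    atom_part = frozenset(atoms.items())
    system_part = frozenset(
        (system_id, electrons, frozenset(edges))
        for system_id, electrons, edges in systems
    )
    return atom_part, system_part, frozenset(annotations)


def same_molecule_sketch(left, right):
    """Compare two (atoms, systems) pairs modulo container ordering only."""
    return canonical_form(*left) == canonical_form(*right)


water = ({1: "O", 2: "H", 3: "H"}, [(1, 2, [(1, 2)]), (2, 2, [(1, 3)])])
shuffled = ({3: "H", 2: "H", 1: "O"}, [(2, 2, [(1, 3)]), (1, 2, [(1, 2)])])
relabelled = ({1: "O", 2: "H", 3: "H"}, [(2, 2, [(1, 2)]), (1, 2, [(1, 3)])])

print(same_molecule_sketch(water, shuffled))    # True
print(same_molecule_sketch(water, relabelled))  # False
```

Reordering the containers leaves the canonical form unchanged, while swapping the two system IDs does not.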
An atom carries element data, position, formal charge, and shell data:

    @dataclass(frozen=True, slots=True)
    class Atom:
        atom_id: AtomId
        attributes: ElementAttributes
        coordinate: Coordinate
        shells: Shells | None = None
        formal_charge: int = 0

`ElementAttributes` also carries the default shell data:
- simple atom builders can use `element_attributes(symbol)`
- atom builders can omit `shells`
- there is no separate shell lookup layer
- all 118 official elements are present for atomic number and mass; elements outside the audited shell subset keep `shells=None`
- audited default shell tables are regression-tested against neutral atomic electron counts and representative orbital occupancy signatures
Delocalised and multicentre bonding is represented explicitly:

    @dataclass(frozen=True, slots=True)
    class BondingSystem:
        shared_electrons: NonNegative
        member_atoms: frozenset[AtomId]
        member_edges: frozenset[Edge]
        tag: str | None = None

Examples:
| Molecule | What MolADT stores |
|---|---|
| Benzene | single covalent one-edge systems on each edge plus a six-electron pi_ring; each C-C edge displays as shared=3e |
| Diborane | four terminal B-H single covalent systems plus two explicit 3c-2e bridge systems; no direct B-B singleton |
| Ferrocene | Cp/C-H single covalent systems plus two Fe-centred Cp delocalised systems; Fe is +2 and one representative carbon on each Cp ring is -1 |
| Sodium chloride | Na+ and Cl- atoms plus one zero-electron ionic edge system |
| Morphine | every graph edge as a system, including a double covalent alkene edge, plus a phenyl delocalised system |
Ferrocene is a useful example because the metallocene structure stays explicit:
- it is not flattened into a string
- each Cp delocalised system spans five Cp ring C-C edges
- each Cp delocalised system also spans five Fe-C contacts to the central iron
- overlapping delocalised and covalent systems get separate dashed lanes in the viewer
- ordinary Cp/C-H and Cp/Cp covalent edges are still present as unnamed one-edge systems that display as `single covalent`
| System | Shared electrons | Member edges |
|---|---|---|
| `cp1_pi` | `6e` | the five C-C edges in the first Cp ring plus the five Fe-C contacts to that ring |
| `cp2_pi` | `6e` | the five C-C edges in the second Cp ring plus the five Fe-C contacts to that ring |
See the full expanded ADT in moladt/examples/ferrocene.py.
validate_molecule(molecule) is a representation validator, not a physical
chemistry oracle. It rejects malformed MolADT values before parsing, viewing,
feature extraction, or inverse-design scoring continue.
It checks:
- atom map keys match the atom IDs
- coordinates and element metadata are finite
- `SystemId`s are positive and unique
- bonding systems are non-empty
- bonding-system edges reference existing atoms
- cached member atoms match the member edges
- duplicate bonding systems are not present
- SMILES stereochemistry annotations only point at known atoms
- ordinary one-edge covalent systems with `2`, `4`, `6`, or `8` shared electrons are unnamed and display as single/double/triple/quadruple covalent bonds
- one-edge `0e` systems are tagged `ionic`
- `ionic` systems share zero electrons over exactly one edge
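A subset of the checks listed above can be sketched over plain tuples. This is illustrative only: `validate_systems_sketch` and its input shape are hypothetical and do not reflect the real `validate_molecule` signature, and checks such as coordinate finiteness and stereochemistry annotations are left out.

```python
def validate_systems_sketch(atom_ids, systems):
    """Return a list of problems found in the bonding systems.

    atom_ids: set of known atom ids
    systems: iterable of (system_id, shared_electrons, member_atoms, member_edges, tag)
    """
    errors = []
    seen_ids = set()
    for system_id, electrons, member_atoms, member_edges, tag in systems:
        if system_id <= 0 or system_id in seen_ids:
            errors.append(f"system {system_id}: id must be positive and unique")
        seen_ids.add(system_id)
        if not member_edges:
            errors.append(f"system {system_id}: no member edges")
        edge_atoms = {atom for e in member_edges for atom in e}
        if not edge_atoms <= atom_ids:
            errors.append(f"system {system_id}: edge references an unknown atom")
        if set(member_atoms) != edge_atoms:
            errors.append(f"system {system_id}: cached member atoms do not match edges")
        if tag == "ionic" and (electrons != 0 or len(member_edges) != 1):
            errors.append(f"system {system_id}: ionic needs 0e over exactly one edge")
        if len(member_edges) == 1 and electrons == 0 and tag != "ionic":
            errors.append(f"system {system_id}: one-edge 0e system must be tagged ionic")
    return errors


# Sodium chloride: one zero-electron ionic edge between atoms 1 and 2.
nacl_ok = validate_systems_sketch({1, 2}, [(1, 0, {1, 2}, [(1, 2)], "ionic")])
# Malformed: unknown atom 3, stale member-atom cache, and a 2e "ionic" system.
nacl_bad = validate_systems_sketch({1, 2}, [(1, 2, {1, 2}, [(1, 3)], "ionic")])
print(nacl_ok, len(nacl_bad))
```

Collecting every problem before reporting, rather than stopping at the first one, is a design choice of this sketch and not a statement about how the real validator behaves.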
It deliberately does not:
- prove a molecule is physically realistic
- choose protonation states
- infer missing hydrogens
- decide whether a delocalised system is chemically preferred
Task-specific checks can add those rules separately. The FreeSolv inverse-design task now runs this generic validator before its own FreeSolv geometry and generation-contract checks.

    make view
    make molecule-viewer VIEWER_EXAMPLES="benzene diborane ferrocene"
    make python-pretty-example EXAMPLE=morphine

`make view` opens seven built-in examples in one browser page.
The viewer renders:
- sodium chloride's charged `0e` ionic edge
- positive and negative formal charge as blue and red halos
- charge halo size and opacity proportional to formal-charge magnitude
- an extra halo boost for atoms in an ionic bonding system
- ionic edges as blue-to-red gradients between charged atoms
- ordinary covalent edges in dark grey
- single covalent bonds as one line
- double covalent bonds as two lines around the edge axis
- triple covalent bonds as three lines
- quadruple covalent bonds as four lines
- overlapping bonding systems as separate dashed lanes, including ordinary covalent versus delocalised overlap in ferrocene
- non-standard systems such as pi rings, bridge bonds, and coordination as coloured dashed overlays
The right panel lists:
- all bonding systems
- `single covalent`, `double covalent`, `triple covalent`, and `quadruple covalent` one-edge systems
- selected atom coordinates
- 3D edge lengths
- effective orders
- bonding systems and bond angles from molecule coordinates
Bonding-system text labels are not drawn over the molecule.
Viewer opening behavior:
- use `OPEN_VIEWER=1` to open generated viewer pages automatically
- viewer commands also print a portable `file://` URL
- if the operating system does not open a browser, open the reported URL manually

For example:

    OPEN_VIEWER=1 make molecule-viewer VIEWER_EXAMPLES=diborane
    OPEN_VIEWER=1 make inverse-design TARGET=-5.0

Run FreeSolv and inverse design together:

    make freesolv
    make inverse-design TARGET=-5.0
    make inverse-design-view

`make inverse-design`:
- uses the latest `results/freesolv/run_*` Bayesian GP artifact
- samples initial chains from the MolADT FreeSolv prior
- reweights that prior with the unchanged GP target likelihood
- generates 1,000 candidates
- writes the top 10 by Bayesian credible score
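The reweighting step can be pictured as importance weighting of prior samples by how well each candidate's GP predictive distribution explains the target. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the Gaussian density of the target under each predictive distribution is used as a stand-in for the repository's Bayesian credible score, whose exact definition is not given here.

```python
import math


def target_log_likelihood(target, mean, sd):
    """Log density of the target under a Gaussian predictive N(mean, sd^2)."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((target - mean) / sd) ** 2


def rank_candidates(candidates, target, top_k=10):
    """Weight prior samples by target likelihood and return the best top_k.

    candidates: list of (name, predictive_mean, predictive_sd) in kcal/mol
    """
    log_w = [target_log_likelihood(target, m, s) for _, m, s in candidates]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    scored = [(name, w / total) for (name, _, _), w in zip(candidates, weights)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical candidates: (label, predictive mean, predictive sd).
demo = [("a", -5.1, 0.5), ("b", -2.0, 0.5), ("c", -5.0, 3.0)]
ranking = rank_candidates(demo, target=-5.0)
print([name for name, _ in ranking])  # ['a', 'c', 'b']
```

Candidate `c` hits the target mean exactly but ranks below `a` because its wide predictive distribution puts less density on the target, which is the kind of uncertainty-aware ordering a posterior predictive score provides.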
Output notes:
- geometry values are audited in the outputs
- the sampler is not conditioned on physical plausibility
- generated reference outputs live in `results/inverse_design/reference/`
| Benchmark | Target | Main command |
|---|---|---|
| FreeSolv | hydration free energy | `make freesolv` |
| FreeSolv 20 split | hydration free energy | `make freesolv-20split` |
| FreeSolv inverse design | target hydration free energy | `make inverse-design TARGET=-5.0` |
| QM9 | dipole moment mu | `make qm9long` |
| ZINC timing | representation throughput | `make timing` |
Benchmark details are in Inference and benchmarks, Benchmarking models and features, Outputs, and results README.
MolADT JSON is the boundary shared by the Python and Haskell repos:
- Python writes MolADT JSON
- Haskell reads the same MolADT JSON shape
- atom IDs, system IDs, bonding systems, charges, and stereochemistry remain explicit

For example:

    ./.venv/bin/python - <<'PY' > morphine.moladt.json
    from moladt.examples import morphine_pretty
    from moladt.io import molecule_to_json
    print(molecule_to_json(morphine_pretty))
    PY
    stack run moladtbayes -- from-json ../MolADT-Bayes-Python/morphine.moladt.json

Python also writes standardized benchmark matrices that Haskell can consume:

    ./.venv/bin/python -m benchmarking.run_all freesolv
    stack run moladtbayes -- infer-benchmark freesolv_moladt_featurized mh:0.2

| Path | Purpose |
|---|---|
| `moladt/` | molecule ADT, validation, parsers, renderers, viewer, examples |
| `experiments/` | FreeSolv inverse design |
| `benchmarking/` | data processing, feature generation, benchmark model runs, reporting |
| `stan/` | Bayesian model definitions |
| `data/` | vendored and processed benchmark data |
| `results/` | committed reference outputs and local run artifacts |
| `docs/` | modular documentation |

