
oliverjgoldstein/MolADT-Bayes-Python


MolADT-Bayes-Python

MolADT represents molecules as typed data for Bayesian modelling, feature generation, inverse design, validation, and viewing.

The core object is not just a string and not just a graph. It keeps:

  • atoms
  • coordinates
  • bonding systems
  • formal charge
  • optional shell/orbital data
  • an edge set derived from bonding-system member edges

Those fields can be inspected, mutated, scored, serialized, and shared with the Haskell repo.

[Figures: Ferrocene in the MolADT viewer; Diborane in the MolADT viewer]

Quickstart · Representation · Parsing · Validator · Examples · Equality · CLI · Benchmarking Models · Benchmarks · Outputs

What It Does

| Task | Command | Output |
| --- | --- | --- |
| View examples | make view | Browser viewer for built-in MolADT examples |
| FreeSolv benchmark | make freesolv | Bayesian GP benchmark for hydration free energy |
| FreeSolv 20 split | make freesolv-20split | Repeated-split check for the 30-feature full MolADT GP |
| FreeSolv ablation | make freesolv-ablation | Representation ablation for atom bag, SMILES adjacency graph, and full MolADT components |
| FreeSolv inverse design | make inverse-design TARGET=-5.0 | 1,000 generated candidates, ranked by Bayesian credible score |
| QM9 benchmark | make qm9long | QM9 dipole moment mu benchmark using geometry features |
| Timing benchmark | make timing | ZINC representation timing comparison |

FreeSolv GP Model

The default FreeSolv model is moladt_full30_rbf_gp, an exact Gaussian process over a deliberately small, listable feature set:

  • 30 features now come from MolADT-native descriptors: composition and polarity signals, explicit bonding-system counts, effective bond-order summaries, delocalised/multicentre channels, and short-range radial structure
  • no Weisfeiler-Lehman tokens, hashed fingerprints, or large sparse feature vocabularies are used
  • the empirical-Bayes fit optimizes the RBF kernel signal variance, lengthscale, and observation-noise variance, then gives posterior predictive means and standard deviations in kcal/mol

make freesolv writes the current single-split artifact under results/freesolv/run_<timestamp>/ and removes older FreeSolv run_* directories first. make freesolv-20split writes repeated-split uncertainty outputs under results/freesolv_small_feature_gp/run_<timestamp>/ with the same cleanup policy for that folder.

Feature Map

The feature manifest for each run is written to feature_manifest.csv.

Kernel Choice

The kernel is an RBF GP over standardized small descriptors. This keeps the model interpretable and avoids a high-dimensional token vocabulary on the 642-molecule FreeSolv dataset.

The fitted kernel is:

k(x, x') =
  signal_variance * exp(-||z(x) - z(x')||^2 / (2 * lengthscale^2))

The signal variance, lengthscale, and observation-noise variance are optimized by empirical Bayes on the training split. The model then uses exact GP conditioning to produce a predictive mean and standard deviation for hydration free energy.
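The kernel formula and the exact conditioning step can be sketched in plain Python. This is a minimal illustration and not the repository's implementation: the function names are hypothetical, the targets are assumed to be centred (zero prior mean), the hyperparameters are passed in rather than optimized, and the reported standard deviation is assumed to include the observation noise.

```python
import math

def rbf_kernel(z1, z2, signal_variance, lengthscale):
    """k(x, x') = signal_variance * exp(-||z - z'||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return signal_variance * math.exp(-sq_dist / (2.0 * lengthscale ** 2))

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def solve_lower(lower, rhs):
    """Forward substitution: solve L y = rhs."""
    y = []
    for i, row in enumerate(lower):
        y.append((rhs[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def solve_upper_from_lower(lower, rhs):
    """Back substitution: solve L^T x = rhs."""
    n = len(lower)
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (rhs[i] - tail) / lower[i][i]
    return x

def gp_predict(train_z, train_y, test_z, signal_variance, lengthscale, noise_variance):
    """Posterior predictive mean and standard deviation at one test point."""
    n = len(train_z)
    # Gram matrix of the standardized training descriptors plus noise on the diagonal.
    gram = [
        [
            rbf_kernel(train_z[i], train_z[j], signal_variance, lengthscale)
            + (noise_variance if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    lower = cholesky(gram)
    alpha = solve_upper_from_lower(lower, solve_lower(lower, train_y))
    k_star = [rbf_kernel(z, test_z, signal_variance, lengthscale) for z in train_z]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(lower, k_star)
    # Assumption: the predictive variance includes the observation noise.
    variance = signal_variance + noise_variance - sum(x * x for x in v)
    return mean, math.sqrt(max(variance, 0.0))
```

Far from every training point the kernel values vanish, so the mean falls back to the prior mean and the standard deviation approaches sqrt(signal_variance + noise_variance).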

The earlier strong WL-token GP is also available as an additional model:

FREESOLV_MODELS=moladt_full30_rbf_gp,gp_wl make freesolv
FREESOLV_20SPLIT_MODEL=gp_wl make freesolv-20split

gp_wl uses parsed MolADT SDF molecules, bonding-system tokens, and Weisfeiler-Lehman atom labels that include formal charge plus shell/orbital occupancy.

Representation Ablation

The main FreeSolv evidence is the A/B/C small-feature ablation. It asks whether the MolADT multigraph feature panel improves an exact RBF GP beyond an atom bag and a graph-only baseline while keeping the whole feature set listable.

Run it with:

make freesolv-ablation

The ladder is deliberately compact:

  • A uses ten SMILES atom-count features only
  • B uses twenty SMILES-decoded graph features: atom counts, bond-order counts, heavy-atom count, component count, cycle rank, and heavy-degree summaries
  • C uses thirty MolADT-native descriptors: size, polarity, donor/acceptor counts, explicit bonding-system/effective-order summaries, ring and rotatable-bond structure, and short-range radial descriptors
  • no WL tokens, hashed fingerprints, or large sparse feature vocabularies are used

The latest committed 20-split result before the multigraph feature redo was:

| Label | Variant | Meaning | Test RMSE |
| --- | --- | --- | --- |
| A | atom bag | 10 atom-count features | 1.971 +/- 0.567 |
| B | SMILES adjacency graph | 20 graph-only features | 1.791 +/- 0.505 |
| C | full MolADT | previous 20 graph features plus 10 MolADT descriptors | 1.308 +/- 0.461 |

Re-run make freesolv-ablation before citing a C-row RMSE for the current multigraph-first feature contract.

Start

make python-setup
make python-parse
make view

For the default FreeSolv model and inverse-design loop:

make freesolv
make inverse-design TARGET=-5.0

Setup artifacts stay inside this checkout:

  • make python-setup creates ./.venv
  • make python-cmdstan-install creates ./.cmdstan only when you explicitly run legacy Stan model overrides

Why The ADT Matters

MolADT gives Bayesian molecular work a typed support where the same molecule object is used by:

  • priors
  • proposal kernels
  • validators
  • feature maps
  • posterior predictive scores
  • exported candidates

That matters for inverse design because the FreeSolv task can:

  • start from a FreeSolv-derived molecule prior
  • grow typed candidates
  • score each molecule with the unchanged GP posterior predictive distribution
  • write the exact generated ADT back to disk

The FreeSolv model is posterior predictive:

  • it gives a distribution over hydration free energy for a given molecule
  • it is not itself a molecule-generating prior

The Representation

The core molecule shape mirrors the sibling Haskell repo:

Haskell:

data Molecule = Molecule
  { atoms                 :: Map AtomId Atom
  , systems               :: [(SystemId, BondingSystem)]
  , smilesStereochemistry :: SmilesStereochemistry
  }

Python:

@dataclass(frozen=True, slots=True)
class Molecule:
    atoms: Mapping[AtomId, Atom]
    systems: tuple[tuple[SystemId, BondingSystem], ...] = ()
    smiles_stereochemistry: SmilesStereochemistry = field(default_factory=SmilesStereochemistry)

The canonical Dietz bonding layer is systems:

  • every edge is an instance of a BondingSystem
  • a single covalent bond is a one-edge 2e system
  • a double covalent bond is a one-edge 4e system
  • a triple covalent bond is a one-edge 6e system
  • a quadruple covalent bond is a one-edge 8e system
  • an ionic contact is a one-edge 0e system tagged ionic
  • formal charge stays on atoms rather than on the edge

For example, sodium chloride stores:

  • Na#1 as +1
  • Cl#2 as -1
  • the Na-Cl edge as one ionic system

Pretty printers and viewers derive display from the bonding systems:

  • ordinary 2e, 4e, 6e, and 8e one-edge systems display as single covalent, double covalent, triple covalent, and quadruple covalent
  • edge rows show the total electrons shared over each edge
  • edge rows also show the effective order
  • a benzene C-C edge displays as shared=3e and order=1.50
  • that benzene value is 2e from the unnamed one-edge system plus 1e/edge from the six-electron pi_ring
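The benzene arithmetic above can be reproduced with a small stand-in. The sketch assumes that each system's electrons are split evenly over its member edges and that the effective order is the shared electron count divided by two, which matches the 3e and 1.50 figures quoted here. SketchSystem, edge, shared_electrons_on, and effective_order are hypothetical names, not the moladt API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SketchSystem:
    """Simplified stand-in for a bonding system (hypothetical, not the moladt class)."""
    shared_electrons: int
    member_edges: frozenset
    tag: Optional[str] = None

def edge(a, b):
    """An undirected edge is an unordered pair of atom ids."""
    return frozenset((a, b))

def shared_electrons_on(target_edge, systems):
    """Sum each system's electrons per member edge over systems containing the edge."""
    return sum(
        system.shared_electrons / len(system.member_edges)
        for system in systems
        if target_edge in system.member_edges
    )

def effective_order(target_edge, systems):
    """Assumed convention: two shared electrons correspond to order 1."""
    return shared_electrons_on(target_edge, systems) / 2.0

# Benzene ring carbons 1..6: six unnamed one-edge 2e systems plus one 6e pi_ring.
ring_edges = [edge(i, i % 6 + 1) for i in range(1, 7)]
benzene = [SketchSystem(2, frozenset([e])) for e in ring_edges]
benzene.append(SketchSystem(6, frozenset(ring_edges), tag="pi_ring"))

print(shared_electrons_on(edge(1, 2), benzene))  # 3.0
print(effective_order(edge(1, 2), benzene))      # 1.5
```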

System identifiers are stable display IDs, not chemistry:

  • checked examples and parsers put named or multi-edge systems first
  • benzene uses SystemId(1) for pi_ring
  • ordinary one-edge covalent systems are then numbered after it

Use same_molecule when you want equality modulo:

  • atom-map ordering
  • system-tuple ordering
  • member-edge set ordering
  • annotation tuple ordering

It keeps atom and system identifiers meaningful.
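One way to picture this kind of equality is to canonicalize the orderings and then compare. The sketch below is an assumption about the idea, not the real same_molecule: molecules are plain dicts, systems are (system_id, shared_electrons, member_edges, tag) tuples, and stereochemistry annotations are left out. Because identifiers are compared as-is, swapping two system IDs makes the molecules differ.

```python
def canonical_systems(systems):
    """Sort edges inside each system, then sort the systems themselves."""
    canonical = []
    for system_id, shared_electrons, member_edges, tag in systems:
        edges = tuple(sorted(tuple(sorted(pair)) for pair in member_edges))
        canonical.append((system_id, shared_electrons, edges, tag or ""))
    return sorted(canonical)

def same_molecule_sketch(left, right):
    """Equality modulo ordering; atom and system identifiers must still match."""
    return (
        dict(left["atoms"]) == dict(right["atoms"])
        and canonical_systems(left["systems"]) == canonical_systems(right["systems"])
    )

water = {
    "atoms": {1: "O", 2: "H", 3: "H"},
    "systems": [(1, 2, [(1, 2)], None), (2, 2, [(1, 3)], None)],
}
# Same content with the atom map, system list, and edge endpoints reordered.
reordered = {
    "atoms": {3: "H", 1: "O", 2: "H"},
    "systems": [(2, 2, [(3, 1)], None), (1, 2, [(2, 1)], None)],
}
# Same graph, but the two system identifiers are swapped.
relabelled = {
    "atoms": {1: "O", 2: "H", 3: "H"},
    "systems": [(2, 2, [(1, 2)], None), (1, 2, [(1, 3)], None)],
}

print(same_molecule_sketch(water, reordered))   # True
print(same_molecule_sketch(water, relabelled))  # False
```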

An atom carries element data, position, formal charge, and shell data:

@dataclass(frozen=True, slots=True)
class Atom:
    atom_id: AtomId
    attributes: ElementAttributes
    coordinate: Coordinate
    shells: Shells | None = None
    formal_charge: int = 0

ElementAttributes also carries the default shell data:

  • simple atom builders can use element_attributes(symbol)
  • atom builders can omit shells
  • there is no separate shell lookup layer
  • all 118 official elements are present for atomic number and mass; elements outside the audited shell subset keep shells=None
  • audited default shell tables are regression-tested against neutral atomic electron counts and representative orbital occupancy signatures

Delocalised and multicentre bonding is represented explicitly:

@dataclass(frozen=True, slots=True)
class BondingSystem:
    shared_electrons: NonNegative
    member_atoms: frozenset[AtomId]
    member_edges: frozenset[Edge]
    tag: str | None = None

Examples:

| Molecule | What MolADT stores |
| --- | --- |
| Benzene | single covalent one-edge systems on each edge plus a six-electron pi_ring; each C-C edge displays as shared=3e |
| Diborane | four terminal B-H single covalent systems plus two explicit 3c-2e bridge systems; no direct B-B singleton |
| Ferrocene | Cp/C-H single covalent systems plus two Fe-centred Cp delocalised systems; Fe is +2 and one representative carbon on each Cp ring is -1 |
| Sodium chloride | Na+ and Cl- atoms plus one zero-electron ionic edge system |
| Morphine | every graph edge as a system, including a double covalent alkene edge, plus a phenyl delocalised system |

Ferrocene is a useful example because the metallocene structure stays explicit:

  • it is not flattened into a string
  • each Cp delocalised system spans five Cp ring C-C edges
  • each Cp delocalised system also spans five Fe-C contacts to the central iron
  • overlapping delocalised and covalent systems get separate dashed lanes in the viewer
  • ordinary Cp/C-H and Cp/Cp covalent edges are still present as unnamed one-edge systems that display as single covalent

| System | Shared electrons | Member edges |
| --- | --- | --- |
| cp1_pi | 6e | the five C-C edges in the first Cp ring plus the five Fe-C contacts to that ring |
| cp2_pi | 6e | the five C-C edges in the second Cp ring plus the five Fe-C contacts to that ring |

See the full expanded ADT in moladt/examples/ferrocene.py.

Validator

validate_molecule(molecule) is a representation validator, not a physical chemistry oracle. It rejects malformed MolADT values before parsing, viewing, feature extraction, or inverse-design scoring continue.

It checks:

  • atom map keys match the atom IDs
  • coordinates and element metadata are finite
  • SystemIds are positive and unique
  • bonding systems are non-empty
  • bonding-system edges reference existing atoms
  • cached member atoms match the member edges
  • duplicate bonding systems are not present
  • SMILES stereochemistry annotations only point at known atoms
  • ordinary one-edge covalent systems with 2, 4, 6, or 8 shared electrons are unnamed and display as single/double/triple/quadruple covalent bonds
  • one-edge 0e systems are tagged ionic
  • ionic systems share zero electrons over exactly one edge

It deliberately does not:

  • prove a molecule is physically realistic
  • choose protonation states
  • infer missing hydrogens
  • decide whether a delocalised system is chemically preferred

Task-specific checks can add those rules separately. The FreeSolv inverse-design task now runs this generic validator before its own FreeSolv geometry and generation-contract checks.
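A few of the structural checks listed above can be sketched as a function that collects error messages. This is an illustrative subset under simplified assumptions, not validate_molecule itself: systems are plain (system_id, shared_electrons, member_edges, tag) tuples and validate_sketch is a hypothetical name.

```python
def validate_sketch(atom_ids, systems):
    """Return a list of error strings; an empty list means these checks passed."""
    errors = []
    seen_ids = set()
    for system_id, shared_electrons, member_edges, tag in systems:
        if system_id <= 0:
            errors.append(f"system {system_id}: id must be positive")
        if system_id in seen_ids:
            errors.append(f"system {system_id}: duplicate id")
        seen_ids.add(system_id)
        if not member_edges:
            errors.append(f"system {system_id}: no member edges")
        for a, b in member_edges:
            if a not in atom_ids or b not in atom_ids:
                errors.append(f"system {system_id}: edge ({a}, {b}) references a missing atom")
        one_edge = len(member_edges) == 1
        # Ionic systems share zero electrons over exactly one edge.
        if tag == "ionic" and not (one_edge and shared_electrons == 0):
            errors.append(f"system {system_id}: ionic must be one edge with 0e")
        # One-edge 0e systems must carry the ionic tag.
        if one_edge and shared_electrons == 0 and tag != "ionic":
            errors.append(f"system {system_id}: one-edge 0e system must be tagged ionic")
    return errors

# Sodium chloride: Na#1, Cl#2, and one ionic 0e edge between them.
nacl_atoms = {1, 2}
nacl_systems = [(1, 0, [(1, 2)], "ionic")]

print(validate_sketch(nacl_atoms, nacl_systems))             # []
print(validate_sketch(nacl_atoms, [(1, 0, [(1, 2)], None)]))  # one error: missing ionic tag
```

Returning every failure rather than stopping at the first one is a design choice of this sketch; the source does not say how the real validator reports errors.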

View Molecules

make view
make molecule-viewer VIEWER_EXAMPLES="benzene diborane ferrocene"
make python-pretty-example EXAMPLE=morphine

make view opens seven built-in examples in one browser page.

The viewer renders:

  • sodium chloride's charged 0e ionic edge
  • positive and negative formal charge as blue and red halos
  • charge halo size and opacity proportional to formal-charge magnitude
  • an extra halo boost for atoms in an ionic bonding system
  • ionic edges as blue-to-red gradients between charged atoms
  • ordinary covalent edges in dark grey
  • single covalent bonds as one line
  • double covalent bonds as two lines around the edge axis
  • triple covalent bonds as three lines
  • quadruple covalent bonds as four lines
  • overlapping bonding systems as separate dashed lanes, including ordinary covalent versus delocalised overlap in ferrocene
  • non-standard systems such as pi rings, bridge bonds, and coordination as coloured dashed overlays

The right panel lists:

  • all bonding systems
  • single covalent, double covalent, triple covalent, and quadruple covalent one-edge systems
  • selected atom coordinates
  • 3D edge lengths
  • effective orders
  • bonding systems and bond angles from molecule coordinates

Bonding-system text labels are not drawn over the molecule.

Viewer opening behavior:

  • use OPEN_VIEWER=1 to open generated viewer pages automatically
  • viewer commands also print a portable file:// URL
  • if the operating system does not open a browser, open the reported URL manually

OPEN_VIEWER=1 make molecule-viewer VIEWER_EXAMPLES=diborane
OPEN_VIEWER=1 make inverse-design TARGET=-5.0

FreeSolv Inverse Design

Run FreeSolv and inverse design together:

make freesolv
make inverse-design TARGET=-5.0
make inverse-design-view

make inverse-design:

  • uses the latest results/freesolv/run_* Bayesian GP artifact
  • samples initial chains from the MolADT FreeSolv prior
  • reweights that prior with the unchanged GP target likelihood
  • generates 1,000 candidates
  • writes the top 10 by Bayesian credible score

Output notes:

  • geometry values are audited in the outputs
  • the sampler is not conditioned on physical plausibility
  • generated reference outputs live in results/inverse_design/reference/

Benchmarks

| Benchmark | Target | Main command |
| --- | --- | --- |
| FreeSolv | hydration free energy | make freesolv |
| FreeSolv 20 split | hydration free energy | make freesolv-20split |
| FreeSolv inverse design | target hydration free energy | make inverse-design TARGET=-5.0 |
| QM9 | dipole moment mu | make qm9long |
| ZINC timing | representation throughput | make timing |

Benchmark details are in "Inference and benchmarks", "Benchmarking models and features", "Outputs", and the results README.

Python And Haskell

MolADT JSON is the boundary shared by the Python and Haskell repos:

  • Python writes MolADT JSON
  • Haskell reads the same MolADT JSON shape
  • atom IDs, system IDs, bonding systems, charges, and stereochemistry remain explicit

./.venv/bin/python - <<'PY' > morphine.moladt.json
from moladt.examples import morphine_pretty
from moladt.io import molecule_to_json

print(molecule_to_json(morphine_pretty))
PY

stack run moladtbayes -- from-json ../MolADT-Bayes-Python/morphine.moladt.json

Python also writes standardized benchmark matrices that Haskell can consume:

./.venv/bin/python -m benchmarking.run_all freesolv
stack run moladtbayes -- infer-benchmark freesolv_moladt_featurized mh:0.2

Repo Map

| Path | Purpose |
| --- | --- |
| moladt/ | molecule ADT, validation, parsers, renderers, viewer, examples |
| experiments/ | FreeSolv inverse design |
| benchmarking/ | data processing, feature generation, benchmark model runs, reporting |
| stan/ | Bayesian model definitions |
| data/ | vendored and processed benchmark data |
| results/ | committed reference outputs and local run artifacts |
| docs/ | modular documentation |

About

A Library to Represent Molecules (for use with probabilistic programming) with Algebraic Data Types, maintainer [email protected]
