
Hierarchical Molecular Language Models (HMLMs)

License: MIT · Python: 3.8+ · arXiv: 2512.00696

Overview

Hierarchical Molecular Language Models (HMLMs) represent a paradigm shift in computational systems biology by treating cellular signaling networks as specialized "molecular languages". This framework adapts transformer architectures to model complex biological signaling across multiple scales, from individual molecules to cellular responses, enabling superior predictive accuracy and mechanistic biological insight.

HMLMs achieve a 30% improvement over Graph Neural Networks (GNNs) and a 52% improvement over Ordinary Differential Equation (ODE) models in temporal signaling prediction, while maintaining robust performance under sparse temporal sampling conditions.

Key Features

  • Multi-scale Integration: Hierarchical processing of molecular, pathway, and cellular-level information
  • Graph-Structured Attention: Adapted transformer mechanisms for network topology accommodation
  • Temporal Dynamics: Explicit modeling of signaling kinetics across multiple timescales
  • Multi-modal Data Fusion: Integration of phosphoproteomics, transcriptomics, imaging, and perturbation data
  • Interpretable Attention Mechanisms: Identifies biologically meaningful cross-talk patterns and regulatory relationships
  • Sparse Data Handling: Superior performance with limited temporal sampling (MSE = 0.042 with only 4 timepoints)

Results Summary

Model Performance Comparison

Temporal Resolution    HMLM   GNN    ODE    LDE    Bayesian Networks
4 timepoints (sparse)  0.042  0.049  0.121  0.071  0.071
8 timepoints (medium)  0.050  0.079  0.120  0.091  0.091
16 timepoints (high)   0.056  0.087  0.121  0.096  0.095

All values are mean squared error (MSE); lower is better.

Improvements over Baseline Methods

HMLM demonstrates significant performance advantages across all temporal resolutions:

Sparse Data (4 Timepoints)

  • 14% improvement over GNN (0.042 vs 0.049 MSE)
  • 41% improvement over LDE (0.042 vs 0.071 MSE)
  • 41% improvement over Bayesian Networks (0.042 vs 0.071 MSE)
  • 65% improvement over ODE (0.042 vs 0.121 MSE)

Medium Resolution (8 Timepoints)

  • 37% improvement over GNN (0.050 vs 0.079 MSE)
  • 45% improvement over LDE (0.050 vs 0.091 MSE)
  • 45% improvement over Bayesian Networks (0.050 vs 0.091 MSE)
  • 58% improvement over ODE (0.050 vs 0.120 MSE)

High Resolution (16 Timepoints)

  • 36% improvement over GNN (0.056 vs 0.087 MSE)
  • 42% improvement over LDE (0.056 vs 0.096 MSE)
  • 41% improvement over Bayesian Networks (0.056 vs 0.095 MSE)
  • 54% improvement over ODE (0.056 vs 0.121 MSE)

Key Performance Characteristics

  • Superior sparse data handling: Maintains robust performance with minimal timepoints
  • Consistent improvement: Outperforms all baseline methods across all temporal resolutions
  • Data efficiency: Performance remains strong even with limited training data
  • Network-aware predictions: Leverages cardiac fibroblast signaling network topology for enhanced accuracy

Installation

Requirements

  • Python 3.8 or higher
  • PyTorch 1.9+
  • NumPy 1.19+
  • Pandas 1.1+
  • Scikit-learn 0.24+
  • NetworkX 2.5+
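
A requirements.txt matching these minimums might look like the following sketch (the package names and version floors mirror the list above; the exact pins in the repository may differ):

torch>=1.9
numpy>=1.19
pandas>=1.1
scikit-learn>=0.24
networkx>=2.5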

Setup

# Clone the repository
git clone https://github.com/HasiHays/HMLMs.git
cd HMLMs

# Create virtual environment (recommended)
python -m venv hmlm_env
source hmlm_env/bin/activate  # On Windows: hmlm_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

from hmlm import HMLM, CardiacFibroblastNetwork

# Initialize network
network = CardiacFibroblastNetwork()

# Create and train model
model = HMLM(
    network_topology=network.graph,
    num_scales=3,
    attention_heads=8,
    hidden_dim=512
)

# Train on signaling data (see Data Requirements below for the expected format)
model.train(
    temporal_data=signaling_data,
    num_epochs=50,
    learning_rate=0.001
)

# Predict cellular responses
predictions = model.predict(
    initial_conditions=conditions,
    time_steps=100
)

# Extract biological insights
attention_patterns = model.analyze_attention()
cross_talk = model.identify_pathway_crosstalk()

Usage Examples

Basic Model Training

See HMLMs_main.ipynb for comprehensive examples including:

  • Data preprocessing and normalization
  • Multi-scale embedding generation
  • Model training and validation
  • Attention mechanism visualization

Cardiac Fibroblast Signaling Prediction

# Define experimental conditions
conditions = {
    'control': {'TGF_beta': 0, 'mechanical_strain': 0},
    'TGF_beta': {'TGF_beta': 1.0, 'mechanical_strain': 0},
    'strain': {'TGF_beta': 0, 'mechanical_strain': 1.0},
    'combined': {'TGF_beta': 1.0, 'mechanical_strain': 1.0}
}

# Predict dynamics across conditions
for condition_name, stim in conditions.items():
    dynamics = model.predict_dynamics(stim, time_steps=100)
    # Analyze fibrosis markers, contractility, etc.

Attention-Based Pathway Analysis

# Extract molecular-scale attention weights
mol_attention = model.get_molecular_attention()

# Identify high-confidence regulatory edges
significant_edges = mol_attention[mol_attention > 0.7]

# Pathway-level cross-talk quantification
pathway_crosstalk = model.quantify_cross_talk()
print(f"TGF-β to MAPK cross-talk: {pathway_crosstalk['TGFb_MAPK']:.3f}")

Architecture

Core Components

  1. Information Transducers: Fundamental units modeling biological entities (proteins, complexes, pathways)

    • Input Space: External signals and stimuli
    • State Space: Internal configurations (conformational, activation, binding, modification, localization states)
    • Output Space: Signal transformations and functional responses
  2. Graph-Structured Attention: Adapted transformer mechanisms respecting network topology

    • Node embeddings incorporating entity type, features, and graph positional information
    • Multi-head attention operating within network neighborhoods
    • Temporal attention capturing signal propagation delays
  3. Scale-Bridging Operators (a minimal sketch follows this list):

    • Aggregation (Ω↑): Combines molecular-level information into pathway-level representations
    • Decomposition (Ω↓): Distributes pathway signals to constituent molecules
    • Translation (Ω↔): Converts information between representational formats
  4. Hierarchical Integration:

    • Within-scale attention: Local interactions at each biological scale
    • Cross-scale attention: Information flow across scales
    • Feed-forward networks: Non-linear integration

Data Requirements

Input Data Formats

The model accepts multi-modal biological data:

  • Phosphoproteomics: Protein phosphorylation time series (measured via mass spectrometry or immunoassays)
  • Transcriptomics: Gene expression profiles (RNA-seq or qPCR)
  • Imaging Data: Subcellular localization, morphological features (immunofluorescence or live-cell imaging)
  • Perturbation Data: Effects of knockdowns, inhibitors, or genetic modifications

Example Data Structure

import numpy as np

# Temporal signaling data
data = {
    'time_points': np.array([0, 1, 5, 10, 30, 60, 120]),  # minutes
    'molecules': {
        'ERK_phos': np.array([0.0, 0.1, 0.8, 0.9, 0.5, 0.2, 0.0]),
        'AKT_phos': np.array([0.0, 0.05, 0.3, 0.5, 0.7, 0.8, 0.9]),
        'p53': np.array([1.0, 1.0, 1.5, 2.0, 1.8, 1.2, 1.0])
    },
    'conditions': 'TGF_beta_stimulation'
}

Model Evaluation

The repository includes comprehensive evaluation tools:

from hmlm.evaluation import evaluate_model, plot_predictions

# Evaluate on test data
metrics = evaluate_model(
    model=trained_model,
    test_data=test_signaling_data,
    metrics=['mse', 'pearson_correlation', 'temporal_resolution']
)

# Visualize predictions vs. observations
plot_predictions(
    model=trained_model,
    data=test_data,
    molecules=['ERK', 'AKT', 'p38'],
    conditions=['control', 'TGF_beta']
)

Mathematical Foundation

Information Transducer Definition

Each biological entity is modeled as an information transducer T = (X, Y, S, f, g):

  • State Transition: s(t+1) = f(x(t), s(t))
  • Output Function: y(t) = g(s(t))
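
As a minimal illustration of this definition (not the repository's API), a transducer reduces to a small class holding a state s, a transition function f, and an output function g:

class InformationTransducer:
    """Sketch of T = (X, Y, S, f, g)."""
    def __init__(self, f, g, initial_state):
        self.f = f          # state transition: s(t+1) = f(x(t), s(t))
        self.g = g          # output function:  y(t) = g(s(t))
        self.s = initial_state

    def step(self, x):
        y = self.g(self.s)          # emit output for the current state
        self.s = self.f(x, self.s)  # advance the internal state
        return y

# Example: a kinase whose activation relaxes toward the stimulus level
kinase = InformationTransducer(
    f=lambda x, s: s + 0.1 * (x - s),  # first-order approach to input x
    g=lambda s: s,                     # output = current activation
    initial_state=0.0,
)
trace = [kinase.step(x=1.0) for _ in range(5)]  # rising activation trace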

Graph Attention Mechanism

GraphAttention_v(Q, K, V) = softmax(Q_v K^T_{N(v)} / √d_k) V_{N(v)}

where N(v) represents the neighborhood of vertex v in the signaling network.
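
A minimal PyTorch rendering of this neighborhood-restricted attention (an illustrative sketch, not the repository's implementation) masks the attention logits outside N(v) before the softmax:

import math
import torch

def graph_attention(Q, K, V, adjacency):
    # Q, K, V: (num_nodes, d_k); adjacency: (num_nodes, num_nodes) 0/1 matrix
    # with adjacency[v, u] = 1 when u is in N(v). Self-loops are assumed so
    # every row has at least one unmasked entry.
    d_k = Q.size(-1)
    logits = Q @ K.T / math.sqrt(d_k)
    logits = logits.masked_fill(adjacency == 0, float("-inf"))  # restrict to N(v)
    return torch.softmax(logits, dim=-1) @ V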

Temporal Embedding

h^(0)_v(t) = h̃^(0)_v + τ(t)

Combines static node embeddings with learned temporal patterns across multiple timescales.
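
One concrete way to realize τ(t) over several characteristic timescales, in the spirit of sinusoidal positional encodings (an assumption for illustration, not necessarily the paper's exact construction):

import torch

def temporal_embedding(t, dim, timescales=(1.0, 10.0, 100.0)):
    # tau(t): sinusoids at several signaling timescales (e.g., fast
    # phosphorylation vs. slow transcriptional responses), zero-padded
    # to the node embedding dimension.
    feats = []
    for ts in timescales:
        feats += [torch.sin(torch.tensor(t / ts)), torch.cos(torch.tensor(t / ts))]
    tau = torch.stack(feats)
    return torch.nn.functional.pad(tau, (0, dim - tau.numel()))

# h_v^(0)(t) = static node embedding + tau(t)
h_static = torch.randn(512)
h_t = h_static + temporal_embedding(t=30.0, dim=512)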

Publications & Citation

If you use HMLMs in your research, please cite the preprint (arXiv:2512.00696 [cs.AI], 30 Nov 2025; https://doi.org/10.48550/arXiv.2512.00696):

@misc{hays2025hmlms,
  doi = {10.48550/ARXIV.2512.00696},
  url = {https://arxiv.org/abs/2512.00696},
  author = {Hays, Hasi and Yu, Yue and Richardson, William},
  keywords = {Molecular Networks (q-bio.MN), Artificial Intelligence (cs.AI), Emerging Technologies (cs.ET), FOS: Biological sciences, FOS: Computer and information sciences},
  title = {Hierarchical Molecular Language Models (HMLMs)},
  publisher = {arXiv},
  year = {2025},
  copyright = {Creative Commons Attribution 4.0 International}
}

Support & Issues

For bug reports, feature requests, or questions, please open an issue on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

This work was supported by:

  • National Institutes of Health (NIGMS R01GM157589)
  • Department of Defense (DEPSCoR FA9550-22-1-0379)

Last Updated: December 8, 2025
Maintained By: Hasi Hays ([email protected])
