DMR Detection Using Non-Homogeneous Hidden Markov Models
Cornell University Research Project | CS 4775 - Computational Biology | Fall 2024
A statistical pipeline implementing Non-Homogeneous Hidden Markov Models (NHMM) to detect Differentially Methylated Regions (DMRs) in psychiatric conditions such as PTSD and schizophrenia. Reimplements and extends the framework from Shen et al. (2017).
Research Question: Can we accurately identify regions of differential DNA methylation associated with psychiatric trauma using probabilistic graphical models?
Epigenetic modifications, particularly DNA methylation, play a crucial role in psychiatric conditions like PTSD. Identifying Differentially Methylated Regions (DMRs) - areas where methylation patterns differ between affected and control groups - is essential for understanding the biological mechanisms of trauma.
We implemented a statistical framework that combines:
- Non-Homogeneous Hidden Markov Models (NHMM) to model spatial dependencies between CpG sites
- Metropolis-Hastings sampling for parameter estimation
- Bayesian inference for state classification
- Large-scale biological data (6GB genomic datasets)
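Unlike a homogeneous HMM, which uses the same transition probabilities between every pair of neighboring sites, our NHMM lets the transition probabilities vary with the genomic distance between CpG probes. Incorporating this genomic structure into the probabilistic model is intended to make DMR detection more accurate.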
Differentially Methylated Regions are genomic locations where DNA methylation levels differ significantly between groups (e.g., PTSD patients vs. controls). These regions can indicate:
- Gene expression changes
- Potential biomarkers for psychiatric conditions
- Epigenetic responses to trauma
Why Hidden Markov Models?
CpG sites (where methylation occurs) are spatially correlated along the genome. HMMs capture these dependencies by:
- Modeling each site as having a hidden state (DMC or non-DMC)
- Accounting for transition probabilities between neighboring sites
- Incorporating genomic distance into the model
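To make the last point concrete, here is a minimal sketch of a distance-dependent transition matrix. The exponential-decay form and the names `transition_matrix`, `rho`, and `p_dmc` are illustrative assumptions; they are not necessarily the parameterization used in this project or in Shen et al. (2017).

```python
import numpy as np

def transition_matrix(distance, rho=0.002, p_dmc=0.1):
    """Return a 2x2 transition matrix for two sites `distance` bp apart.

    Assumed form (illustrative only): with probability exp(-rho * distance)
    the next site copies the current state; otherwise its state is drawn
    afresh from the stationary distribution (1 - p_dmc, p_dmc).
    Rows = current state, columns = next state; 0 = non-DMC, 1 = DMC.
    """
    stay = np.exp(-rho * distance)
    stationary = np.array([1.0 - p_dmc, p_dmc])
    return stay * np.eye(2) + (1.0 - stay) * np.tile(stationary, (2, 1))

near = transition_matrix(10)        # neighboring probes: states strongly coupled
far = transition_matrix(10_000)     # distant probes: nearly independent
print(near)
print(far)
```

Under this assumed form, close probes almost always share a state, while distant probes fall back to the overall DMC frequency. That dependence on distance is what makes the chain non-homogeneous.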
1. Data Simulation
↓
2. Backward Probability Calculation
↓
3. Hidden State Sampling
↓
4. Metropolis-Hastings Parameter Sampling
↓
5. Emission Parameter Estimation
↓
6. Iterative Refinement
↓
7. DMR Visualization & Validation
Core Technologies:
- Python 3.8+ - Primary implementation language
- NumPy & Pandas - Numerical computation and data manipulation
- SciPy - Statistical distributions and optimization
- Matplotlib - Data visualization and figure generation
- Jupyter Notebooks - Interactive development and analysis (90% of codebase)
Statistical Methods:
- Hidden Markov Models (HMM)
- Metropolis-Hastings Algorithm
- Bayesian Inference
- ROC/Precision-Recall Analysis
Data Sources:
- Hannon et al. (GSE152027) - 6GB biological methylation dataset
- Illumina HumanMethylation450 Array - 450,000+ CpG sites
- Manufacturer annotations for probe locations
Purpose: Generate synthetic methylation data for model validation
How it works:
- Downloads real biological data (6GB) from Gene Expression Omnibus
- Extracts methylation patterns and probe distances
- Simulates realistic DMR scenarios with known ground truth
- Creates training and validation datasets
Output:
- distances.npy - Genomic distances between probes
- simulated_data.npy - Synthetic methylation values
- true_states.npy - Ground truth DMR labels
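As a rough illustration of how synthetic data with known ground truth can be produced, the sketch below draws hidden states from a distance-dependent chain and then samples one Beta-distributed methylation value per site. All numeric settings, the sampled distances, and the variable names are assumptions made for the example; the actual simulation script derives distances and methylation patterns from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites = 1_000

# Hypothetical inter-probe distances in base pairs. The real pipeline
# extracts these from the array annotations instead of sampling them.
distances = rng.integers(1, 5_000, size=n_sites - 1)

# Ground-truth states: 0 = non-DMC, 1 = DMC. Assumed "sticky" chain whose
# persistence decays exponentially with the distance to the previous site.
states = np.zeros(n_sites, dtype=int)
for t in range(1, n_sites):
    stay = np.exp(-0.001 * distances[t - 1])
    if rng.random() < stay:
        states[t] = states[t - 1]          # copy the previous state
    else:
        states[t] = rng.random() < 0.1     # redraw; 10% DMC frequency (assumed)

# State-specific Beta emissions (parameter values are assumptions).
alpha = np.where(states == 1, 8.0, 2.0)
beta_param = np.where(states == 1, 2.0, 8.0)
simulated = rng.beta(alpha, beta_param)

print(simulated[:5], states[:5])
```

Each array could then be written with `np.save` to the corresponding `.npy` file so that later steps can compare inferred states against `states`.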
Purpose: Calculate backward probabilities for HMM inference
Implementation:
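- Computes the backward probabilities recursively, from the last CpG site back to the first; each one is the probability of the remaining observations given the hidden state at the current site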
- Handles numerical stability with log-space calculations
- Samples hidden states using forward-backward algorithm
Mathematical Foundation:
β_t(i) = P(o_{t+1:T} | q_t = i)
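The recursion behind this definition can be sketched in log space as follows. This is an illustrative implementation under stated assumptions, not the project's `estimate.py`: the function names, the array layout (one transition matrix per adjacent pair of sites), and the toy inputs are all hypothetical.

```python
import numpy as np

def log_backward(log_emis, log_trans):
    """log_emis: (T, K) log emission densities.
    log_trans: (T-1, K, K) log transition matrices, one per adjacent pair,
    which is where a non-homogeneous model differs from a standard HMM.
    Returns log beta with shape (T, K)."""
    T, K = log_emis.shape
    log_beta = np.zeros((T, K))            # beta_T(i) = 1, so log beta_T = 0
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A_t[i, j] * b_j(o_{t+1}) * beta_{t+1}(j)
        terms = log_trans[t] + log_emis[t + 1] + log_beta[t + 1]
        log_beta[t] = np.logaddexp.reduce(terms, axis=1)
    return log_beta

def sample_states(log_init, log_emis, log_trans, log_beta, rng):
    """Draw one hidden-state path from its posterior, site by site."""
    T, K = log_emis.shape
    states = np.empty(T, dtype=int)
    logp = log_init + log_emis[0] + log_beta[0]
    for t in range(T):
        if t > 0:
            logp = log_trans[t - 1][states[t - 1]] + log_emis[t] + log_beta[t]
        p = np.exp(logp - np.logaddexp.reduce(logp))   # normalize
        states[t] = rng.choice(K, p=p)
    return states

# Toy inputs (assumed values, for illustration only).
rng = np.random.default_rng(1)
T, K = 4, 2
log_init = np.log([0.9, 0.1])
log_emis = np.log(rng.uniform(0.1, 1.0, size=(T, K)))
A = np.array([[0.95, 0.05], [0.20, 0.80]])
log_trans = np.log(np.stack([A] * (T - 1)))

log_beta = log_backward(log_emis, log_trans)
log_lik = np.logaddexp.reduce(log_init + log_emis[0] + log_beta[0])
states = sample_states(log_init, log_emis, log_trans, log_beta, rng)
print(log_lik, states)
```

Working with logarithms and `np.logaddexp` avoids the underflow that would occur when multiplying probabilities across hundreds of thousands of sites.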
Purpose: Estimate transition parameters using Metropolis-Hastings
Algorithm:
- Proposes new parameter values from proposal distribution
- Calculates acceptance probability based on likelihood
- Accepts the proposal with that probability; otherwise keeps the current parameter values
- Repeats for many iterations so the chain settles into the posterior distribution of the parameters
Key Innovation: Accounts for variable genomic distances in transition probabilities
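A minimal random-walk Metropolis-Hastings sketch for a single distance-decay parameter is shown below. The toy likelihood (adjacent sites share a state with probability `exp(-rho * distance)`), the flat prior on `rho > 0`, and every name and constant are assumptions chosen to keep the example short; the project's actual transition model and proposal distribution may differ.

```python
import numpy as np

def log_likelihood(rho, distances, stayed):
    """Toy model (assumption): adjacent sites share a state with
    probability exp(-rho * distance). Flat prior on rho > 0."""
    if rho <= 0:
        return -np.inf
    log_stay = -rho * distances
    log_switch = np.log(-np.expm1(log_stay))      # log(1 - exp(-rho * d))
    return np.sum(np.where(stayed, log_stay, log_switch))

def metropolis_hastings(log_target, init, n_iter, step, rng):
    samples = np.empty(n_iter)
    current, current_lp = init, log_target(init)
    accepted = 0
    for i in range(n_iter):
        proposal = current + step * rng.normal()   # symmetric Gaussian proposal
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if np.log(rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples[i] = current                       # rejected: repeat current value
    return samples, accepted / n_iter

# Simulated stay/switch indicators with a known decay rate.
rng = np.random.default_rng(42)
true_rho = 0.002
distances = rng.integers(1, 2_000, size=2_000)
stayed = rng.random(distances.size) < np.exp(-true_rho * distances)

samples, rate = metropolis_hastings(
    lambda r: log_likelihood(r, distances, stayed),
    init=0.01, n_iter=5_000, step=2e-4, rng=rng,
)
posterior_mean = samples[1_000:].mean()            # discard burn-in
print(posterior_mean, rate)
```

Plotting `samples` against the iteration index gives the kind of trace used to judge convergence.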
Purpose: Sample methylation distributions for DMC and non-DMC states
Bayesian Approach:
- Models methylation as beta distributions
- Estimates separate parameters for:
- DMC (Differentially Methylated CpG) sites
- Non-DMC (normal) sites
- Updates priors based on observed data
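The sketch below shows one simple way to obtain state-specific Beta parameters and the resulting emission log-densities. It uses a method-of-moments fit in place of the Bayesian sampling described above, purely to keep the example compact; the function names and the simulated inputs are assumptions.

```python
import math
import numpy as np

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * np.log(x) + (b - 1) * np.log1p(-x) - log_norm

def fit_beta_moments(x):
    """Method-of-moments estimate of (a, b); a stand-in for posterior sampling."""
    mean, var = x.mean(), x.var()
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Simulated methylation values with known labels (assumed parameters).
rng = np.random.default_rng(7)
values = np.concatenate([rng.beta(2, 8, 900), rng.beta(8, 2, 100)])
states = np.concatenate([np.zeros(900, int), np.ones(100, int)])

# One Beta distribution per hidden state: 0 = non-DMC, 1 = DMC.
params = {k: fit_beta_moments(values[states == k]) for k in (0, 1)}

# (n_sites, 2) matrix of emission log-densities, usable in the HMM recursions.
log_emis = np.column_stack([beta_logpdf(values, *params[k]) for k in (0, 1)])
print(params)
```

In an iterative scheme, the labels would come from the most recent hidden-state sample rather than from known ground truth.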
Purpose: Orchestrate complete model training and iteration
Process:
- Initialize model parameters
- Iteratively refine estimates using steps 1-4
- Monitor convergence
- Save training checkpoints
- Generate diagnostic visualizations
Output: Pickle files with complete parameter history
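A checkpoint of this kind could be written and read back roughly as follows. The layout of `history` (a list with one dictionary per iteration) and the helper names `save_history` and `load_iteration` are hypothetical; only the `params_YYYYMMDD-HHMMSS.pckl` naming pattern comes from this README.

```python
import pickle
import tempfile
import time
from pathlib import Path

def save_history(history, out_dir="."):
    """Pickle the full parameter history under a timestamped file name."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = Path(out_dir) / f"params_{stamp}.pckl"
    with open(path, "wb") as fh:
        pickle.dump(history, fh)
    return path

def load_iteration(path, x):
    """Return the parameters stored for training iteration x."""
    with open(path, "rb") as fh:
        history = pickle.load(fh)
    return history[x]

# Hypothetical history: one dict of parameter values per iteration.
history = [{"iteration": i, "rho": 0.01 / (i + 1)} for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    path = save_history(history, tmp)
    last = load_iteration(path, x=-1)
    name = path.name

print(name, last)
```

Keeping every iteration in one file makes it easy to inspect convergence afterwards, at the cost of a file that grows with the number of iterations.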
Purpose: Assess model performance using standard metrics
Metrics:
- ROC Curves - True Positive Rate vs. False Positive Rate
- Precision-Recall - Precision vs. Recall across classification thresholds, informative when DMCs are rare
- AUC Scores - Overall classification performance
Output: Publication-quality figures (precision_recall.pdf)
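For reference, these curves can be computed from ground-truth labels and per-site scores (for example, posterior DMC probabilities) with plain NumPy, as in the sketch below. The function names are illustrative, tied scores are not handled specially, and this is not necessarily how `generate_roc.py` does it.

```python
import numpy as np

def roc_curve(labels, scores):
    """Return (fpr, tpr) as the threshold sweeps from high to low scores."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives at each cutoff
    fp = np.cumsum(1 - labels)             # false positives at each cutoff
    tpr = np.concatenate([[0.0], tp / tp[-1]])
    fpr = np.concatenate([[0.0], fp / fp[-1]])
    return fpr, tpr

def precision_recall(labels, scores):
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, labels.size + 1)
    recall = tp / tp[-1]
    return precision, recall

def auc(x, y):
    """Area under a curve by the trapezoidal rule."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

labels = np.array([0, 0, 1, 1])            # 1 = true DMC
scores = np.array([0.1, 0.4, 0.35, 0.8])   # hypothetical model scores

fpr, tpr = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)
precision, recall = precision_recall(labels, scores)
print(roc_auc)   # → 0.75
```

The resulting arrays can be passed directly to Matplotlib's `plot` to draw the curves.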
- Model Convergence: Achieved stable parameters after [X] iterations
- ROC AUC: [To be added]
- Precision/Recall: [To be added]
- Dataset Size: 450,000+ CpG sites analyzed
Figure 1: Methylation distributions in real vs. synthetic data
Figure 2: Metropolis-Hastings convergence for transition parameters
Figure 3: Distribution of probe distances
Figure 4: Paired samples by hidden state
Figure 5: [Additional visualization]
Figure 6: ROC curves showing model performance
Figure 7: Precision-Recall analysis
📸 Figures will be added: Screenshots of key visualizations coming soon
# Python 3.8+
pip install numpy pandas matplotlib scipy
Step 1: Download Biological Dataset (Required)
# Download from Gene Expression Omnibus
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152027
# File: GSE152027_IOP_raw_Signal.csv.gz (~2GB download, 6GB extracted)
# Rename and place in project root
mv GSE152027_IOP_raw_Signal.csv sample.txt
Step 2: Download Array Annotations
# Download from Illumina
# https://support.illumina.com/array/array_kits/infinium_humanmethylation450_beadchip_kit/downloads.html
# File: GPL13534_HumanMethylation450_15017482_v.1.1.csv.gz
# Extract and rename
mv GPL13534_HumanMethylation450_15017482_v.1.1.csv annotations.csv
# 1. Generate synthetic data
python simulate_data.py
# 2. Train the model
python main.py
# This will create: params_YYYYMMDD-HHMMSS.pckl
# 3. Evaluate performance
python generate_roc.py
# Edit params_file variable to point to your training output
# Set x to desired training iteration
dmr-detection-nhmm/
├── simulation_data.py  # Synthetic data generation
├── estimate.py  # Backward probabilities & state sampling
├── metropolis.py  # Metropolis-Hastings for transitions
├── sample_dmc.py  # Emission parameter sampling
├── main.py  # Main training pipeline
├── c_with_mh.py  # MH algorithm visualization
├── generate_roc.py  # Model evaluation
├── probabilities.py  # Probability calculations
├── notebooks/  # Jupyter notebooks (90% of codebase)
│   └── [development notebooks]
├── data/
│   ├── sample.txt  # Raw biological data (6GB)
│   ├── annotations.csv  # Probe annotations
│   ├── distances.npy  # Computed distances
│   ├── simulated_data.npy  # Synthetic methylation
│   └── true_states.npy  # Ground truth labels
└── results/
    ├── params_*.pckl  # Saved model parameters
    └── precision_recall.pdf  # Performance figures
Study Design (Co-lead):
- ✓ Defined research questions and methodology
- ✓ Designed validation strategy using synthetic data
- ✓ Planned evaluation metrics (ROC, Precision-Recall)
- ✓ Structured experimental pipeline
Visualization (Co-lead):
- ✓ Created Figures 1, 3, 4, 5 showing data distributions
- ✓ Designed publication-quality plots
- ✓ Visualized model convergence and performance
- ✓ Developed diagnostic graphics for parameter estimation
Documentation & Writing:
- ✓ Co-authored technical documentation
- ✓ Explained statistical methodology
- ✓ Documented pipeline usage
Team Members:
- Maya Murry (mmm443) - Study Design, Visualization, Writing
- Mason (msr286) - Study Design, Coding, Visualization
- Viviana - Coding, Writing
- Greg - Coding, Writing
Project Context:
- Cornell University CS 4775 (Computational Biology)
- Fall 2024
- Independent Research Project
- Instructor: [Professor Name]
Problem: Memory constraints when processing 450,000+ CpG sites
Solution: Implemented chunked processing and efficient NumPy operations
Result: Successfully processed complete dataset on standard hardware
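The general idea of chunked processing can be sketched as below: only one block of rows is converted and reduced at a time, so peak memory depends on the chunk size rather than on the full matrix. The helper names, the chunk size, and the per-row mean used as the example computation are assumptions; the project's actual chunking strategy is not described here.

```python
import numpy as np

def iter_chunks(n_rows, chunk_size):
    """Yield slices that cover range(n_rows) in blocks of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield slice(start, min(start + chunk_size, n_rows))

def chunked_row_means(matrix, chunk_size=50_000):
    """Per-row mean computed one chunk at a time."""
    out = np.empty(matrix.shape[0])
    for rows in iter_chunks(matrix.shape[0], chunk_size):
        block = np.asarray(matrix[rows], dtype=float)   # only this chunk in memory
        out[rows] = block.mean(axis=1)
    return out

# Small stand-in for a (sites x samples) methylation matrix.
data = np.random.default_rng(3).random((1_000, 8))
means = chunked_row_means(data, chunk_size=128)
print(means[:3])
```

For arrays stored on disk as `.npy` files, passing `mmap_mode="r"` to `np.load` returns a memory-mapped array that can be sliced the same way without loading the whole file.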
Problem: Ensuring the Metropolis-Hastings sampler converged to a stable parameter distribution
Solution: Implemented convergence diagnostics and adaptive proposal distributions
Result: Stable parameter estimates visualized in Figure 2
Problem: No ground truth for real biological data
Solution: Created synthetic data with known DMRs for validation
Result: Quantified model accuracy using ROC/Precision-Recall curves
Shen et al. (2017)
"Detecting differentially methylated regions for methyl-seq data using Hidden Markov Model"
- Adapted for psychiatric epigenetics data
- Optimized for large-scale array data (450K probes)
- Enhanced visualization pipeline
- Comprehensive evaluation framework
- PTSD biomarker discovery
- Schizophrenia epigenetic analysis
- General psychiatric genomics research
Potential extensions of this work:
- Clinical Validation: Apply to additional PTSD cohorts
- Multi-Condition Analysis: Extend to other psychiatric conditions
- Integration: Combine with gene expression data
- Real-Time Analysis: Optimize for clinical diagnostic use
- Web Interface: Create tool for researchers to analyze their own data
- Shen et al. (2017). Detecting differentially methylated regions for methyl-seq data using Hidden Markov Model. Bioinformatics.
- Hannon et al. (2020). DNA methylation meta-analysis. GEO Accession GSE152027.
- Illumina HumanMethylation450 BeadChip technical documentation.
- Study this implementation for learning HMMs and Bayesian statistics
- Use the methodology for your own genomics research (with citation)
- Reference our visualization approaches
- Cite this work if using our implementation or methodology
- Credit all team members appropriately
- Respect that this is Cornell academic research
Murry, M., Smith, M., [Viviana], & [Greg]. (2024). DMR Detection Using
Non-Homogeneous Hidden Markov Models. Cornell University, CS 4775 Final Project.
Maya Murry
- Email: [email protected]
- LinkedIn: linkedin.com/in/maya-murry
- Portfolio: mayamurry.com
Project Repository: github.com/snedmagdous/dmr-detection-nhmm
This project was developed as part of Cornell University's CS 4775 (Computational Biology) under the guidance of [Professor Name].
Special thanks to:
- Shen et al. for the foundational HMM framework
- Hannon et al. for providing biological methylation data
- Cornell University Department of Computer Science
- Our research collaborators
Built with 🧬 by Maya Murry & Team | Demonstrating advanced statistical modeling and computational genomics expertise
Code: MIT License - Free to use with attribution
Research Content: Academic use only - Contact authors for commercial use
Data: Subject to original data provider licenses (GEO, Illumina)
Implementation:
- 90% Jupyter Notebooks (interactive development and analysis)
- 9% Python scripts (production pipeline)
- 1% Configuration and documentation
Computational Requirements:
- Python 3.8+
- ~8GB RAM (for full dataset)
- ~10GB disk space (including data)
- Optional: GPU for faster training (not required)
Training Time:
- Data simulation: ~10 minutes
- Model training: ~2-4 hours (CPU)
- Evaluation: ~5 minutes