🎊 MGM has been published in Advanced Science.
Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM supports a variety of tasks, including data preparation, model training, and inference, making it a versatile tool for microbiome research.
Install MGM using pip:

pip install microformer-mgm

Install MGM from the source code:

python setup.py install

The MicroCorpus-260K dataset includes 263,302 microbiome samples sourced from MGnify, ideal for training your own MGM model. It is available for download on OneDrive. The dataset includes:
- MicroCorpus-260K.pkl: Normalized microbiome corpus (mean and standard deviation across all samples).
- MicroCorpus-260K_unnorm.pkl: Unnormalized microbiome corpus.
- mgnify_biomes.csv: Metadata for the samples in the dataset.
Load the dataset in Python:
from pickle import load

# Open the file in a context manager so the handle is closed after loading
with open('MicroCorpus-260K.pkl', 'rb') as f:
    corpus = load(f)
corpus[0]  # Access the first sample (dict with input_ids and attention_mask)
abundance = corpus.data  # Access the abundance data

The MGM_Interpretability_Guideline.ipynb Jupyter Notebook provides a comprehensive guide for using MGM to perform interpretable microbiome analysis. It demonstrates how to:
- Fine-tune MGM on an infant microbiome dataset.
- Extract sample embeddings for visualization.
- Compute attention weights to identify keystone microbial genera.
This notebook is ideal for researchers seeking to understand MGM's interpretability features and apply them to their own datasets.
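As a rough illustration of the attention-based step, the sketch below ranks genera by the average attention they receive. It is not taken from the notebook: the nested-list layout `attention[head][query][key]`, the function name `rank_by_attention`, and the toy genus names and weights are assumptions made for illustration only.

```python
from collections import defaultdict


def rank_by_attention(attention, tokens):
    """Rank tokens by the mean attention they receive.

    attention: nested list indexed as attention[head][query][key]
               (an assumed layout, not necessarily what MGM returns).
    tokens: genus name for each sequence position.
    """
    n_heads = len(attention)
    n_queries = len(tokens)
    received = defaultdict(float)
    for head in attention:
        for query_row in head:
            for key_pos, weight in enumerate(query_row):
                # Average over all heads and all query positions.
                received[tokens[key_pos]] += weight / (n_heads * n_queries)
    return sorted(received.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: 1 head, 3 positions; each row sums to 1 like a softmax output.
toy_attention = [[
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]]
toy_tokens = ["g__Bifidobacterium", "g__Bacteroides", "g__Escherichia"]
ranking = rank_by_attention(toy_attention, toy_tokens)
print(ranking[0][0])  # g__Bifidobacterium
```

Genera at the top of such a ranking are candidates for closer inspection, not confirmed keystone taxa; the notebook remains the reference for how MGM computes and interprets attention.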
MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:
mgm <mode> [options]

Below, the modes are grouped into Data Preparation, Model Training, and Inference.
The construct mode converts abundance data into a microbiome corpus, normalized using phylogeny and ranked from high to low genus abundance.
- Input: Abundance data in hdf5, csv, or tsv format (features in rows, samples in columns)
- Output: A .pkl file containing the microbiome corpus
Example:
mgm construct -i data/abundance.csv -o data/corpus.pkl

Note: For hdf5 files, use -k to specify the key (default is genus).
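To make the ranking step concrete, here is a minimal sketch of turning one sample's genus abundances into a sequence ordered from high to low abundance. It only illustrates the idea: the function name `rank_genera`, the `max_len` cap, the choice to drop zero-abundance genera, and the toy values are all hypothetical, and the phylogeny-based normalization applied by `mgm construct` is not reproduced here.

```python
def rank_genera(abundance, max_len=None):
    """Return genus names sorted by decreasing abundance, dropping zeros."""
    present = [(genus, value) for genus, value in abundance.items() if value > 0]
    # Sort by abundance (descending); break ties by name for a stable result.
    present.sort(key=lambda item: (-item[1], item[0]))
    ranked = [genus for genus, _ in present]
    return ranked if max_len is None else ranked[:max_len]


# Hypothetical relative abundances for a single sample.
sample = {
    "g__Bacteroides": 0.42,
    "g__Prevotella": 0.0,
    "g__Faecalibacterium": 0.31,
    "g__Blautia": 0.27,
}
print(rank_genera(sample))
# ['g__Bacteroides', 'g__Faecalibacterium', 'g__Blautia']
```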
The pretrain mode pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, it trains a generator with labeled data.
- Input:
  - Microbiome corpus (.pkl)
  - Optional: Label file (.csv, two columns: sample ID and label)
- Output: Pretrained MGM model
Examples:
mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label

Note: Use --from-scratch to train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.
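The label file used by the labeled modes is a two-column CSV of sample ID and label. A minimal sketch of writing one with Python's standard library follows; the column names `sample_id` and `label`, the presence of a header row, and the sample IDs are assumptions, so check `mgm pretrain --help` for the exact layout expected.

```python
import csv
import os
import tempfile


def write_label_file(path, labels, header=("sample_id", "label")):
    """Write {sample_id: label} as a two-column CSV (header names are assumed)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for sample_id, label in labels.items():
            writer.writerow([sample_id, label])


# Hypothetical sample IDs and labels, written to a temporary directory.
labels = {"sample_001": "gut", "sample_002": "soil"}
label_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
write_label_file(label_path, labels)

# Read the file back to confirm its layout.
with open(label_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

The sample IDs in the label file should match those in the corpus so that each sample can be paired with its label.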
The train mode trains a supervised MGM model from scratch using labeled data.
- Input:
  - Microbiome corpus (.pkl)
  - Label file (.csv, two columns: sample ID and label)
- Output: Supervised MGM model
Example:
mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model

The finetune mode fine-tunes a pretrained MGM model for a specific task using labeled data.
- Input:
  - Microbiome corpus (.pkl)
  - Label file (.csv, two columns: sample ID and label)
  - Optional: Pretrained model (defaults to the MicroCorpus-260K pretrained model if not specified)
- Output: Fine-tuned MGM model
Example:
mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model

The predict mode generates predictions using a fine-tuned MGM model. Optionally, it evaluates them against ground truth labels.
- Input:
  - Microbiome corpus (.pkl)
  - Optional: Label file (.csv) for evaluation
  - Supervised MGM model
- Output: Prediction results (.csv)
Example:
mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv

Note: Use -E with a label file to compare predictions with ground truth.
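When labels are available, predictions can also be compared with ground truth by hand. The sketch below computes plain accuracy from two mappings of sample ID to label. It is not MGM's own evaluation code, and how `predictions.csv` is loaded into such a mapping depends on its column layout, which is not described here; the sample IDs and labels are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of shared sample IDs whose predicted label matches the truth."""
    shared = sorted(set(predicted) & set(truth))
    if not shared:
        raise ValueError("no sample IDs in common")
    correct = sum(predicted[s] == truth[s] for s in shared)
    return correct / len(shared)


# Hypothetical labels for four samples.
predicted = {"s1": "gut", "s2": "soil", "s3": "gut", "s4": "marine"}
truth = {"s1": "gut", "s2": "soil", "s3": "soil", "s4": "marine"}
score = accuracy(predicted, truth)
print(f"accuracy = {score:.2f}")  # accuracy = 0.75
```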
The generate mode generates synthetic microbiome data using a pretrained MGM model.
- Input:
  - Pretrained MGM model
  - Optional: Prompt file (.txt, one label per line) for labeled generation
- Output: Synthetic genus tensors (.pkl)
Example:
mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl

Note: Use -n to specify the number of samples to generate.
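For scripted runs, the prompt file and the command line can be assembled from Python. This sketch writes one label per line and builds the argument list for the generate example using only the flags shown in this document (`-m`, `-p`, `-n`, `-o`). The helper names and the label values are hypothetical, and the command is printed rather than executed.

```python
import os
import shlex
import tempfile


def write_prompt_file(path, labels):
    """Write one label per line, the layout described for prompt files."""
    with open(path, "w") as handle:
        handle.write("\n".join(labels) + "\n")


def build_generate_command(model, output, n_samples, prompt=None):
    """Build the argv list for `mgm generate` from the documented flags."""
    command = ["mgm", "generate", "-m", model]
    if prompt is not None:
        command += ["-p", prompt]
    command += ["-n", str(n_samples), "-o", output]
    return command


prompt_path = os.path.join(tempfile.mkdtemp(), "prompt.txt")
write_prompt_file(prompt_path, ["gut", "soil"])  # hypothetical labels
command = build_generate_command(
    "models/generator_model", "data/synthetic.pkl", 100, prompt=prompt_path
)
print(shlex.join(command))
```

With MGM installed, the resulting list can be passed to `subprocess.run(command, check=True)`. Passing arguments as a list avoids shell quoting problems with paths that contain spaces.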
The reconstruct mode reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.
- Input:
  - Abundance file (e.g., csv) for training the reconstructor, or a trained model checkpoint
  - Ranked corpus (.pkl) for reconstruction
  - Optional: Generator model and prompt file (text, one label per line) for labeled data
- Output:
  - Reconstructed corpus (.pkl)
  - Reconstructor model (if training)
  - Decoded labels (if applicable)
Examples:
mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed

For more details on any mode, run:

mgm <mode> --help

| Name | Email | Organization |
|---|---|---|
| Haohong Zhang | [email protected] | PhD Student, School of Life Science and Technology, HUST |
| Zixin Kang | [email protected] | Undergraduate, School of Life Science and Technology, HUST |
| Kang Ning | [email protected] | Professor, School of Life Science and Technology, HUST |