MGM

🎊 MGM has been published in Advanced Science.

Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM supports a variety of tasks, including data preparation, model training, and inference, making it a versatile tool for microbiome research.

Installation

From PyPI

Install MGM using pip:

pip install microformer-mgm

From Source

Install MGM from the source code by running the following from the repository root (the directory containing setup.py):

python setup.py install

MicroCorpus-260K

The MicroCorpus-260K dataset includes 263,302 microbiome samples sourced from MGnify, ideal for training your own MGM model. It is available for download on OneDrive. The dataset includes:

MicroCorpus-260K.pkl: Microbiome corpus normalized using the mean and standard deviation computed across all samples.
MicroCorpus-260K_unnorm.pkl: Unnormalized microbiome corpus.
mgnify_biomes.csv: Metadata for the samples in the dataset.

Loading MicroCorpus-260K

Load the dataset in Python:

from pickle import load

# Open the file in a context manager so the handle is closed after loading
with open('MicroCorpus-260K.pkl', 'rb') as f:
    corpus = load(f)

corpus[0]  # Access the first sample (dict with input_ids and attention_mask)
abundance = corpus.data  # Access the abundance data
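
To make the two fields of a sample more concrete, here is a minimal sketch that uses a hand-written stand-in for one sample rather than the real corpus. It assumes the common convention that `attention_mask` is 1 for real tokens and 0 for padding; the token IDs are invented for illustration, and the helper name `real_tokens` is hypothetical.

```python
# Hypothetical stand-in for corpus[0]; the IDs below are invented for illustration.
sample = {
    "input_ids": [2, 118, 57, 904, 3, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 0, 0, 0],
}


def real_tokens(sample):
    """Return the token IDs whose mask value is 1 (assumed to mean 'not padding')."""
    return [
        token
        for token, keep in zip(sample["input_ids"], sample["attention_mask"])
        if keep == 1
    ]


tokens = real_tokens(sample)
print(len(tokens), tokens)
```

If the values in the real corpus are tensors or arrays rather than lists, they would likely need to be converted to lists first.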

Example Notebook

The MGM_Interpretability_Guideline.ipynb Jupyter Notebook provides a comprehensive guide for using MGM to perform interpretable microbiome analysis. It demonstrates how to:

Fine-tune MGM on an infant microbiome dataset.
Extract sample embeddings for visualization.
Compute attention weights to identify keystone microbial genera.

This notebook is ideal for researchers seeking to understand MGM's interpretability features and apply them to their own datasets.

CLI Usage

MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:

mgm <mode> [options]

Below, the modes are grouped into Data Preparation, Model Training, and Inference for better organization.

Data Preparation

`construct`

Converts abundance data into a microbiome corpus. Abundances are normalized using phylogeny, and the genera in each sample are ranked from highest to lowest abundance.
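
The ranking idea can be illustrated with a small sketch. This is a conceptual example, not MGM's implementation: it skips the phylogeny-based normalization entirely, and dropping zero-abundance genera and breaking ties alphabetically are assumptions made here. The genus names and values are made up.

```python
def rank_genera(abundance):
    """Order genus names from highest to lowest abundance, skipping absent genera."""
    present = {genus: value for genus, value in abundance.items() if value > 0}
    # Sort by abundance descending; ties are broken alphabetically for determinism.
    return sorted(present, key=lambda genus: (-present[genus], genus))


# Made-up relative abundances for a single sample.
sample_abundance = {
    "g__Bacteroides": 0.42,
    "g__Prevotella": 0.0,
    "g__Blautia": 0.13,
    "g__Faecalibacterium": 0.31,
}
print(rank_genera(sample_abundance))
```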

Input: Abundance data in hdf5, csv, or tsv format (features in rows, samples in columns)
Output: A .pkl file containing the microbiome corpus

Example:

mgm construct -i data/abundance.csv -o data/corpus.pkl

Note: For hdf5 files, use -k to specify the key (default is genus).
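
Because `construct` expects features in rows and samples in columns, a table stored the other way round needs to be transposed first. The following is a minimal sketch using only the standard library; the function name `transpose_table` and the example values are hypothetical, and whether `construct` expects particular header or ID naming is not covered here.

```python
import csv
import io


def transpose_table(text, delimiter=","):
    """Swap rows and columns of a delimited table given as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(zip(*rows))  # zip(*rows) turns columns into rows
    return out.getvalue()


# Samples in rows, genera in columns (the orientation to convert away from).
samples_by_genus = (
    "sample,g__Bacteroides,g__Blautia\n"
    "S1,0.42,0.13\n"
    "S2,0.10,0.55\n"
)
print(transpose_table(samples_by_genus))
```

The result has one genus per row and one sample per column, matching the orientation described for the input.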

Model Training

`pretrain`

Pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, trains a generator with labeled data.

Input:
- Microbiome corpus (.pkl)
- Optional: Label file (.csv, two columns: sample ID and label)
Output: Pretrained MGM model

Examples:

mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label

Note: Use --from-scratch to train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.
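
For readers unfamiliar with causal language modeling, the sketch below shows the general shape of the objective: each position is trained to predict the token that follows it. This illustrates the technique in general and is not MGM's training loop; the token IDs and the helper name `next_token_pairs` are made up.

```python
def next_token_pairs(token_ids):
    """Pair each position's token with the token that follows it."""
    inputs = token_ids[:-1]   # everything except the last token
    targets = token_ids[1:]   # the same sequence shifted left by one
    return list(zip(inputs, targets))


# Invented token IDs standing in for one ranked sample.
ranked_sample = [2, 118, 57, 904, 3]
for current, following in next_token_pairs(ranked_sample):
    # In a real model the prediction conditions on all tokens up to `current`.
    print(f"context ending in {current} -> predict {following}")
```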

`train`

Trains a supervised MGM model from scratch using labeled data.

Input:
- Microbiome corpus (.pkl)
- Label file (.csv, two columns: sample ID and label)
Output: Supervised MGM model

Example:

mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model
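
A two-column label file can be produced with the standard library, as in the sketch below. The header names `sample_id` and `label` and the example labels are assumptions; the text above only states that the file has two columns (sample ID and label), so check `mgm train --help` for whether a header row is expected.

```python
import csv
import io


def labels_to_csv(labels, header=("sample_id", "label")):
    """Serialize a {sample ID: label} mapping as two-column CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)  # header names are an assumption, not confirmed by MGM
    for sample_id, label in labels.items():
        writer.writerow([sample_id, label])
    return out.getvalue()


csv_text = labels_to_csv({"S1": "gut", "S2": "soil"})
print(csv_text)
# To save it: open("data/labels.csv", "w", newline="").write(csv_text)
```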

`finetune`

Finetunes a pretrained MGM model for a specific task using labeled data.

Input:
- Microbiome corpus (.pkl)
- Label file (.csv, two columns: sample ID and label)
- Optional: Pretrained model (defaults to MicroCorpus-260K pretrained model if not specified)
Output: Finetuned MGM model

Example:

mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model

Inference

`predict`

Generates predictions using a finetuned MGM model. Optionally evaluates against ground truth labels.

Input:
- Microbiome corpus (.pkl)
- Optional: Label file (.csv) for evaluation
- Supervised MGM model
Output: Prediction results (.csv)

Example:

mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv

Note: Use -E with a label file to compare predictions with ground truth.
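
The idea behind comparing predictions with ground truth can be sketched as follows. This is only an illustration of one possible metric (accuracy over shared sample IDs); which metrics `-E` actually reports and the layout of the prediction file are not stated here, so the dictionaries below are hypothetical stand-ins.

```python
def accuracy(predicted, truth):
    """Fraction of shared sample IDs whose predicted label matches the true label."""
    shared = predicted.keys() & truth.keys()
    if not shared:
        raise ValueError("no sample IDs in common")
    correct = sum(predicted[s] == truth[s] for s in shared)
    return correct / len(shared)


# Made-up predictions and labels keyed by sample ID.
predicted = {"S1": "gut", "S2": "soil", "S3": "gut", "S4": "marine"}
truth = {"S1": "gut", "S2": "gut", "S3": "gut", "S4": "marine"}
print(accuracy(predicted, truth))
```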

`generate`

Generates synthetic microbiome data using a pretrained MGM model.

Input:
- Pretrained MGM model
- Optional: Prompt file (.txt, one label per line) for labeled generation
Output: Synthetic genus tensors (.pkl)

Example:

mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl

Note: Use -n to specify the number of samples to generate.
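
A prompt file with one label per line can be written as in this minimal sketch. The labels shown are made up; presumably they should match labels the generator model was trained with, but that is an assumption. The file is written to the system temporary directory only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def write_prompt_file(path, labels):
    """Write one label per line, as described for the prompt file."""
    Path(path).write_text("\n".join(labels) + "\n", encoding="utf-8")


# Hypothetical labels; replace with the ones used for your generator model.
prompt_labels = ["gut", "soil"]
prompt_path = Path(tempfile.gettempdir()) / "prompt.txt"
write_prompt_file(prompt_path, prompt_labels)
print(prompt_path.read_text(encoding="utf-8"))
```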

`reconstruct`

Reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.

Input:
- Abundance file (e.g., csv) for training the reconstructor, or a trained model checkpoint.
- Ranked corpus (.pkl) for reconstruction
- Optional: Generator model and prompt file (text, one label per line) for labeled data
Output:
- Reconstructed corpus (.pkl)
- Reconstructor model (if training)
- Decoded labels (if applicable)

Examples:

mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed

For more details on any mode, run:

mgm <mode> --help
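
When several modes are chained, it can be convenient to drive the CLI from Python. The sketch below builds the command lines for a construct, finetune, and predict sequence using only flags that appear in the examples above. The helper names `mgm_command` and `run_pipeline` are hypothetical, and by default the commands are only printed, since running them requires MGM to be installed and the input files to exist. Reusing the same corpus for finetuning and prediction is a simplification for brevity.

```python
import shlex
import subprocess


def mgm_command(mode, **options):
    """Build an argv list for `mgm <mode>` from short-flag options."""
    argv = ["mgm", mode]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


def run_pipeline(pipeline, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv in pipeline:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stops at the first failing step


pipeline = [
    mgm_command("construct", i="data/abundance.csv", o="data/corpus.pkl"),
    mgm_command("finetune", i="data/corpus.pkl", l="data/labels.csv",
                m="models/pretrained_model", o="models/finetuned_model"),
    mgm_command("predict", i="data/corpus.pkl",
                m="models/finetuned_model", o="data/predictions.csv"),
]

run_pipeline(pipeline)  # dry run: prints the three commands without executing them
```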

Maintainers

| Name | Email | Organization |
|---|---|---|
| Haohong Zhang | [email protected] | PhD Student, School of Life Science and Technology, HUST |
| Zixin Kang | [email protected] | Undergraduate, School of Life Science and Technology, HUST |
| Kang Ning | [email protected] | Professor, School of Life Science and Technology, HUST |
