This is the official implementation of the ICML 2025 poster paper "SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model".
You can create a conda environment using the provided environment.yml file to ensure all dependencies are properly installed:
conda env create -f environment.yml
conda activate space
We use the dataset from Basenji, which is originally distributed in TensorFlow data format and requires users to pay download costs. We have converted the data to H5 format and made it freely available for download on 🤗 Hugging Face: https://huggingface.co/datasets/yangyz1230/space.
Note: The original data we provide is in compressed H5 format, which is poorly suited to parallel data loading with dataloaders during training. Our repository includes preprocessing code that converts the compressed H5 format to byte streams; see SPACE/dataloaders/h5dataset.py, lines 120 to 131 at commit 6fc1aee.
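The idea of trading compressed storage for plain byte streams can be illustrated with a minimal sketch. This is not the code in h5dataset.py; the helper names `array_to_record` and `record_to_array` are hypothetical, and the sample array is a stand-in for one decompressed entry read from the H5 file. It assumes only NumPy.

```python
import numpy as np

def array_to_record(arr):
    """Serialize one sample to raw bytes plus the metadata needed to decode it."""
    arr = np.ascontiguousarray(arr)
    return {"data": arr.tobytes(), "dtype": str(arr.dtype), "shape": arr.shape}

def record_to_array(record):
    """Rebuild the array from a record; no file handle or decompression is needed."""
    flat = np.frombuffer(record["data"], dtype=record["dtype"])
    return flat.reshape(record["shape"])

# Hypothetical stand-in for one decompressed sample from the H5 file.
sample = np.arange(12, dtype=np.float32).reshape(3, 4)
record = array_to_record(sample)
restored = record_to_array(record)
```

Because each record is a self-contained byte string, separate dataloader workers can decode samples independently without sharing an open H5 handle, at the cost of more disk or memory than the compressed file.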
We have uploaded our model config and weights to 🤗 Hugging Face Hub at: https://huggingface.co/yangyz1230/space. You can easily load the pre-trained model using the following code:
from model.modeling_space import Space
model_name_or_path = "yangyz1230/space"
model = Space.from_pretrained(model_name_or_path)
We provide code for genomic profile prediction in both test_Enformer.ipynb and test_SPACE.ipynb. You can then visualize the predicted genomic profiles with visulization.py.
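Enformer-style profile predictors take one-hot encoded DNA as input. The following is a minimal sketch of such an encoding, assuming an A, C, G, T channel order and all-zero rows for ambiguous bases such as N. The helper `one_hot_encode` is hypothetical, and whether Space expects exactly this layout is an assumption; test_SPACE.ipynb is the authoritative reference for the actual input format.

```python
import numpy as np

# Assumed channel order (A, C, G, T); any other character, e.g. N, maps to all zeros.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence):
    """Return a (length, 4) float32 one-hot matrix for a DNA string."""
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASE_TO_INDEX.get(base)
        if index is not None:
            encoded[position, index] = 1.0
    return encoded

example = one_hot_encode("ACGTN")
```

Here `example` has shape (5, 4), with one nonzero entry in each of the first four rows and an all-zero last row for the N.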
If you are specifically interested in reproducing Enformer and similar models like Borzoi with minimal setup, please refer to our simplified repository: Enformer_Borzoi_Training_PyTorch. This repository provides streamlined training scripts with minimal modifications to the original Hugging Face Trainer, making it easier to get started with genomics model training.
You can train a SPACE model from scratch:
bash train.sh
We provide code for reproducing downstream tasks in the experiments/ folder. Before running it, make sure your downstream task data is stored in the datasets/ directory.
For example, to reproduce the NT benchmark results:
- Place the NT dataset in datasets/NT/
- Run the following command:
bash experiments/NT/expr_NT.sh
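Since the experiment scripts depend on data being present under datasets/, a small pre-flight check can catch a missing or empty directory before a run starts. This is a sketch only: `check_dataset_dir` is a hypothetical helper that is not part of the repository, and it verifies only that the directory exists and is non-empty, not that its contents are valid.

```python
from pathlib import Path

def check_dataset_dir(task, root="datasets"):
    """Return the task's data directory, or raise if it is missing or empty."""
    task_dir = Path(root) / task
    if not task_dir.is_dir():
        raise FileNotFoundError(f"Expected dataset directory {task_dir} does not exist")
    if not any(task_dir.iterdir()):
        raise FileNotFoundError(f"Dataset directory {task_dir} is empty")
    return task_dir
```

For the NT example, calling `check_dataset_dir("NT")` from the repository root before launching the script would raise a FileNotFoundError if datasets/NT/ has not been populated.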
Our implementation is based on Enformer-Pytorch and Mod-Squad. We thank the authors for their excellent work.
If you find our work, code, or released data helpful, please cite:
@inproceedings{yang2025space,
title={{SPACE}: Your Genomic Profile Predictor is a Powerful {DNA} Foundation Model},
author={Zhao Yang and Jiwei Zhu and Bing Su},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=o4L9y4Jetm}
}