ZhuJiwei111/SPACE

🧬 SPACE

This is the official implementation of the ICML 2025 poster paper "SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model".

Environment Setup

You can create a conda environment using the provided environment.yml file to ensure all dependencies are properly installed:

conda env create -f environment.yml
conda activate space

Dataset

We use the dataset from Basenji, which is originally distributed in TensorFlow data format and requires users to pay the download costs. We have converted the data to H5 format and made it freely available for download on 🤗 Hugging Face: https://huggingface.co/datasets/yangyz1230/space.
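
One possible way to fetch the files programmatically is the `huggingface_hub` client. This is a minimal sketch, not the repository's own download script: the helper name `download_space_dataset` and the target folder `datasets/space` are hypothetical, and only the repo id comes from the link above.

```python
REPO_ID = "yangyz1230/space"  # dataset repo id taken from the link above


def download_space_dataset(local_dir="datasets/space"):
    """Download all files of the dataset repo and return the local folder path.

    `local_dir` is a hypothetical location; point it wherever you keep data.
    """
    # Imported lazily so this module still loads without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the link points at a dataset repo, not a model repo
        local_dir=local_dir,
    )
```

Calling `download_space_dataset()` needs network access and enough disk space for the full dataset.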

Note: The original data we provide is in compressed H5 format, which is not conducive to parallel data loading with dataloaders during training. We have provided preprocessing code in our repository to convert the compressed H5 format to byte streams:

def preprocess_data(self):
    print(f"There is no preprocessed data at {self.preprocess_data_path}, so we will preprocess it now.")
    print("Preprocessing data...")
    with open(self.preprocess_data_path, 'wb') as f:
        for idx in tqdm(range(len(self))):
            item = self.process(idx)
            x = item['x']
            labels = item['labels']
            # Write the data
            f.write(x.tobytes())
            f.write(labels.tobytes())
You can also convert the H5 data yourself to formats suitable for large-scale training, such as WebDataset.
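
Because each sample is written as the bytes of `x` followed by the bytes of `labels`, a record can be read back by seeking to a fixed offset, provided every sample has the same shape and dtype. The sketch below illustrates that idea under those assumptions. The helper names `write_records` and `read_record` are hypothetical, and the real shapes and dtypes are not stated here, so they must be taken from the dataset code.

```python
import numpy as np


def write_records(path, xs, labels):
    """Write samples in the same layout as above: x bytes, then label bytes."""
    with open(path, "wb") as f:
        for x, y in zip(xs, labels):
            f.write(x.tobytes())
            f.write(y.tobytes())


def read_record(path, idx, x_shape, x_dtype, y_shape, y_dtype):
    """Read sample `idx`, assuming all records share one fixed shape and dtype."""
    x_dtype, y_dtype = np.dtype(x_dtype), np.dtype(y_dtype)
    x_nbytes = int(np.prod(x_shape)) * x_dtype.itemsize
    y_nbytes = int(np.prod(y_shape)) * y_dtype.itemsize
    with open(path, "rb") as f:
        # Every record has the same size, so its offset is a simple product.
        f.seek(idx * (x_nbytes + y_nbytes))
        x = np.frombuffer(f.read(x_nbytes), dtype=x_dtype).reshape(x_shape)
        y = np.frombuffer(f.read(y_nbytes), dtype=y_dtype).reshape(y_shape)
    return x, y
```

Each worker can open the file independently and seek to its own records, which is what makes this layout friendlier to parallel dataloaders than compressed H5.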

Pre-trained Model

We have uploaded our model config and weights to 🤗 Hugging Face Hub at: https://huggingface.co/yangyz1230/space. You can easily load the pre-trained model using the following code:

from model.modeling_space import Space
model_name_or_path = "yangyz1230/space"
model = Space.from_pretrained(model_name_or_path)

Try the Genomic Profile Prediction

We provide code for genomic profile prediction in both test_Enformer.ipynb and test_SPACE.ipynb. You can then visualize the predicted genomic profiles with visulization.py.

Quick Start for Enformer/Borzoi Training

If you are specifically interested in reproducing Enformer and similar models like Borzoi with minimal setup, please refer to our simplified repository: Enformer_Borzoi_Training_PyTorch. This repository provides streamlined training scripts with minimal modifications to the original Hugging Face Trainer, making it easier to get started with genomics model training.

SPACE Pre-training

You can train a SPACE model from scratch:

bash train.sh

Downstream Tasks

We provide code for reproducing downstream tasks in the folder experiments/. Make sure your downstream task data is stored in the datasets/ directory before running it.

For example, to reproduce the NT benchmark results:

  1. Place the NT dataset in datasets/NT/
  2. Run the following command:
bash experiments/NT/expr_NT.sh

Acknowledgments

Our implementation is based on Enformer-Pytorch and Mod-Squad. We thank the authors for their excellent work.

Citation

If you find our work, code, or released data helpful, please cite:

@inproceedings{yang2025space,
      title={{SPACE}: Your Genomic Profile Predictor is a Powerful {DNA} Foundation Model},
      author={Zhao Yang and Jiwei Zhu and Bing Su},
      booktitle={Forty-second International Conference on Machine Learning},
      year={2025},
      url={https://openreview.net/forum?id=o4L9y4Jetm}
}
