🧬 SPACE

This is the official implementation of the ICML 2025 poster paper "SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model".

Environment Setup

You can create a conda environment using the provided environment.yml file to ensure all dependencies are properly installed:

conda env create -f environment.yml
conda activate space

Dataset

We use the Basenji dataset, which is originally distributed in TensorFlow data format and requires users to pay the download costs. We have converted the data to H5 format and made it freely available for download on 🤗 Hugging Face: https://huggingface.co/datasets/yangyz1230/space.
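As a minimal sketch of fetching the converted files programmatically, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The function name `download_space_dataset` and the target directory `datasets/space` are assumptions made for illustration, and `huggingface_hub` may need to be installed separately if it is not already in your environment.

```python
DATASET_REPO_ID = "yangyz1230/space"  # dataset repo named in this README


def download_space_dataset(local_dir="datasets/space"):
    """Download the H5 files from the Hugging Face dataset repo.

    `local_dir` is a hypothetical target directory; adjust it to your layout.
    Returns the local path that holds the downloaded files.
    """
    # Imported lazily so the sketch can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",  # the repo is a dataset, not a model
        local_dir=local_dir,
    )
```

Calling `download_space_dataset()` starts the download, which may be large, so check available disk space first.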

Note: The original data we provide is in compressed H5 format, which is not conducive to parallel data loading with dataloaders during training. We have provided preprocessing code in our repository to convert the compressed H5 format to byte streams:

SPACE/dataloaders/h5dataset.py (lines 120 to 131 in 6fc1aee):

def preprocess_data(self):
    print(f"There is no preprocessed data at {self.preprocess_data_path}, so we will preprocess it now.")
    print("Preprocessing data...")
    with open(self.preprocess_data_path, 'wb') as f:
        for idx in tqdm(range(len(self))):
            item = self.process(idx)
            x = item['x']
            labels = item['labels']
            # Write the data
            f.write(x.tobytes())
            f.write(labels.tobytes())

You can also convert the H5 data yourself to formats suitable for large-scale training, such as WebDataset.
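Because the preprocessing step writes each sample as the raw bytes of `x` followed immediately by the raw bytes of `labels`, the file can be read back with random access if every record has the same size. The following is a minimal sketch of that idea using a NumPy structured dtype and `np.memmap`. It is not the repository's loader: the helper name `open_preprocessed` is hypothetical, and the shapes and dtypes are assumptions that must match what was actually written.

```python
import numpy as np


def open_preprocessed(path, x_shape, labels_shape,
                      x_dtype=np.float32, labels_dtype=np.float32):
    """Memory-map a file of back-to-back (x, labels) records.

    All shapes and dtypes here are assumptions; they must match the
    arrays that were serialized with tobytes() during preprocessing.
    """
    # One packed record: the bytes of x followed by the bytes of labels.
    # NumPy structured dtypes are unpadded by default, matching the writer.
    record = np.dtype([
        ("x", x_dtype, x_shape),
        ("labels", labels_dtype, labels_shape),
    ])
    # The number of records is inferred from the file size; NumPy raises
    # ValueError if the size is not a multiple of the record size.
    return np.memmap(path, dtype=record, mode="r")
```

With the memory map open, `records[i]["x"]` and `records[i]["labels"]` return sample `i` without loading the whole file, which is what makes this layout friendly to multi-worker dataloaders.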

Pre-trained Model

We have uploaded our model config and weights to 🤗 Hugging Face Hub at: https://huggingface.co/yangyz1230/space. You can easily load the pre-trained model using the following code:

from model.modeling_space import Space
model_name_or_path = "yangyz1230/space"
model = Space.from_pretrained(model_name_or_path)

Try the Genomic Profile Prediction

We provide code for genomic profile prediction in both test_Enformer.ipynb and test_SPACE.ipynb. You can then visualize the predicted genomic profiles with visulization.py.

Quick Start for Enformer/Borzoi Training

If you are specifically interested in reproducing Enformer and similar models like Borzoi with minimal setup, please refer to our simplified repository: Enformer_Borzoi_Training_PyTorch. This repository provides streamlined training scripts with minimal modifications to the original Hugging Face Trainer, making it easier to get started with genomics model training.

SPACE Pre-training

You can train a SPACE model from scratch:

bash train.sh

Downstream Tasks

We provide code for reproducing downstream tasks in the experiments/ folder. Make sure your downstream task data is stored in the datasets/ directory.

For example, to reproduce the NT benchmark results:

1. Place the NT dataset in datasets/NT/
2. Run the following command:

bash experiments/NT/expr_NT.sh
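If you want to fail early when the data is missing, a small pre-flight check can wrap the command above. This is only a sketch: the helper name `nt_benchmark_command` is hypothetical, and it assumes the directory layout described in this section (datasets/NT/ and experiments/NT/expr_NT.sh relative to the repository root).

```python
from pathlib import Path


def nt_benchmark_command(repo_root="."):
    """Return the NT benchmark command after checking the expected layout.

    Raises FileNotFoundError if datasets/NT/ is missing or empty, or if
    the experiment script cannot be found under repo_root.
    """
    root = Path(repo_root)
    data_dir = root / "datasets" / "NT"
    script = root / "experiments" / "NT" / "expr_NT.sh"

    # The data directory must exist and contain at least one entry.
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        raise FileNotFoundError(f"Expected NT data in {data_dir}")
    if not script.is_file():
        raise FileNotFoundError(f"Missing experiment script: {script}")

    # Same command as in the README, meant to be run from repo_root.
    return ["bash", "experiments/NT/expr_NT.sh"]
```

The returned list can be passed to `subprocess.run(cmd, check=True, cwd=repo_root)` to launch the experiment once the checks pass.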

Acknowledgments

Our implementation is based on Enformer-Pytorch and Mod-Squad. We thank the authors for their excellent work.

Citation

If you find our work, code, or released data helpful, please cite:

@inproceedings{yang2025space,
      title={{SPACE}: Your Genomic Profile Predictor is a Powerful {DNA} Foundation Model},
      author={Zhao Yang and Jiwei Zhu and Bing Su},
      booktitle={Forty-second International Conference on Machine Learning},
      year={2025},
      url={https://openreview.net/forum?id=o4L9y4Jetm}
}
