HiCMamba: Enhancing Hi-C Resolution and Identifying 3D Genome Structures with State Space Modeling
- Download raw Hi-C data and set the environment variables
- Download the raw Hi-C data from GSE62525. GM12878 and K562 are used in this work.
- Create the RAW_dir directory under your root directory and then unzip the raw Hi-C data into this directory.
- Set the related variables in dataset_information.py, including root_dir and RAW_dir.
- Also set the directory names used to store the different data files:
- RAW_dir: stores the raw Hi-C data.
- hic_matrix_dir: stores the Hi-C matrices in npz format.
- data_dir: stores the data for training, validation and testing.
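As a rough illustration of the settings described above, the path variables might look like the sketch below. Only the names root_dir, RAW_dir, hic_matrix_dir and data_dir come from this README; the example values, the make_dirs helper, and the choice to store full paths rather than bare directory names are assumptions, so check them against the actual dataset_information.py.

```python
# Hypothetical sketch of the path settings in dataset_information.py.
# Variable names follow the README; every value below is a placeholder.
import os

# Placeholder location: replace with your own root directory.
root_dir = os.path.join(os.path.expanduser("~"), "HiCMamba_data")

RAW_dir = os.path.join(root_dir, "RAW")                 # raw Hi-C data
hic_matrix_dir = os.path.join(root_dir, "hic_matrix")   # Hi-C matrices in .npz format
data_dir = os.path.join(root_dir, "data")               # training, validation and test data


def make_dirs(paths):
    """Create each directory in `paths` if it does not already exist."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
```

Calling make_dirs([RAW_dir, hic_matrix_dir, data_dir]) once before preprocessing would create the three directories; the raw data still has to be unzipped into RAW_dir by hand.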
- Run data_processing.Preprocess.py to process the raw Hi-C data and generate the data for training and testing.
- The script consists of four key steps:
- Read the raw data files from RAW_dir and save them as numpy matrices (.npz files) in hic_matrix_dir.
- Read the high-coverage numpy matrices and downsample them.
- Normalize the data.
- Divide the data into 40 * 40 submatrices.
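The last three steps (downsampling, normalization, division) can be sketched as follows. This is a minimal illustration, not the code in Preprocess.py: the function names, the 1/16 downsampling ratio, the clip-and-rescale normalization with a cutoff of 255, and the use of non-overlapping tiles are all assumptions borrowed from common practice in Hi-C enhancement pipelines.

```python
import numpy as np


def downsample(matrix, ratio=16, seed=0):
    """Keep each read with probability 1/ratio (binomial thinning).

    Assumes `matrix` is a symmetric contact map of non-negative integer counts.
    """
    rng = np.random.default_rng(seed)
    counts = np.triu(matrix).astype(np.int64)    # upper triangle incl. diagonal
    thinned = rng.binomial(counts, 1.0 / ratio)  # zeros stay zero
    return thinned + np.triu(thinned, k=1).T     # mirror to restore symmetry


def normalize(matrix, cutoff=255):
    """Clip counts at `cutoff` and rescale to the range [0, 1]."""
    return np.minimum(matrix, cutoff) / cutoff


def divide(matrix, size=40):
    """Split a matrix into non-overlapping size x size tiles (ragged edges dropped)."""
    n = (matrix.shape[0] // size) * size
    m = (matrix.shape[1] // size) * size
    trimmed = matrix[:n, :m]
    tiles = trimmed.reshape(n // size, size, m // size, size).swapaxes(1, 2)
    return tiles.reshape(-1, size, size)


if __name__ == "__main__":
    # Synthetic symmetric "high-coverage" matrix, for demonstration only.
    rng = np.random.default_rng(1)
    upper = np.triu(rng.integers(0, 500, size=(100, 100)))
    high = upper + np.triu(upper, k=1).T
    low = downsample(high)
    tiles = divide(normalize(low))
    print(tiles.shape)  # (4, 40, 40)
```

Thinning only the upper triangle and mirroring it keeps the low-coverage matrix symmetric, which independent sampling of both halves would not.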
- The script can be executed using the following commands:
python -m data_processing.Preprocess -c GM12878
python -m data_processing.Preprocess -c K562
The preprocessed data is available on Google Drive. To use it, download the data and move it into the data_processing directory.
- Training
python train.py
- Testing
python test.py
- Python 3.8.18
- Pytorch 1.13.0+cu117
- causal-conv1d 1.0.0
- mamba_ssm 1.0.1
- Numpy 1.24.4
- Scipy 1.10.1
- Pandas 2.0.3
- Scikit-learn 1.3.2
- Matplotlib 3.7.5
- tqdm 4.66.2
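To confirm that an environment matches the versions listed above, a small check along these lines may help. It is a sketch that relies only on the standard-library importlib.metadata module; the distribution names used as keys (for example "torch" for Pytorch and "mamba-ssm" for mamba_ssm) are assumptions about how the packages are registered with pip, and check_versions is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Versions copied from the list above. The keys are assumed pip distribution
# names and may need adjusting for your installation.
REQUIRED = {
    "torch": "1.13.0+cu117",
    "causal-conv1d": "1.0.0",
    "mamba-ssm": "1.0.1",
    "numpy": "1.24.4",
    "scipy": "1.10.1",
    "pandas": "2.0.3",
    "scikit-learn": "1.3.2",
    "matplotlib": "3.7.5",
    "tqdm": "4.66.2",
}


def check_versions(required, lookup=metadata.version):
    """Return {name: (wanted, found)} for each missing or mismatched package."""
    problems = {}
    for name, wanted in required.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(REQUIRED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the exact pinned version; newer versions may also work, but that is not something this README states.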
We strongly recommend downloading causal-conv1d and mamba_ssm from the BaiduNetdisk link provided by VM-UNet, and then installing the packages using:
pip install xxx.whl
We express our gratitude to the authors of HiCARN for sharing their open-source code.