This folder contains scripts and files related to the generation, filtering, and usage of multiple sequence alignments (MSAs) for the OLF domain.
Note: we provide the finalized dataset at subsampled_MSA.a2m. The steps below are only needed to reproduce our results.
Ensure you have the EVcouplings repository cloned into this directory, as we use it for running hmmer:

    git clone https://github.com/debbiemarkslab/EVcouplings.git

Ensure that hmmer is installed and available in the folder profilehmm/hmmer-3.4. You also need to download the UniRef100 database and place it at profilehmm/hmmer-3.4/bin/uniref100.fasta.
You also need to create a conda environment using the msagen.yml file provided:

    conda env create -f msagen.yml
    conda activate msagen

For an example of how to run hmmer through EVcouplings, refer to scripts/msagen.py.
The MSA generated from this is located at results/OLF_HUMAN/align/OLF_HUMAN.a2m.
Run filtering.py to remove duplicate sequences and drop sequences shorter than 200 residues. This generates a filtered MSA at OLF_filtered.a2m.
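
For readers who want to see what this kind of filter looks like, here is a minimal sketch. It is not the contents of filtering.py. The function names are hypothetical, and two details are assumptions: that sequence length means the number of non-gap characters in an aligned row, and that duplicates are rows with identical aligned strings.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA/A2M-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_residues(aligned_seq):
    """Count residues, ignoring the gap characters '-' and '.' (assumption)."""
    return sum(1 for ch in aligned_seq if ch not in "-.")


def filter_msa(records, min_residues=200):
    """Keep the first copy of each aligned row with at least min_residues residues."""
    seen = set()
    kept = []
    for header, seq in records:
        if count_residues(seq) < min_residues:
            continue  # too short
        if seq in seen:
            continue  # duplicate of a row already kept (assumed definition)
        seen.add(seq)
        kept.append((header, seq))
    return kept


def filter_file(in_path, out_path, min_residues=200):
    """Read an A2M file, filter it, and write the surviving rows."""
    with open(in_path) as handle:
        kept = filter_msa(read_fasta(handle), min_residues)
    with open(out_path, "w") as handle:
        for header, seq in kept:
            handle.write(f">{header}\n{seq}\n")
    return len(kept)
```

Under these assumptions, calling `filter_file("results/OLF_HUMAN/align/OLF_HUMAN.a2m", "OLF_filtered.a2m")` would mirror the step described above.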
We use MMseqs2 to cluster the MSA.
See the main README.md for additional details on running MMseqs2.
The clustering step generates the subsampled MSA at subsampled_MSA.a2m.
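
As a rough illustration of the clustering step, the sketch below builds an `mmseqs easy-cluster` command from Python. The exact invocation and parameters we used are documented in the main README.md, not here. The helper names are hypothetical, the identity threshold is deliberately left as a required argument, and stripping gaps before clustering is an assumption about how the input would be prepared.

```python
import subprocess


def ungap(aligned_seq):
    """Strip the gap characters '-' and '.' and uppercase the residues."""
    return aligned_seq.replace("-", "").replace(".", "").upper()


def build_easy_cluster_cmd(fasta_in, out_prefix, tmp_dir, min_seq_id):
    """Build an `mmseqs easy-cluster` command line as an argument list."""
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0 and 1")
    return [
        "mmseqs", "easy-cluster",
        str(fasta_in), str(out_prefix), str(tmp_dir),
        "--min-seq-id", str(min_seq_id),
    ]


def run_easy_cluster(fasta_in, out_prefix, tmp_dir, min_seq_id):
    """Run MMseqs2; requires the `mmseqs` binary to be on PATH."""
    cmd = build_easy_cluster_cmd(fasta_in, out_prefix, tmp_dir, min_seq_id)
    subprocess.run(cmd, check=True)
```

With `easy-cluster`, MMseqs2 writes the cluster representatives to `<out_prefix>_rep_seq.fasta`. Mapping those representative IDs back to rows of the aligned MSA is one way a subsampled alignment could be produced.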

| File/Folder | Description |
|---|---|
| fasta/OLF.fasta | FASTA file for the OLF domain. |
| scripts/msagen.py | Script to run hmmer through EVcouplings. |
| scripts/config_files/OLF_HUMAN.txt | Configuration file specifying hmmer parameters. |
| results/OLF_HUMAN/align/OLF_HUMAN.a2m | Output MSA from EVcouplings. |
| filtering.py | First round of filtering: removes duplicate sequences and drops those shorter than 200 residues. |
| OLF_filtered.a2m | Filtered MSA after removing short/duplicate sequences. |
| subsampled_MSA.a2m | Subsampled MSA after MMseqs2 clustering. |

To see the sequences used for phylogeny, refer to:
../phylo/MSA_80cluster.fas