MSA Data Collection and Filtering

This folder contains scripts and files related to the generation, filtering, and usage of multiple sequence alignments (MSAs) for the OLF domain.

Step 1: Get MSA from hmmer

Note: we provide the finalized dataset at `subsampled_MSA.a2m`. These steps are only to reproduce our results.

Ensure you have the EVcouplings repository cloned into this directory, as we use it for running hmmer.

git clone https://github.com/debbiemarkslab/EVcouplings.git

Ensure that hmmer is installed in the folder `profilehmm/hmmer-3.4`. You also need to download the UniRef100 database and place it at `profilehmm/hmmer-3.4/bin/uniref100.fasta`.

You also need to create a conda environment using the msagen.yml file provided:

conda env create -f msagen.yml
conda activate msagen
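Before running Step 1, it can help to confirm that the files and folders named above are in place. The following is a minimal sketch, not part of the repository: the function name `missing_prerequisites` is hypothetical, and it only checks that the paths mentioned in this README exist relative to this folder.

```python
from pathlib import Path

# Paths named in the setup instructions, relative to this folder.
REQUIRED_PATHS = [
    "EVcouplings",
    "profilehmm/hmmer-3.4",
    "profilehmm/hmmer-3.4/bin/uniref100.fasta",
    "msagen.yml",
]


def missing_prerequisites(root="."):
    """Return the required paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

An empty list means every path was found; it does not verify that hmmer itself runs or that the database is complete.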

For an example of how to run hmmer through EVcouplings, refer to scripts/msagen.py.

The MSA generated from this is located at results/OLF_HUMAN/align/OLF_HUMAN.a2m.

Step 2: First round of filtering

Run `filtering.py` to remove duplicate sequences and drop sequences shorter than 200 residues. This writes the filtered MSA to `OLF_filtered.a2m`.
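The filtering logic described above can be sketched as follows. This is an illustration, not the contents of `filtering.py`: the function names are hypothetical, and it assumes that a sequence's length is its residue count after removing the A2M gap characters `-` and `.`, and that two sequences are duplicates when their ungapped, uppercased residues are identical.

```python
def read_a2m(path):
    """Yield (header, aligned_sequence) pairs from a FASTA/A2M file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_msa(records, min_residues=200):
    """Keep the first copy of each sequence with at least `min_residues` residues."""
    seen = set()
    kept = []
    for header, seq in records:
        # Assumption: gaps are '-' or '.', and case is ignored when comparing.
        ungapped = seq.replace("-", "").replace(".", "").upper()
        if len(ungapped) < min_residues or ungapped in seen:
            continue
        seen.add(ungapped)
        kept.append((header, seq))  # the aligned row is written back unchanged
    return kept


def write_a2m(records, path):
    """Write (header, aligned_sequence) pairs, one sequence per line."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")
```

Under these assumptions, `write_a2m(filter_msa(read_a2m("results/OLF_HUMAN/align/OLF_HUMAN.a2m")), "OLF_filtered.a2m")` would mirror the step. Because the first copy of each sequence is kept, the first row of the input alignment survives whenever it meets the length cutoff.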

Step 3: MMseqs2 clustering and subsampling

We use MMseqs2 to cluster the MSA.

See the main README.md for additional details on running MMseqs2.

This generates the subsampled MSA at subsampled_MSA.a2m.
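As a rough illustration of the subsampling idea, the sketch below reads a two-column cluster table (representative ID, member ID, tab-separated, as in the `_cluster.tsv` file that `mmseqs easy-cluster` writes) and draws a fixed number of members from each cluster. The function names, the one-per-cluster default, and the random seed are assumptions made for this example; the parameters actually used are described in the main README.md.

```python
import random
from collections import defaultdict


def read_clusters(tsv_path):
    """Map each cluster representative to the list of its member IDs.

    Expects two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            rep, member = line.split("\t")[:2]
            clusters[rep].append(member)
    return dict(clusters)


def subsample_ids(clusters, per_cluster=1, seed=0):
    """Draw up to `per_cluster` member IDs from every cluster (hypothetical scheme)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = set()
    for members in clusters.values():
        k = min(per_cluster, len(members))
        chosen.update(rng.sample(members, k))
    return chosen


def select_records(records, keep_ids):
    """Keep (header, sequence) pairs whose ID is in `keep_ids`.

    Assumption: the ID is the first whitespace-delimited token of the header.
    """
    return [(h, s) for h, s in records if h.split()[0] in keep_ids]
```

The selected IDs can then be used to pull the matching aligned rows out of `OLF_filtered.a2m`. If the IDs in the cluster table are formatted differently from the alignment headers, the matching rule in `select_records` would need to be adjusted.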

File Descriptions

| File/Folder | Description |
| --- | --- |
| `fasta/OLF.fasta` | FASTA file for the OLF domain. |
| `scripts/msagen.py` | Script to run `hmmer` through EVcouplings. |
| `scripts/config_files/OLF_HUMAN.txt` | Configuration file specifying `hmmer` parameters. |
| `results/OLF_HUMAN/align/OLF_HUMAN.a2m` | Output MSA from EVcouplings. |
| `filtering.py` | First round of filtering: removes duplicate sequences and drops those shorter than 200 residues. |
| `OLF_filtered.a2m` | Filtered MSA after removing short/duplicate sequences. |
| `subsampled_MSA.a2m` | Subsampled MSA after MMseqs2 clustering. |

For Phylogenetic Analysis

To see the sequences used for phylogeny, refer to:

../phylo/MSA_80cluster.fas
