IGBuddy automates the data extraction and processing pipeline for the Doris Lab by converting and analyzing sequence data. This tool converts .bam files to .fasta, splits .fasta files, indexes them, and extracts sequences based on specific targets using various bioinformatics tools.
To run this tool, the following programs must be installed and accessible in your system's PATH:
- samtools 1.9 - for converting `.bam` files to `.fasta`
- seqtk - for splitting `.fasta` files and retrieving specific sequences
- faToTwoBit - for converting `.fasta` files to `.2bit`
- blat and blatSrc - for aligning sequences to targets
Ensure that these tools are available on your system before starting.
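A quick way to confirm the prerequisites before starting is to look each one up on the PATH. The sketch below is not part of IGBuddy: the helper name `find_missing_tools` and the executable names in `REQUIRED_TOOLS` are assumptions, and the binaries may be installed under different names on your system.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["samtools", "seqtk", "faToTwoBit", "blat"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result means every listed executable was resolved; it does not check tool versions.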
- Clone this repository:

      git clone https://github.com/kyraezikeuzor/ig-buddy.git
      cd ig-buddy

- Prepare a folder structure in the following format:
  - Create a folder named `bio-sample-[number]`.
  - Place the `.bam` file for your sample within this folder.
- Place your strain-specific target files in a designated location (or in the same directory as this tool for ease of use).
- Run the setup functions to initialize the environment variables:

      set_target_options()  # Loads strain-specific target options from `targets.txt`.
      set_fatotwobit_script("/path/to/fatotwobit")  # Set the path to the faToTwoBit script.
      set_blat_script("/path/to/blat")  # Set the path to the BLAT script.
      set_igblast_script("/path/to/igblast")  # Set the path to the IgBlast script.
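The folder layout described in the setup steps can also be prepared with a short script. This is a minimal sketch, not an IGBuddy function: `prepare_sample_folder` and its parameters are hypothetical, and it assumes the sample number is simply appended to the `bio-sample-` prefix.

```python
import shutil
from pathlib import Path


def prepare_sample_folder(bam_path, sample_number, root="."):
    """Create bio-sample-<number> under `root` and copy the .bam file into it."""
    bam_path = Path(bam_path)
    if bam_path.suffix != ".bam":
        raise ValueError(f"Expected a .bam file, got: {bam_path.name}")
    folder = Path(root) / f"bio-sample-{sample_number}"
    # exist_ok=True lets the script be re-run for the same sample.
    folder.mkdir(parents=True, exist_ok=True)
    destination = folder / bam_path.name
    shutil.copy2(bam_path, destination)
    return destination
```

For example, `prepare_sample_folder("sample.bam", 1)` would place a copy of the file at `bio-sample-1/sample.bam` in the current directory.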
- Start the data processing pipeline with your `.bam` file by following these steps:
  - Convert `.bam` to `.fasta`: `convert_bam_to_fasta("sample.bam", "sample.fasta")`
    Converts `.bam` files to `.fasta` format for further processing.
  - Split `.fasta` into smaller chunks: `split_fasta_file("sample.fasta", "output_prefix")`
    Splits the `.fasta` file into 10 smaller files for more manageable analysis.
  - Index `.fasta` files using `faToTwoBit`: `index_fasta_file("chunk_0001.fasta", "chunk_0001.2bit")`
    Converts `.fasta` files to `.2bit` format for faster access by BLAT.
  - Extract sequences of interest using BLAT: `extract_sequences_of_interest("database.2bit", "query.txt", "output.txt")`
    Uses BLAT to extract sequences that match specific targets, outputting them in BLAST format.
  - Extract identifiers with the highest score: `identifiers = extract_identifiers("target_file.txt")`
    Retrieves identifiers of sequences with the highest score based on strain-specific targets.
  - Append identifiers to a master file: `append_list_of_identifiers("master_file.txt", identifiers)`
    Adds identifiers to a specified master file for further analysis.
  - Retrieve sequences that match specific identifiers: `match_sequences("sample.fasta", "identifiers.txt", "matching_sequences.fasta")`
    Extracts sequences from a `.fasta` file based on a list of identifiers.
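How these functions invoke the external tools is not shown above. As rough orientation only, the sketch below builds the kind of command lines the listed prerequisites are commonly run with for the conversion, indexing, alignment, and retrieval steps. All function names here are hypothetical, and IGBuddy's actual invocations and flags may differ.

```python
import subprocess


def samtools_fasta_cmd(bam_path):
    # `samtools fasta` writes FASTA records to stdout.
    return ["samtools", "fasta", bam_path]


def fatotwobit_cmd(fasta_path, twobit_path, script="faToTwoBit"):
    # faToTwoBit takes the input FASTA followed by the output .2bit path.
    return [script, fasta_path, twobit_path]


def blat_cmd(database, query, output, script="blat"):
    # -out=blast asks BLAT for BLAST-style output instead of the default PSL.
    return [script, database, query, "-out=blast", output]


def seqtk_subseq_cmd(fasta_path, identifiers_path):
    # `seqtk subseq` prints the records whose names are listed in the
    # identifiers file (one name per line) to stdout.
    return ["seqtk", "subseq", fasta_path, identifiers_path]


def run_to_file(cmd, output_path):
    """Run `cmd` and write its stdout to `output_path`; raise on a non-zero exit."""
    with open(output_path, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)
```

For instance, `run_to_file(samtools_fasta_cmd("sample.bam"), "sample.fasta")` would mirror the first step, with the output redirected in Python rather than by a shell.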
Errors during file operations or external command executions will be caught and displayed, making it easier to troubleshoot issues such as missing files or incorrect paths.
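IGBuddy's own error-handling code is not reproduced here. A minimal sketch of the pattern described, assuming external tools are launched through `subprocess`, might look like the following; `run_tool` is a hypothetical helper, not part of the tool's API.

```python
import subprocess


def run_tool(cmd):
    """Run an external command; return True on success, False after printing the error."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError:
        # Raised when the executable itself cannot be found (e.g. a wrong path).
        print(f"Error: executable not found: {cmd[0]}")
        return False
    except subprocess.CalledProcessError as err:
        # Raised when the command runs but exits with a non-zero status.
        print(f"Error: {cmd[0]} exited with status {err.returncode}")
        if err.stderr:
            print(err.stderr.strip())
        return False
    return True
```

Separating the two exceptions distinguishes a tool that is not installed or not on the PATH from one that ran and failed, for example because an input file was missing.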
If you'd like to contribute to this project, please fork the repository and use a feature branch. Pull requests are welcome.
This project is licensed under the MIT License.