Stern-Lab/PyAccuNGS

PyAccuNGS

Introduction

AccuNGS is the main computational pipeline used by SternLab.
This Python implementation supersedes and improves on the previous Perl implementation, which is still available under Stern-Lab/AccuNGS.
In the AccuNGS wet-lab protocol, the sequencing library is constructed to maximize the overlap between the forward and reverse reads of a paired-end Illumina sequencing run. Each base of the original insert is therefore observed twice, which increases the accuracy of basecalling.
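To illustrate why two observations of each base help, here is a minimal sketch of the arithmetic. It is not taken from the pipeline's code: the function names are hypothetical, and it assumes that errors in the two reads are independent and that a wrong call is equally likely to be any of the three other bases.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def same_wrong_base_prob(q_forward: float, q_reverse: float) -> float:
    """Probability that both reads report the SAME incorrect base.

    Assumption (illustrative only): errors are independent between reads and
    uniformly distributed over the 3 incorrect bases, so the probability is
    3 * (p_f / 3) * (p_r / 3) = p_f * p_r / 3.
    """
    p_f = phred_to_error_prob(q_forward)
    p_r = phred_to_error_prob(q_reverse)
    return p_f * p_r / 3


if __name__ == "__main__":
    print("single read, Q30:", phred_to_error_prob(30))
    print("two agreeing reads, Q30 each:", same_wrong_base_prob(30, 30))
```

Under these assumptions, two reads that agree on a base at Q30 each are wrong far less often than a single Q30 read, which is the motivation for maximizing overlap.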

Installation

conda env create --file requirements.txt --name name_of_new_environment

Usage

The AccuNGS pipeline is designed to run locally, making efficient use of the available memory and CPUs, and it also natively supports running on a PBS cluster system. It has three required parameters; all other parameters have default values, which can be changed by editing the file config.ini in the installation directory. The pipeline is designed for fastq files with maximal overlap between the forward and reverse reads, but it can also be run on fastq files without any overlap by using the parameter --overlapping_reads N

Output

todo: explain freqs, consensus, called_bases, filtered things and maybe examples of graphs.

Parameters

Required Parameters

  • -i / --input_dir - Path to a directory containing fastq/fastq.gz files, or subdirectories containing fastq/fastq.gz files.
  • -o / --output_dir - Path to directory where output files will go.
  • -r / --reference_file - Full path to reference fasta file which fastq files will be aligned against using BLAST.
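As a rough illustration of how the three required parameters fit together, the sketch below declares them with Python's standard argparse module. This is an assumption about structure, not the pipeline's actual parser: the function name build_parser and the help strings are illustrative, and only the flag names come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three required parameters."""
    parser = argparse.ArgumentParser(description="Sketch of the required AccuNGS parameters")
    parser.add_argument("-i", "--input_dir", required=True,
                        help="directory containing fastq/fastq.gz files")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="directory where output files will go")
    parser.add_argument("-r", "--reference_file", required=True,
                        help="full path to the reference fasta file")
    return parser


if __name__ == "__main__":
    # Example values only; replace with real paths.
    args = build_parser().parse_args(["-i", "reads", "-o", "out", "-r", "ref.fasta"])
    print(args.input_dir, args.output_dir, args.reference_file)
```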

Basecall Parameters

  • -m / --max_basecall_iterations - Maximum number of Process loop iterations (see Process).
  • -or / --overlapping_reads - Y/N/P, run the pipeline with, without, or with partially overlapping reads. Y ignores all bases without an overlap, N assumes the reads are independent and ignores overlap completely, and P uses all available information.
  • -qt / --quality_threshold - Filter out all nucleotides with phred score lower than this.
  • -mc / --min_coverage - Positions with coverage lower than this will be replaced by N's in the consensus.
  • -mf / --min_frequency - Positions with frequency lower than this will be replaced by N's in the consensus.
  • -ar / --align_to_ref - Y/N, align consensus to original reference.
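The interaction between --quality_threshold, --min_coverage and --min_frequency can be sketched for a single position as follows. This is a simplified illustration, not the pipeline's implementation: the function name and the input format (a list of (base, phred) pairs) are assumptions, and it assumes that the frequency checked is that of the most common base among the bases that pass the quality filter.

```python
from collections import Counter


def call_consensus_base(observations, quality_threshold, min_coverage, min_frequency):
    """Return the consensus base for one position, or 'N' if it is not supported.

    observations: list of (base, phred) tuples (hypothetical input format).
    """
    # Filter out nucleotides whose phred score is lower than the threshold.
    kept = [base for base, phred in observations if phred >= quality_threshold]
    coverage = len(kept)
    if coverage == 0 or coverage < min_coverage:
        return "N"
    # Assumption: the consensus candidate is the most frequent remaining base.
    base, count = Counter(kept).most_common(1)[0]
    if count / coverage < min_frequency:
        return "N"
    return base


if __name__ == "__main__":
    obs = [("A", 38), ("A", 35), ("G", 12), ("A", 40)]
    print(call_consensus_base(obs, quality_threshold=30, min_coverage=3, min_frequency=0.8))
```

In the example, the low-quality G is discarded first, so the position is judged on the three remaining A observations only.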

BLAST Parameters

The pipeline allows control of some of BLAST's parameters in order to get a better alignment. For an in-depth view of all of BLAST's parameters, see BLAST's documentation.

  • -bt / --blast_task - BLAST's task parameter.
  • -be / --blast_evalue - BLAST's evalue parameter.
  • -bd / --blast_dust - BLAST's dust parameter.
  • -bn / --blast_num_alignments - BLAST's num_alignments parameter.
  • -bp / --blast_perc_identity - BLAST's perc_identity parameter.
  • -bs / --blast_soft_masking - BLAST's soft_masking parameter.

Efficiency Parameters

  • -c / --cleanup - Y/N, remove redundant basecall directory when done.
  • -cc / --cpu_count - Max number of CPUs to use (0 means all).
  • -mm / --max_memory - Limit memory usage to this many megabytes (0 uses the memory available at the beginning of execution).

Running on a PBS cluster

todo: runner, pbs_runner, pbs_multi_runner, project_runner & config.ini

Algorithm Overview

The pipeline is divided into 4 stages, each having its own .py file:
I - Prepare Data - Prepare files for efficient parallel processing
II - Process - Parallel run BLAST and basecall on each of the outputs of the previous step
III - Aggregate - Aggregate outputs of parallel runs (aggregation.py)
IV - Summarize - Draw some graphs and a short text summary

Prepare Data

Filename: data_preperation.py
Takes a directory containing fastq/gz files, or a directory containing subdirectories with fastq/gz files, and outputs fastq files which are ready for processing.
If parameter --overlapping_reads / -or is set to 'Y' or 'P' (partial), the opposing reads will be merged into a single file.
Based on the values of --max_memory / -mm and --cpu_count / -cc, the pipeline divides the input files into an efficient number of files to run simultaneously. If the values are not given, the pipeline tries to estimate these numbers according to the currently available system resources.
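The idea of dividing the input into parts for parallel processing can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the memory limit is ignored for brevity, and the real pipeline may choose the number of parts differently. The treatment of 0 as "all CPUs" follows the description of --cpu_count.

```python
import os


def resolve_cpu_count(cpu_count: int) -> int:
    """Interpret 0 as 'use all CPUs', as described for --cpu_count."""
    if cpu_count > 0:
        return cpu_count
    return os.cpu_count() or 1


def split_evenly(items, parts):
    """Split a list into at most `parts` contiguous chunks of near-equal size."""
    parts = max(1, min(parts, len(items))) if items else 1
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    records = [f"read_{n}" for n in range(10)]  # stand-ins for fastq records
    for chunk in split_evenly(records, resolve_cpu_count(3)):
        print(len(chunk), chunk)
```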

Process

Filename: processing.py
Runs BLAST and then basecall, in parallel, on each of the files created in stage I. BLAST is a local alignment tool used to align the reads with the reference file given by parameter --reference_file / -r. All the relevant BLAST variables can be set with their corresponding parameters. Basecall loops over every nucleotide and decides whether to filter it out based on the relevant parameters. After the parallel run on all stage I output files is complete, a freqs file and a consensus alignment are created. If the consensus and the reference files are identical, the pipeline continues to stage III. If they are not identical, stage II is repeated with the newly created consensus file given as the new reference file. This loop continues until the consensus converges fully or the maximum number of iterations set by --max_basecall_iterations / -m is reached.
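The convergence loop described above can be sketched as follows. This is a schematic outline, not the code in processing.py: iterate_until_converged is a hypothetical name, and build_consensus stands in for the whole "BLAST, basecall, build consensus" step as a callable that maps a reference sequence to a consensus sequence.

```python
def iterate_until_converged(reference, build_consensus, max_iterations):
    """Repeat the Process step until the consensus equals the reference.

    build_consensus: hypothetical callable, reference -> consensus.
    Returns (final_sequence, iterations_run, converged).
    """
    for iteration in range(1, max_iterations + 1):
        consensus = build_consensus(reference)
        if consensus == reference:
            # Consensus and reference are identical: continue to stage III.
            return consensus, iteration, True
        # Otherwise the new consensus becomes the reference for the next round.
        reference = consensus
    return reference, max_iterations, False


if __name__ == "__main__":
    target = "ACGT"

    def toy_consensus(ref):
        # Toy stand-in: corrects one mismatching position per call.
        for i, (a, b) in enumerate(zip(ref, target)):
            if a != b:
                return ref[:i] + b + ref[i + 1:]
        return ref

    print(iterate_until_converged("AAAA", toy_consensus, max_iterations=10))
```

Note that convergence is only detected on the iteration after the last change, since that is the first time the consensus matches the reference it was built from.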

Aggregate

Filename: aggregation.py
Aggregates and cleans up outputs of stage II and creates the main output files.

Summarize

Filename: summarize.py
Creates a graphical and textual summary of the output.
