AccuNGS is the main computational pipeline used by SternLab.
This Python implementation supersedes and improves on the previous Perl implementation, which is still available under Stern-Lab/AccuNGS.
In the AccuNGS wet-lab protocol, the sequencing library is prepared so that the forward and reverse reads of a paired-end Illumina sequencing run overlap as much as possible. Each base of the original insert is therefore observed twice, which increases the accuracy of basecalling.
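The benefit of observing each base twice can be illustrated with a small sketch. It assumes the forward and reverse reads span exactly the same insert and have equal length, which is a simplification. The function name `merge_overlapping_pair` and the rule of masking disagreements with `N` are illustrative assumptions, not a description of what AccuNGS does internally.

```python
# Illustrative only: combine two observations of the same insert.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_overlapping_pair(forward, reverse):
    """Keep a base only when both reads agree on it; otherwise emit 'N'.

    Assumes both reads cover the same insert end to end (full overlap).
    """
    reverse_on_forward_strand = reverse_complement(reverse)
    if len(forward) != len(reverse_on_forward_strand):
        raise ValueError("this sketch only handles fully overlapping reads")
    return "".join(
        f if f == r else "N"
        for f, r in zip(forward, reverse_on_forward_strand)
    )


# The reverse read is the reverse complement of the forward read: full agreement.
agreeing = merge_overlapping_pair("ACGTTA", "TAACGT")
# One base differs between the two observations, so that position is masked.
disagreeing = merge_overlapping_pair("ACGTTA", "TAACGA")
```

With this rule, a sequencing error goes unnoticed only if the same wrong base appears at the same position in both reads.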
conda env create --file requirements.txt --name name_of_new_environment
The AccuNGS pipeline is designed to run locally, using the available memory and CPUs efficiently, and it also natively supports running on a PBS cluster system. AccuNGS has three required parameters; all other parameters have default values, which can be changed by editing the file config.ini in the installation directory. The pipeline is designed to run on fastq files with maximal overlap between the forward and reverse reads, but it can also be run on fastq files without any overlap by using the parameter --overlapping_reads N.
todo: explain freqs, consensus, called_bases, filtered things and maybe examples of graphs.
-i / --input_dir - Path to a directory containing fastq/fastq.gz files, or sub directories containing fastq/fastq.gz files.
-o / --output_dir - Path to the directory where output files will be written.
-r / --reference_file - Full path to the reference fasta file against which the fastq files will be aligned using BLAST.
-m / --max_basecall_iterations - Maximum number of Process loop iterations (see Process).
-or / --overlapping_reads - Y/N/P: run the pipeline with, without, or with partially overlapping reads. Y ignores all bases without an overlap, N assumes reads are independent and ignores overlap completely, and P uses all available information.
-qt / --quality_threshold - Filter out all nucleotides with a Phred score lower than this.
-mc / --min_coverage - Positions with less than this coverage are replaced by N's in the consensus.
-mf / --min_frequency - Positions with less than this frequency are replaced by N's in the consensus.
-ar / --align_to_ref - Y/N: align the consensus to the original reference.
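The effect of the quality, coverage and frequency thresholds can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, quality strings are assumed to use the common Phred+33 encoding, "frequency" is taken to mean the share of the most common base at a position, and a value exactly equal to a threshold is treated as passing.

```python
from collections import Counter


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integers."""
    return [ord(char) - offset for char in quality_string]


def filter_bases(sequence, quality_string, quality_threshold):
    """Return (position, base) pairs whose Phred score meets the threshold."""
    scores = phred_scores(quality_string)
    return [
        (pos, base)
        for pos, (base, score) in enumerate(zip(sequence, scores))
        if score >= quality_threshold
    ]


def call_consensus(base_counts_per_position, min_coverage, min_frequency):
    """Build a consensus string, writing 'N' where support is too weak."""
    consensus = []
    for counts in base_counts_per_position:
        coverage = sum(counts.values())
        if coverage == 0 or coverage < min_coverage:
            consensus.append("N")  # too few reads cover this position
            continue
        base, count = counts.most_common(1)[0]
        # Keep the majority base only if it is frequent enough.
        consensus.append(base if count / coverage >= min_frequency else "N")
    return "".join(consensus)


# '#' encodes Phred 2 and 'I' encodes Phred 40 under Phred+33.
kept = filter_bases("ACGT", "I#II", quality_threshold=30)

counts = [Counter(A=98, G=2), Counter(C=3), Counter(T=60, C=40)]
consensus = call_consensus(counts, min_coverage=10, min_frequency=0.8)
```

In this toy input the first position is well supported, the second has too little coverage, and the third has no base reaching the frequency threshold, so the last two become N.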
The pipeline allows control of some of BLAST's parameters in order to get a better alignment. For an in-depth view of all of BLAST's parameters, see BLAST's documentation.
-bt / --blast_task - BLAST's task parameter.
-be / --blast_evalue - BLAST's evalue parameter.
-bd / --blast_dust - BLAST's dust parameter.
-bn / --blast_num_alignments - BLAST's num_alignments parameter.
-bp / --blast_perc_identity - BLAST's perc_identity parameter.
-bs / --blast_soft_masking - BLAST's soft_masking parameter.
-c / --cleanup - Y/N: remove the redundant basecall directory when done.
-cc / --cpu_count - Maximum number of CPUs to use (0 means all).
-mm / --max_memory - Limit memory usage to this many megabytes (0 uses the memory available at the beginning of execution).
todo: runner, pbs_runner, pbs_multi_runner, project_runner & config.ini
The pipeline is divided into four stages, each with its own .py file:
I - Prepare Data - Prepare files for efficient parallel processing
II - Process - Parallel run BLAST and basecall on each of the outputs of the previous step
III - Aggregate - Aggregate outputs of parallel runs (aggregation.py)
IV - Summarize - Draw some graphs and a short text summary
Filename: data_preperation.py
Takes a directory containing fastq/gz files or a directory containing sub directories containing fastq/gz files and outputs fastq files which are ready for processing.
If parameter --overlapping_reads / -or is set to 'Y' or 'P' (partial) the opposing reads will be merged into a single file.
Based on the values of --max_memory / -mm and --cpu_count / -cc, the pipeline divides the input files into an efficient number of files to run simultaneously. If these values are not given, the pipeline tries to estimate them from the currently available system resources.
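One way to think about this trade-off is sketched below. The heuristic is hypothetical and is not taken from the AccuNGS source: `plan_chunks`, `resolve_cpu_count` and the `memory_per_input_mb` ratio are names and assumptions introduced only for illustration. The sketch also requires an explicit memory limit rather than estimating the available memory.

```python
import math
import os


def resolve_cpu_count(cpu_count):
    """Treat 0 as 'use every CPU the operating system reports'."""
    return cpu_count if cpu_count > 0 else (os.cpu_count() or 1)


def plan_chunks(total_input_mb, cpu_count, max_memory_mb, memory_per_input_mb=1.0):
    """Pick how many pieces to split the input into.

    Hypothetical heuristic: use at least one piece per CPU, and enough pieces
    that one piece per CPU, processed at the same time, fits in max_memory_mb.
    memory_per_input_mb is an assumed ratio of working memory to input size.
    """
    if max_memory_mb <= 0:
        raise ValueError("this sketch needs an explicit, positive memory limit")
    cpus = resolve_cpu_count(cpu_count)
    # Largest input piece each worker can hold without exceeding the limit.
    largest_piece_mb = max_memory_mb / (cpus * memory_per_input_mb)
    pieces_for_memory = math.ceil(total_input_mb / largest_piece_mb)
    return max(cpus, pieces_for_memory)


small_job = plan_chunks(total_input_mb=1000, cpu_count=4, max_memory_mb=2000)
large_job = plan_chunks(total_input_mb=8000, cpu_count=4, max_memory_mb=2000)
```

For the small job the CPU count decides the number of pieces; for the large job the memory limit forces a finer split.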
Filename: processing.py
Runs BLAST and then basecall, in parallel, on each of the files created in stage I. BLAST is a local alignment tool used to align the reads to the reference file given by the parameter --reference_file / -r. All the relevant BLAST variables can be set with their corresponding parameters. Basecall loops over every nucleotide and decides whether to filter it out based on the relevant parameters. After the parallel run on all stage I output files is complete, a freqs file and a consensus alignment are created. If the consensus and the reference files are identical, the pipeline continues to stage III. If they are not identical, stage II is repeated with the newly created consensus file as the new reference file. This loop continues until the consensus fully converges or the maximum number of iterations set by --max_basecall_iterations / -m is reached.
Filename: aggregation.py
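The convergence loop described above can be sketched as follows. This is an outline of the control flow only: `build_consensus` is a hypothetical callable standing in for one full pass of stage II (BLAST, basecall and consensus creation), and `toy_pass` is a made-up stand-in used to make the example runnable.

```python
def iterate_until_converged(reference, build_consensus, max_basecall_iterations):
    """Repeat stage II until the consensus equals the reference it was built from.

    Returns the final consensus and the number of passes performed.
    """
    for iteration in range(1, max_basecall_iterations + 1):
        consensus = build_consensus(reference)
        if consensus == reference:
            return consensus, iteration  # converged
        reference = consensus  # next pass aligns against the new consensus
    return reference, max_basecall_iterations  # iteration limit reached


def toy_pass(reference, truth="ACGTACGT"):
    """Stand-in for stage II: fix the first position that differs from truth."""
    for i, (r, t) in enumerate(zip(reference, truth)):
        if r != t:
            return reference[:i] + t + reference[i + 1:]
    return reference


converged, passes = iterate_until_converged("AAGTACCT", toy_pass, 10)
capped, capped_passes = iterate_until_converged("AAGTACCT", toy_pass, 1)
```

The starting reference differs from the toy truth at two positions, so two passes change the consensus and a third confirms that it no longer changes. With a limit of one iteration, the loop stops early and returns the partially corrected sequence.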
Aggregates and cleans up outputs of stage II and creates the main output files.
Filename: summarize.py
Creates a graphical and textual summary of the output.