pg_gen is a Snakemake workflow that takes PacBio subread BAM files as input, and generates FASTQ files containing CCS (Circular Consensus Sequencing) reads. It uses standard PacBio software from the pbbioconda / IsoSeq software suite. pb_gen runs only on Linux.
Move to the directory where you intend to run pb_gen and clone the source from GitHub:
git clone https://github.com/julienlag/pb_gen.git
This workflow uses Snakemake and Conda environments. Therefore, a working installation of conda and Snakemake is a prerequisite. The Snakemake/Conda setup ensures that all software pg_gen depends on will be installed automatically by Snakemake in an isolated environment the first time you run the workflow.
If using DRMAA (Distributed Resource Management Application API), you should install its Python bindings on your system (in the same environment as the one in which you're running Snakemake):
conda install -c anaconda drmaa
As with any other Snakemake workflow, the command issued will depend on your computing environment. Here's an example which runs well in a UGE/DRMAA cluster environment, in my experience ({variables} in curly braces are expanded by Snakemake):
snakemake -p --reason --latency-wait 100 --use-conda -s master.smk -j 4500 \
--configfile config_pacBio_AlzheimerBrain_ccs.json `# change according to your needs` \
--jobname {rulename}.{jobid}._pb_gen `# job naming rule for the HPC scheduler` \
--cluster-config cluster_config.json `# cluster configuration (queue names, job resource requirements etc.). This file is not provided in the current repo` \
--max-jobs-per-second 10 \
--drmaa " -V -q {cluster.queue} -l disk={cluster.disk} -l virtual_free={cluster.virtual_free} -l h_rt={cluster.h_rt}  -o {cluster.out} -e {cluster.err} {cluster.threads} -P {cluster.project}" \
--rerun-incomplete --keep-going  --show-failed-logs
Mandatory input files include:
- 
Subread BAM files should be placed in a subdirectory named raw/and be named according to the following scheme:<runId>.subreads.bamwhere <runId>is an arbitrary unique string identifying each sequencing run.
- 
FASTA file of adapter sequences: see PB_ADAPTconfig variable in Configuration variables below).
One gzipped FASTQ file is produced per input BAM file. The output FASTQ files are written in the ccs/fastq/ subdirectory and are named according to the following scheme:
<runId>.min-rq<minRQpostCcs>.min-passes<minPasses>.fastq.gz
<runId> is the basename of the corresponding input subread BAM, while <minRQpostCcs> and <minPasses> represent the min-rq (Minimum predicted accuracy, default 0.99) and min-passes (Minimum number of full-length subreads required to generate a CCS read, default 1) options passed to the IsoSeq software. These last two variables are currently hardcoded.
The output files generated by each intermediate rule are removed after all rules that use it as input are completed (see relevant Snakemake documentation here). Most report and summary files generated along the way by IsoSeq programs are kept, though.
These can be specified via the --config or --config-file Snakemake options (see Snakemake CLI documentation here).
- 
PB_ADAPT: Path to the FASTA file containing the 5' and 3' adapter sequences used in the cDNA library preparation.PB_ADAPTis used in rulesremovePrimers(limaexecution) andremoveChimeras(isoseq refineexecution). These primer sequences should be identified asprimer_5pandprimer_3pin the FASTA file, respectively. See example file namedpb_adapters.fa(corresponding to PacBio's Alzheimer Brain dataset) in the current repository.
- 
TMPDIR: directory to write temporary files to during job executions. If in doubt, set it to/tmp/. In some cluster environments, it's useful to set it to$TMPDIR, which is dynamically assigned to each job by the scheduler.
The pb_gen processing pipeline is pretty close to PacBio's recommendations, with some tweaks. The goal of these modifications is to make the output CCS FASTQs compliant with LyRic, our long-read transcriptome analysis workflow.
The most notable differences between pb_gen and the standard IsoSeq pipeline are:
- pb_gendoes not trim poly(A) tails from the reads.
- pb_genskips the clustering step (- isoseq3 cluster). Inter-read clustering is considered optional. Clustering reads into non-redundant transcript models is performed by LyRic at a later stage, based on their alignment to the reference genome.