LongQC is a tool for quality control of PacBio and ONT long reads. It has two functionalities: sample QC and platform QC.
Sample QC: this accepts standard sequence file formats (FASTQ, FASTA, and subread BAM from PacBio sequencers) and lets users check whether their data are ready for analysis. You don't need a reference, meta information, or even quality scores.
Platform QC: this extracts fundamental statistics for a run, such as length and productivity for PacBio, and provides some plots for checking productivity for ONT.
Long reads from third-generation sequencers have a high error rate (~15%) and a quite different technological background from NGS (e.g. there is no equivalent to Illumina's cycle). In addition, base-level quality scores fluctuate occasionally, are too high even for noisy reads, or are completely unavailable in Sequel. Therefore, QC has typically been done by mapping reads back to a reference and inspecting the error profile. However, this approach depends on a reference, which can be a problem: a reference is not always available. Besides, as is typical in a production lab, finding an appropriate reference is not a trivial task. If one selects a very close but distinct reference from public databases, the statistics can be quite different (what you will see is the sum of evolutionary divergence and error).
LongQC was developed to overcome such situations, and it gives you a first insight into your data within a short period of time.
A docker image is maintained, and we recommend running LongQC on docker. Having said that, if you want to set it up manually, kindly follow the steps below.
LongQC is written in python3 and depends on the following popular python libraries:
- numpy
- scipy
- matplotlib
- scikit-learn
- pandas
- jinja2
Anaconda should be the easier choice. We recommend anaconda3; then install the dependencies below using conda.
a) conda install pysam
b) conda install edlib
conda install python-edlib
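Before running LongQC, it can help to confirm that the libraries above are importable from the interpreter you plan to use. The following is a minimal sketch, not part of LongQC: the helper name `missing_modules` is hypothetical, and the list assumes the usual import names (scikit-learn imports as `sklearn`, python-edlib as `edlib`).

```python
from importlib.util import find_spec

# Import names for the libraries listed above (assumption: scikit-learn is
# imported as "sklearn" and python-edlib as "edlib").
REQUIRED_MODULES = [
    "numpy", "scipy", "matplotlib", "sklearn",
    "pandas", "jinja2", "pysam", "edlib",
]

def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

An empty result only means the modules can be located; it does not check versions.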
A modified version of minimap2 named minimap2-coverage is also required. If you are a Mac user, you have to prepare a library that provides argp.h.
cd /path/to/download/
git clone https://github.com/yfukasawa/LongQC.git
cd LongQC/minimap2_mod && make extra
Then, change the variable below in longQC.py.
path_minimap2 = "/path/to/minimap2_mod/"
Alternatively, put both minimap2-coverage and sdust in a directory on your PATH.
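If you go the PATH route, a quick check that both executables resolve can save a failed run. This is a small sketch using only the standard library; the helper `find_tools` is hypothetical and not part of LongQC.

```python
import shutil

# Executable names taken from the text above.
REQUIRED_TOOLS = ("minimap2-coverage", "sdust")

def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```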
Argp has to be installed. Using homebrew seems to be the easiest way.
brew install argp-standalone
See the Dockerfile in this repository. All dependencies will be resolved automatically. I tested the docker image of LongQC on both Linux and Mac.
docker build -t longqc --build-arg USER="foo" .
The above command simply builds a new image named longqc. You can set the username in the docker environment by passing it to USER. The above example uses foo.
docker run -it --rm -v /path/to/shared_dir/:/data longqc
This runs a container from the image built by the previous command. The container uses /data as its default workspace, and the above command mounts shared_dir on the host at /data in the container.
python longQC.py sampleqc -x pb-rs2 -o out_dir input_reads.(fa,fq)
python longQC.py sampleqc -x pb-sequel -o out_dir input_reads.bam
python longQC.py sampleqc -x ont-ligation -o out_dir input_reads.fq
python longQC.py sampleqc -x ont-rapid -o out_dir input_reads.fq
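When running many samples, it may be convenient to assemble these command lines in a script. The sketch below builds the argument list for the invocations shown above; `build_sampleqc_command` is a hypothetical helper, it only uses the `-x`, `-o` and `-p` options documented in this README, and it assumes `longQC.py` is in the current directory.

```python
import shlex

# Preset names as listed for the -x option in this README.
PRESETS = {"pb-rs2", "pb-sequel", "ont-ligation", "ont-rapid", "ont-1dsq"}

def build_sampleqc_command(preset, out_dir, reads, ncpu=None):
    """Return the argv list for a sampleqc run (hypothetical helper)."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    cmd = ["python", "longQC.py", "sampleqc", "-x", preset, "-o", out_dir]
    if ncpu is not None:
        cmd += ["-p", str(ncpu)]  # number of CPUs
    cmd.append(reads)
    return cmd

cmd = build_sampleqc_command("ont-ligation", "out_dir", "input_reads.fq", ncpu=4)
print(shlex.join(cmd))
```

To actually launch the run, the list can be passed to `subprocess.run(cmd, check=True)`.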
- input: Either fasta, fastq or PacBio BAM formatted file is required. Input file is expected to be ready for analysis or have at least 5x coverage.
- -o or --output: specify a path for output.
- -x or --preset: specify a platform/kit to be evaluated. Adapter and some ovlp parameters are automatically applied. (pb-rs2, pb-sequel, ont-ligation, ont-rapid, ont-1dsq)
- -t or --transcript: applies the preset for transcripts.
- -n or --n_sample
- -s or --sample_name: sample name is added as a suffix for each output file.
- -c or --trim_output: path for trimmed reads. If this is not given, trimmed reads won't be saved.
- --adapter_5 ADP5: adapter sequence for 5'.
- --adapter_3 ADP3: adapter sequence for 3'.
- -a or --accurate: this turns on the more sensitive setting. More accurate but slower.
- -p or --ncpu: the number of cpus for LongQC analysis.
- -d or --db: make minimap2 db in parallel to other tasks.
- -m or --mem: memory limit for chunking. Please specify in gigabytes. Default is 0.5 Gbytes. [>0 and <=2]
- -i or --index: give index size for minimap2 (-I) in bp. Reduce when running on a small memory machine. Default is 4G.
- --pb: sample data from PacBio sequencers. This option will be overwritten by -x.
- --sequel: sample data from Sequel of PacBio. This option will be overwritten by -x.
- --ont: sample data from ONT sequencers. This option will be overwritten by -x.
Report files show traditional statistics such as length and GC content in json and html summaries. In addition, they summarize coverage and the fraction of reads that have no coverage, which we named non-sense reads. If this fraction is very high, it indicates either that 1) sequencing had some issues or 2) coverage is simply insufficient.
non-sense reads: these are similar to unmappable reads; however, mappability depends on the reference. For example, some contaminating DNA might not map to the reference you intend, and it would be called unmappable. However, these reads would map to a reference of their origin if one were available. Non-sense reads are artifacts generated by sequencers or extremely erroneous data. Therefore, conceptually, they cannot be mapped to any reference.
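To make the definition concrete, the fraction of non-sense reads can be thought of as the share of reads whose coverage is zero. The sketch below illustrates that idea only; the function name and the per-read coverage values are hypothetical, and LongQC's own computation may differ in detail.

```python
def nonsense_fraction(coverages):
    """Fraction of reads with zero coverage (illustrative definition)."""
    if not coverages:
        raise ValueError("no reads given")
    zero = sum(1 for c in coverages if c == 0)
    return zero / len(coverages)

# Hypothetical per-read coverage values: two of eight reads have no coverage.
example = [12, 0, 9, 15, 0, 11, 10, 8]
print(f"{nonsense_fraction(example):.2%}")  # prints 25.00%
```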
| High dispersion case | Normal dispersion case |
|---|---|
This is actually not bad data, and that's why LongQC returns only warnings. The PacBio data on the left has somewhat higher dispersion against the reference, and this is why the median of depth fluctuates.
| Iso-seq | 1D^2 |
|---|---|
Due to the error rate, finding adapter (and barcode) sequences is not easy in third-generation data. Coverage analysis provides an overview of artificial sequences in the flanking regions. ONT data show different peaks for the ligation, rapid, and 1D^2 kits. In general, PacBio data show no characteristic distribution; however, Iso-seq data show wavy plots due to primer sequences at both ends.
It is expected that a low-quality dataset has a high fraction of non-sense reads. The rationale is simple: a highly erroneous read cannot be mapped to any other reads. The above example uses simulated data, and in theory all of the reads have origins in the reference.
| Length plot | GC content plot | Coverage vs length plot |
|---|---|---|
These are the plots for a high-quality public dataset. The overall stats suggest that the data are good, but as you can see, the plots show multimodal distributions. Although this is genome sequencing data from a plant, about 10% of reads mapped well to an E. coli genome. It is noteworthy that they did not map even to organelle genomes. These short unmappable reads range from 3k to 6k, and coverage for this length range shows a spike (the median for this bin is outside the plot area).
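The coverage spike described here is the kind of signal that shows up when coverage is summarized per length bin. The following is a minimal sketch of that summary, assuming you already have (length, coverage) pairs per read; the helper name, the bin width, and the values are all hypothetical and are not taken from LongQC's output.

```python
from collections import defaultdict
from statistics import median

def median_coverage_by_length(reads, bin_width=1000):
    """Group (length, coverage) pairs into length bins and take each bin's median."""
    bins = defaultdict(list)
    for length, coverage in reads:
        bin_start = (length // bin_width) * bin_width
        bins[bin_start].append(coverage)
    return {start: median(values) for start, values in sorted(bins.items())}

# Hypothetical reads: the 3000-3999 bin has much higher coverage than the others.
reads = [(1500, 10), (1800, 12), (3200, 95), (3900, 105), (7400, 11), (7600, 9)]
print(median_coverage_by_length(reads))
```

A bin whose median stands far above its neighbours, as the 3000 bin does in this toy input, is worth checking for contamination.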
SMRT Portal, SMRT Link and some third-party tools for ONT can provide similar plots and stats. This subcommand generates equivalent output for users who do not have access to such servers/tools.
minimap2 was originally developed by Heng Li and is licensed under MIT. mix'EM was developed by Stefan Seemayer and is licensed under MIT. Yoshinori slightly modified their code. The LongQC code is licensed under MIT.