MicrobiomeAnalyst 2.0
Comprehensive statistical, functional and
integrative analysis of microbiome data

xialab@mcgill 2023-Mar-02
Tutorial for Raw Data Processing
Background
• Amplicon sequencing has enabled comprehensive profiling of microbial communities, bypassing
traditional wet lab culturing methods.

• Traditional Operational Taxonomic Unit (OTU) picking methods work by clustering sequences
based on a similarity threshold (usually around 97%). However, because the clustering threshold
is arbitrary, this approach tends to carry sequencing errors into the resulting OTUs.

• The Divisive Amplicon Denoising Algorithm (DADA) was introduced to improve the accuracy of
amplicon sequence variant (ASV) inference from high-throughput sequencing data.

• DADA2 uses a statistical model-based approach that corrects these incorporated errors and
infers higher quality and more accurate ASVs, which help improve our understanding of complex
and previously understudied microbial ecosystems.
Overview
• Goal: To provide a user-friendly web-based platform for the raw data processing of marker gene
sequencing data of microbial communities.

• Workflow:
o Filtration and trimming of reads based on quality profiles.
o Estimation of error rates.
o Dereplication to filter unique sequences and denoising for inference of sequence variants.
o Merging of forward and reverse reads by overlapping.
o Chimera removal to filter spuriously formed reads.
o Taxonomy assignment from a chosen database.

• Data requirements:
o Demultiplexed individual fastq files with no primers or any other non-biological nucleotides.
o For paired-end data, the forward and reverse fastq files should have matching ordered names with “_R1”
for forward reads and “_R2” for reverse reads, as shown in the example data.
o Additionally, a metadata file indicating the groups is required to facilitate a streamlined input into the
other MicrobiomeAnalyst modules.

• Other considerations for paired-end data:
o What is the length of the forward and reverse reads? E.g., 2x200bp.
o Which region of the 16S rRNA gene was sequenced (e.g., V4, V3-V4), and what were your primer
lengths?
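
The "_R1"/"_R2" naming requirement above can be checked locally before uploading. The following is a minimal sketch, not part of MicrobiomeAnalyst: the function name pair_fastq_files and the example file names are hypothetical, and it assumes each sample is identified by whatever remains of the file name once the "_R1" or "_R2" tag is removed.

```python
from pathlib import PurePath


def pair_fastq_files(filenames):
    """Group FASTQ file names into forward/reverse pairs by their _R1/_R2 tag.

    Returns (pairs, unpaired): pairs maps a sample key to a
    (forward, reverse) tuple; unpaired lists names without a partner.
    """
    forward, reverse, unpaired = {}, {}, []
    for name in filenames:
        base = PurePath(name).name
        if "_R1" in base:
            forward[base.replace("_R1", "", 1)] = name
        elif "_R2" in base:
            reverse[base.replace("_R2", "", 1)] = name
        else:
            # Neither tag present: cannot be matched to a partner.
            unpaired.append(name)

    shared = sorted(forward.keys() & reverse.keys())
    pairs = {key: (forward[key], reverse[key]) for key in shared}
    unpaired += [forward[k] for k in sorted(forward.keys() - reverse.keys())]
    unpaired += [reverse[k] for k in sorted(reverse.keys() - forward.keys())]
    return pairs, unpaired


# Hypothetical file names, for illustration only.
files = ["S1_R1.fastq.gz", "S1_R2.fastq.gz", "S2_R1.fastq.gz"]
pairs, unpaired = pair_fastq_files(files)
```

Any name that ends up in unpaired would need to be fixed before upload, since the pipeline expects every forward file to have a matching reverse file.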
Data Upload:

Click “Select” to start uploading your .zip/.fastq.gz files.

Notes:
• You can choose to upload multiple sequence files at once, but please upload all files at one
time to avoid any potential exceptions caused by internet connection issues.
• A metadata file is necessary for the downstream analysis.

Proceed to the Data Integrity Check.

Submit to try our example here.
Data Integrity Check:

Each column gives information about the fastq files submitted.

• For paired-end data, cross-check that each forward read has a corresponding reverse read.
• Check if the groups are named correctly.
• The corresponding R script can be downloaded from here.

Click proceed.
Parameter Settings:

This is the most critical step of the entire pipeline, where the read quality profiles need to be
examined to determine the filtering and trimming parameters.

• Marker gene: select the type of marker gene used: 16S for bacteria, 18S for eukaryotes, and
ITS for fungi.
• Truncation length: choose the cut-off length for the forward and reverse reads based on the
quality profile (see below). This will truncate the reads to a maximum length, maintaining reads
of uniform length, which is important during taxonomy assignment.
• TrimLeft and TrimRight: these are used to trim low quality bases on the 5’ end (TrimLeft) and
3’ end (TrimRight).
• MaxN: determines the number of ambiguous bases allowed. The default is 0, which means no reads
with ambiguous bases are allowed to pass through.
• Max EE: the expected errors cut-off in a read (default = 2).
• MinQ and TruncQ: MinQ filters out reads that contain a base below a minimum quality score, and
TruncQ truncates reads at the first instance of quality dropping below the specified score.
• RemPhix: removes reads that match an Illumina control genome called PhiX. This ensures that
only reads originating from the sample pass through.
• Database: select the database of choice for taxonomy assignment.
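
To make the MaxN and Max EE settings concrete, here is a simplified sketch of how such a filter could be evaluated for a single read. It is not the pipeline's implementation. It assumes Phred+33 encoded quality strings and the usual definition of expected errors as the sum of 10^(-Q/10) over all bases; the function names are hypothetical, and truncation, trimming, MinQ and PhiX removal are left out.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def expected_errors(scores):
    """Expected number of errors in a read: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in scores)


def passes_filter(sequence, quality_string, max_n=0, max_ee=2.0):
    """Return True if the read satisfies the MaxN and Max EE cut-offs."""
    if sequence.upper().count("N") > max_n:
        return False  # too many ambiguous bases
    return expected_errors(phred_scores(quality_string)) <= max_ee


# 'I' encodes Q40 in Phred+33, i.e. an error probability of 0.0001 per base.
ok = passes_filter("ACGT", "IIII")
rejected = passes_filter("ACNT", "IIII")
```

Because low-quality bases contribute most of the expected errors, truncating a read before its quality drops usually lowers its expected error count and lets more reads pass the same Max EE cut-off.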
Quality control:

The quality scores of the raw sequences can be viewed on the Parameter Settings page to help
adjust the parameters.

• Typically, any reads dropping below a quality score of 30 are considered to be low quality and
are trimmed.
• Forward reads tend to have better quality profiles than reverse reads.
• For the forward reads (left panel), the quality drops off slightly at the end, so we will set
the forward trunc length to 240.
• For the reverse reads (right panel), the quality drops off around 170 cycles, so the reverse
trunc length should be set to 170.

Note: In order to ensure overlap of forward and reverse reads, the trunc length parameters depend
on the type of primer used. Refer to the “Other considerations for paired-end data” section in the
Overview.
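
One way to turn a quality profile into a starting truncation length is sketched below. This is only a heuristic illustration under stated assumptions: the input is a hypothetical list of mean quality scores per cycle (which you would have to extract from your own data), the threshold of 30 follows the rule of thumb above, and the function name is made up. The resulting value still has to be checked against the overlap requirement for your primers.

```python
def suggest_trunc_len(mean_quality_by_cycle, min_quality=30):
    """Return the number of leading cycles whose mean quality stays >= min_quality.

    The read would be truncated just before the first cycle that falls
    below the threshold; if none does, the full length is kept.
    """
    for cycle, quality in enumerate(mean_quality_by_cycle):
        if quality < min_quality:
            return cycle
    return len(mean_quality_by_cycle)


# Hypothetical mean qualities for a 6-cycle read, for illustration only.
profile = [38, 37, 35, 31, 29, 25]
trunc_len = suggest_trunc_len(profile)  # 4: cycles 1-4 are at or above Q30
```

Real profiles are noisier than this example, so a brief dip below the threshold early in the read would cut it very short; inspecting the plot by eye remains the safer guide.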
Parameter optimisation:
• Do your results have very few reads passing through? Consider changing the following parameters:
o For multi-V-regions such as V3-V4, the overlap of merged reads is determined as follows:
o For 2x250bp, 16S-341F and 16S-805R primers of the V3-V4 region,
(forward read) + (reverse read) - (length of amplicon) = overlap
250 + 250 - (805-341) = 36
o If the forward read is truncated at 240 and the reverse read is truncated at 150,
240 + 150 - 464 = -74 (no overlap!)
o Thus the parameters should be adjusted accordingly to ensure an overlap of >20nt.
o For the V4 region, there is usually less variability and the parameters can be based directly on
the quality profiles.
o For more information visit https://forum.qiime2.org/t/merging-quality-control-and-overlapping/12618/2
• Do you still find very few reads passing through? Consider increasing the Max EE parameter, which
makes the filtering less stringent, especially for reverse reads. E.g.: Max EE of reverse = 5

• Is the percentage of chimera removal >25%? Check if all non-biological nucleotides such as adapters
and primers were removed properly. Consider trimming your sequences more using the Trim parameters.
If the chimera removal is still high but the number of reads passing through is sufficient, you could
consider moving ahead with the results. More information:
https://forum.qiime2.org/t/loss-of-reads-after-dada2-as-chimeras/9503/2
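
The overlap arithmetic above can be reproduced with a few lines of code when trying out different truncation lengths. This is a minimal sketch; the function names are hypothetical, the amplicon length is taken as the difference between the primer positions (805 - 341) exactly as in the worked example, and the minimum of 20 nt follows the guideline stated above.

```python
def merged_overlap(fwd_len, rev_len, amplicon_len):
    """Overlap (in nt) between forward and reverse reads after truncation."""
    return fwd_len + rev_len - amplicon_len


def has_sufficient_overlap(fwd_len, rev_len, amplicon_len, min_overlap=20):
    """True if the reads overlap by more than min_overlap nucleotides."""
    return merged_overlap(fwd_len, rev_len, amplicon_len) > min_overlap


# V3-V4 region with 16S-341F and 16S-805R primers, as in the example above.
amplicon = 805 - 341  # 464

full_length = merged_overlap(250, 250, amplicon)   # 36
too_short = merged_overlap(240, 150, amplicon)     # -74, reads cannot merge
```

A negative result means the truncated reads no longer reach each other, so the merging step would discard them regardless of their quality.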
Job Status Tracking:

Track the processing status here. The job status will update here in real-time.

The job may take some time to complete, so click “Create Bookmark URL” to save the job link to
check the job status at a later time.

Note: Keep only one active web page open. Multiple tabs/windows will interfere with each other,
leading to unpredictable results.

Once the job is completed, click proceed.
Result:

• Track reads through the pipeline.
• Summary of denoising and chimera removal results.
• Take a look at the % of chimera removal. Refer to the “Parameter optimisation” section if this
is >25%.
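
The 25% guideline can be checked from the read-tracking numbers with a short calculation. This sketch makes an assumption that the tutorial does not spell out: the chimera percentage is taken as the share of reads present before chimera removal that are gone afterwards. The function name and read counts are hypothetical.

```python
def chimera_percentage(reads_before, reads_after):
    """Percentage of reads removed by the chimera removal step."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_before - reads_after) / reads_before


# Hypothetical counts for one sample, for illustration only.
pct = chimera_percentage(reads_before=10000, reads_after=7000)
needs_review = pct > 25  # True here: revisit primer removal and trimming
```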

• Check taxonomy annotation here. It is common to have fewer assignments at the species level with
16S sequencing.
• Input files for the MDP module of MicrobiomeAnalyst.
• Click here to go directly to the marker data profiling module for downstream analysis.
The End
For more information, visit Tutorials, Resources
and Contact pages on www.microbiomeanalyst.ca
Also visit our forum for FAQs on www.omicsforum.ca