The S3 xSeries class attempts to provide a convenient R representation of a
typical RNA-Seq output consisting of a series of sequencing runs, usually each
of them referring to a different RNA sample, from a different biological source.
Building on this, the S3 xModel class addresses the need to group several
independent scientific studies for the purpose of meta-analysis or reanalysis.
To describe this hierarchical data structure, we will use a terminology
primarily modeled after the one adopted by the partners of the International
Nucleotide Sequence Database Collaboration (INSDC) in the context of the
Sequence Read Archive (SRA).
See here for a
friendly explanation of the INSDC and the related SRA data model.
- Run: each sequencing run directly generating a single FASTQ file (or
file pair in the case of non-interleaved PE reads). Within the INSDC
framework, Run metadata objects are referred to by an accession following
the pattern
(E|D|S)RR[0-9]{6,}. - Sample: the set of all Runs from the same biological RNA sample.
The corresponding INSDC metadata objects are Samples (accession
(E|D|S)RS[0-9]{6,}) or, equivalently, BioSamples (accessionSAM(E|D|N)[A-Z]?[0-9]+). Even GEO Sample IDs (accessionGSM[0-9]+) can be found if data were originally brokered to INSDC’s SRA by GEO. - Series: the set of all Samples/Runs pertaining to a given
biological research project or experimental design, across all conditions of
interest to that particular study. This is roughly what INSDC refers to as
Study (accession
(E|D|S)RP[0-9]{6,}) or, equivalently, BioProject (accessionPRJ(E|D|N)[A-Z][0-9]+). If data were originally brokered by GEO or ArrayExpress, GEO Series IDs (accessionGSE[0-9]+) or AE Experiment IDs (accessionE-[A-Z]{4}-[0-9]+) will be also present in SRA DBs and can be used as Study alias accessions, respectively. - Model: a collection of Series dealing with the same biological model. Since this object is characteristic of the meta-analytic level, which by definition involves multiple independent studies, it has no counterpart in INSDC.
Note
Most of the times, Runs and Samples are the same thing, however it could happen that a single RNA Sample is sequenced through multiple Runs (also referred to as 'technical replicates'). Thus, the only ID that is ensured to be unique on a per-file basis (even in the case of technical replicates), is the Run accession. For this reason, it will be used here as the base reference for the construction of xSeries and xModel data structures.
SeqLoader implements S3 xSeries and xModel classes to provide an
integrated representation of RNA-Seq data (i.e., read counts), gene annotation,
and Sample/Run metadata for Series and Models, respectively.
Both new_xSeries and new_xModel SeqLoader constructors assume that
RNA-Seq low-level analysis (i.e., read alignment and transcript abundance
quantification) has already been performed and that the starting point is a
matrix of raw or normalized counts and their related metadata. This means that
xSeries and xModel objects can only be constructed from files containing
such information and not by direct data input.
With the exclusion of a few methods designed to filter and subset structures
([...], pruneRuns, keepRuns), all the other methods of the class access
the data structures in read-only mode, leaving the object completely unchanged.
SeqLoader is indeed designed to facilitate high-level analysis operations
(i.e., downstream of the count matrix) such as data representation or
descriptive or inferential statistical analysis, in both absolute and
differential expression contexts.
These S3 objects are constructed from a gene expression matrix and a metadata
table related to a single study. Within the expression matrix, each row
represents a gene (or transcript), while each column refers to a different
sequencing Run, with the possible addition of one or more columns for gene
annotations (e.g., gene symbol, gene name, gene type, ...). In contrast, the
metadata table is assumed to dedicate one row for each Run and one column for
each metadata of interest, among which a mandatory field for Run accessions.
So, xSeries objects combine information from a typical gene expression matrix
(possibly including gene annotations) with metadata about individual Runs.
More technically, an xSeries object is a named list containing one element for
each Run of the study in question, and an additional element for gene
annotations (which are common to all Runs). Each Run element is itself a
list containing relevant metadata for the specific Run, and a gene data
frame with raw or normalized counts for each gene. The annotation element, on
the contrary, is a simple data frame (see tree graph below). Notice that,
Samples are not explicitly represented in xSeries objects. However, a method
will be implemented to collapse Runs to the Samples they belong in the case
of technical replicates (see issue #4).
Important
For the purpose of a data re-analysis or meta-analysis, one may not
necessarily be interested in all the samples or experimental conditions
included in the the original study. Rather, just a selection of them is
usually considered. For this reason, recomputed count matrices could feature
less columns (i.e., Runs) compared to the number of rows within the complete
metadata table. By construction, xSeries objects are made up of all Runs
for which there exist a metadata entry (i.e., usually all the Runs from the
original study), but only the ones included in the loaded count matrices will
also feature a counts column in the genes data frame. In any case, one can
always use the pruneRuns method to easily remove all the Runs with no
count data from xSeries or xModel objects, reducing them to the sole
columns actually present in the count matrix used for object construction.
Structurally speaking, these S3 objects are just collections (lists) of many
xSeries related to the same biological model. Rather, the focus here is on
functions. Methods implementing generics for this class are primarily meant to
aid in meta-analytic data synthesis (see geneStats in particular).
Here is a graph of a generic xModel made up of n xSeries. The n-th
Series consists of m Runs and the gene annotation element. Each Run
contains a gene dataframe for the counts and an arbitrary number of
metadata, including the mandatory ena_run ID.
Model : xModel
│
├──$ Series_1 : xSeries
├──$ Series_2 : xSeries
├──$ ...
└──$ Series_n : xSeries
│
├──$ Run_1 : list
├──$ Run_2 : list
├──$ Run_3 : list
├──$ ...
├──$ Run_m : list
│ │
│ ├──$ ena_sample_title : chr
│ ├──$ geo_series : chr
│ ├──$ geo_sample : chr
│ ├──$ ena_project : chr
│ ├──$ ena_sample : chr
│ ├──$ ena_run : chr
│ ├──$ read_count : int
│ ├──$ library_layout : chr
│ ├──$ extra : int
│ ├──$ ...
│ └──$ genes : data.frame
│ ├──$ IDs : chr
│ └──$ counts : num
│
└──$ annotation : data.frame
├──$ IDs : chr
├──$ SYMBOL : chr
├──$ GENENAME : chr
├──$ GENETYPE : chr
└──$ ...
A number of (hopefully reasonable) assumptions upon file and data organization are made for object construction to be successful.
- Each Series (or Study) is represented by two
CSVorTSVfiles, containing respectively the read counts (along with possible gene annotation) and metadata for all the Runs making up the Series. - All Series related to the same Model are stored in the same directory, here referred to as the target directory for xModel construction. Consistently, each Model is expected to live in a separate target directory.
- Both data and metadata files have names starting with the same Series
accession ID (any of the INSDC BioProject, ENA Study, or GEO Series format
is fine), followed by an underscore (
_) and either the distinctive stringCountMatrixormeta(ignoring case) if they are counts or metadata, respectively. Any additional characters after this pattern are allowed, as long as the terminal filename extension is eitherCSVorTSV(again ignoring case). Examples of well-formed filenames are:GSE205739_CountMatrix_genes_TPM.tsv PRJNA847413_countMATRIX.CSV PRJNA847413_meta.csv SRP379268_MeTaDaTa.TSV - Metadata tables have proper column names as header, among which only
ena_runis mandatory, as the only ID that is certainly unique to each FASTQ file in the original study (i.e., Series). - Count tables have proper column names as header, including a mandatory gene
ID column matching the regex
gene.*id|transcript.*id|ENSEMBL|ENSEMBLTRANS(usually beinggene_idortranscript_id). - Within the count table, each count-containing column is named using the
corresponding
ena_runID in the header (along with other possible text strings). - Input counts from
CountMatrixfile can be either raw or normalized counts, but in any case they are supposed to be in linear scale (not log-transformed)
Tip
Most of these requirements are automatically met if the low-level RNA-Seq data analysis is performed using x.FASTQ.