Antonescu 2010
Author Manuscript
Curr Protoc Bioinformatics. Author manuscript; available in PMC 2014 October 21.
Published in final edited form as:
NIH-PA Author Manuscript
Abstract
The DFCI Gene Index Web pages provide access to analyses of ESTs and gene sequences for
nearly 114 species, as well as a number of resources derived from these. Each species-specific
database is presented using a common format with a home page. A variety of methods exist that
allow users to search each species-specific database. Methods implemented currently include
nucleotide or protein sequence queries using WU-BLAST, text-based searches using various
sequence identifiers, searches by gene, tissue and library name, and searches using functional
classes through Gene Ontology assignments. This protocol provides guidance for using the Gene
Keywords
gene index database; gene index; databases; DFCI
INTRODUCTION
The DFCI Gene Index Web pages (http://compbio.dfci.harvard.edu/tgi/tgipage.html; Fig.
1.6.1) provide access to analyses of Expressed Sequence Tags (ESTs) and gene sequences
for over 114 species, as well as a number of resources derived from these. A summary of the
species currently represented can be found in the Appendix at the end of this unit; additional
species are regularly added to the collection based on the availability of EST data and user
requests. Each species-specific database is presented using a common format with a home
page similar to that shown in Figure 1.6.2. A variety of methods exist, listed immediately
below the heading “Search the Index by,” that allow users to search each species-specific
database. Methods implemented currently include searching of nucleotide or protein
sequences using WU-BLAST (see Basic Protocol 1), text-based searches using various
sequence identifiers (GenBank Accessions and Tentative Consensus (TC) identifiers),
searches by tissue and library names and gene names, and searches using functional classes
through Gene Ontology assignments (UNIT 7.2). In addition, a comprehensive annotation of
all ESTs in the database, based on the annotation of the TCs in which they are contained, is
provided.
The Eukaryotic Gene Ortholog database (EGO; see Basic Protocol 3), which uses DNA
sequence–based comparisons to identify tentative ortholog pairs by linking across the
various Gene Index databases, also provides a means of entry. In addition to providing for
sequence-, accession-, and gene name–based searches, the DFCI Gene Index is also cross-
referenced to the Online Mendelian Inheritance in Man (OMIM) database (UNIT 1.2),
Antonescu et al. Page 2
allowing users to link to Tentative Ortholog Groups (TOGs), and from there to
representative sequences in the individual gene index databases. RESOURCERER, designed
to annotate and cross-reference mammalian orthologs, as well as the Genome Viewers, also
The Gene Index Databases are constructed within a species-specific framework, and users
should keep this in mind while using this resource. Although some general search utilities
exist (such as BLAST searches; see Basic Protocol 1), most searches begin with a selection
of a target species (see Alternate Protocols 1 to 5). Each species has a distinct home page
that can be reached through a URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=xxxx, where xxxx is the
appropriate code from the “Common name”
column in Table 1.6.2 (see the Appendix at the end of this unit). Within the Gene Index for
each species, the primary resources available are detailed reports for each of the component
sequences, including the assembled TCs and the individual ESTs, as well as expressed
transcripts (ETs), which are typically annotated CDS features in GenBank records. In most
of the following protocols, the Maize Gene Index will be used as an example; similar tools
and pages exist for the other databases, although the appropriate gene index name must be
substituted in the queries (see Table 1.6.2 for the full list).
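The home-page URL pattern described above lends itself to scripting. The following is a minimal Python sketch that assumes only the URL form quoted in the text; the function name species_home_url is hypothetical and is not part of any DFCI tool.

```python
from urllib.parse import urlencode

# Base path taken from the URL pattern quoted in the text.
GIMAIN_URL = "http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl"


def species_home_url(gudb_code: str) -> str:
    """Return the home-page URL for one species-specific Gene Index.

    gudb_code is the species code listed in Table 1.6.2 (for example "maize").
    """
    code = gudb_code.strip()
    if not code:
        raise ValueError("a species code from Table 1.6.2 is required")
    # urlencode escapes any character that is not safe in a query string.
    return f"{GIMAIN_URL}?{urlencode({'gudb': code})}"


print(species_home_url("maize"))
```

For the maize example used throughout this unit, the function reproduces the home-page address given in Basic Protocol 1.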
The completion of a number of eukaryotic genomes provides the opportunity to search the
Gene Index databases by their physical location. A list of available genomes can be found
by following the Genomic Maps link on the DFCI home page to the mapping page,
http://compbio.dfci.harvard.edu/tgi/map.html. A detailed guide to doing such searches is provided
below (see Basic Protocol 2).
TCs from one species can also be found through the mapping of possible orthologs. The
Eukaryotic Gene Ortholog (EGO) database catalogs tentative ortholog groups based on
shared DNA sequence using pairwise reciprocal best matches between species. Details on
using this resource are also included in the unit (see Basic Protocol 3).
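The idea of a pairwise reciprocal best match can be illustrated with a short Python sketch. It assumes that similarity-search hits between two species are already available as (query, subject, score) tuples; the function names, the tuple layout, and the example identifiers are hypothetical, and the sketch does not reproduce the stringency criteria that EGO itself applies.

```python
from typing import Dict, List, Set, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, alignment score)


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query sequence to its highest-scoring subject sequence."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_pairs(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Hypothetical example: two maize TCs searched against human THCs and back.
maize_vs_human = [("TC1", "THC9", 500.0), ("TC1", "THC7", 300.0), ("TC2", "THC9", 450.0)]
human_vs_maize = [("THC9", "TC1", 500.0), ("THC7", "TC1", 300.0)]
print(reciprocal_best_pairs(maize_vs_human, human_vs_maize))
```

In this example only TC1 and THC9 are retained, because TC2 selects THC9 as its best match but THC9 does not select TC2 in return.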
The protocols below provide examples of ways to use the Gene Index Databases to extract
and explore the information they provide. The examples are not meant to be exhaustive, but
rather illustrative. Users should note that new features and new species are continuously
being added and that updated versions of these databases are released every four months
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1. Open the BLAST search page (Fig. 1.6.3) in the DFCI Gene Indices Web site by
b. Click on the BLAST link under the “Sequence Similarity Search” heading
on the DFCI Maize Gene Index home page
(http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=maize; Fig.
1.6.2) or the corresponding home page for another species.
2. From the Program pull-down menu, select the search program to run: BLASTN
(UNIT 3.3) for a nucleotide query sequence or TBLASTN (UNIT 3.4) for a protein
query, which will be searched against the six-frame translation of the appropriate
SAGE10 and SAGE14, also included in the Program pull-down menu, are
BLASTN searches using parameters optimized to search SAGE tags 10
and 14 nucleotides in length, respectively.
3. From the Database pull-down menu, select an appropriate target database; more than
one database can be selected at a time by holding down the Control key
while clicking within the list.
The TGI BLAST server does not presently allow multiple sequences to be
searched simultaneously, although such a utility is under development.
Note that although there is no a priori limit on sequence length, some
browsers may time out during searches of long sequences.
5. Users may also select options other than the defaults for various parameters,
including Alignments (using the pull-down menu directly below the Program
pull-down menu), and Matrix, Filter, Expect, Cutoff, Strand, Descriptions, Wordlength,
Echofilter, Graphical Overview, and Ignore Hypotheticals (using the pull-down
menus near the bottom of the page).
6. Users are also provided with the option of supplying an e-mail address where they
will be notified when the search is completed.
7. Standard BLAST search results are returned with alignments. Hyperlinks have been
added to each of the identified target sequences. Target sequence names are
specified in one of three formats depending on their source:
For TC:
For EST:
species|est name
For ET:
8. Click on the name of the target sequence to retrieve the corresponding TC report.
Review the selected TC report (see Guidelines for Understanding Results).
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1. Starting from a species home page (e.g., Fig. 1.6.2), click on the “Identifiers or
Keywords” link to open the search page.
2. For the identifier chosen, complete the appropriate entry on the form and click the
GO button. Be aware that each of the three types of identifier has a slightly
different specification:
For TC:
For ET:
GB# can be either the GenBank accession of a sequence containing an
annotated CDS, or the corresponding GenPept protein sequence accession.
NP#, the DFCI accession for each CDS feature parsed from GenBank records,
can be of the form NPxxxxx where the HT/ET designators are used to maintain
continuity with the DFCI qcGene database
(http://compbio.dfci.harvard.edu/tgi/qcGene.html).
For EST:
3. For TC number searches, the standard TC report (see Guidelines for Understanding
Results) is returned. For ET and EST searches, the search provides a sequence
report page, with links to the relevant TC report if the ET or EST sequence is not a
singleton.
The search returns a table with information about the query sequence,
including links to the TC in which that sequence is contained.
Gene Ontology (GO; Ashburner et al., 2000; UNIT 7.2) terms provide classification for
proteins based on three classes: Molecular Function, Biological Process, and Cellular
Component. GO terms and Enzyme Commission (EC) numbers are assigned to TCs by
using BLASTX (UNIT 3.4) to compare the sequence to the SwissProt database and then
using a SwissProt-to-GO translation table provided by the GO Consortium
(http://www.geneontology.org). For inexact matches, the DFCI Gene Indices are conservative and
assign more general terms so as to avoid misclassification. It should be noted that because
GO is evolving, many genes and TCs have not as yet been assigned precise classifications.
The GO terms can be used within any species to find those TCs likely to have a specific
function or to be involved in particular processes.
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1. Starting from a species home page (e.g., Fig. 1.6.2), click on the Gene Ontology
link under the “Functional Annotation and Analysis” heading to open the Gene
Ontology Assignments page.
Each of the species represented in the TGI has assigned GO terms (UNIT
7.2), and the GO assignments are summarized on a page accessible at a
URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/GObrowser.pl?species=Maize&gidir=zmgi
(see Fig. 1.6.5). This page lists the number of TCs in each class, and a bar
graph shows the fraction of all TCs, and of those with GO assignments,
falling into each class.
2. Clicking on the functional category of interest, such as “reproduction,” brings up a
GO browser (Fig. 1.6.6), which shows both of the subclasses that fall into that
category. Each line includes the current level, child ids, child GO terms, the
number of TCs at this level, and the number of subtree TCs. Clicking on underlined
entries in the last two columns brings up a list of TCs within that classification
along with EC numbers linked to the KEGG metabolic pathway database (Kanehisa
and Goto, 2000).
mapped transcript set, the ECs are mapped to the appropriate genome using e-PCR (Schuler,
1997) and the marker data available from a variety of sources (see
http://compgen.rutgers.edu/EnhancedMaps/Default.aspx for a summary). The following method
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
2. Select the chromosome to view and set the number of records to be displayed on
each page.
3. Figure 1.6.8 shows the resulting table containing columns for TC#, Marker, 5′
marker position in TC, 3′ marker position in TC, Panel, Chromosome location, and
P-value (from the RH map).
Here, the 5′ and 3′ positions refer to where the mapped RH marker falls
within the mapped TC. Users will be most interested in examining the RH
map location, which provides relative coordinates on the chromosome.
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1b Alternatively, starting from a species home page (e.g., Fig. 1.6.2), click on the
“Libraries” link under the “Sequence Reports” heading. Figure 1.6.9 shows an
example from maize.
2a To identify TCs in a given tissue: In the top section of the page (see the “Search
for Tissue Specific Transcripts” section in Fig. 1.6.9), specify a tissue or organ
of interest and a minimum percentage for representation of ESTs from that
For example, specifying “root” and 50% will return all TCs in which
more than 50% of the ESTs are from root. Clicking on the Search
button returns a table formatted to include:
2nd column: the number of ESTs from specified tissue or organ and the
total number of ESTs within that TC.
5th column: the number of ESTs from each specific library within this
TC.
6th column: the number of ESTs from component libraries for all TCs.
2b To identify TCs associated with a keyword: In the upper middle section of the
page (see the “Search cDNA Libraries by Keyword” section in Fig. 1.6.9), enter
one or more keywords.
A list of all libraries annotated with those terms is returned, with links
to the appropriate library reports.
2c To identify TCs associated with library identifiers: In the lower middle section
of the page (see the “Search cDNA Libraries by Library Identifier” section in
Fig. 1.6.9), enter the library identifier.
Users can also retrieve library reports by searching the Gene Index
This produces a list of libraries from which the user can select those of
interest (Fig. 1.6.10).
Users should note that tissue designations come from the library
annotation provided in GenBank records, and, as such, the same tissue
may be represented by different tissue terms. Users can therefore select
multiple tissues for each of the two groups they wish to compare.
Clicking on the “Get Expression” button returns a graphical matrix
representation of expression in which each row represents a TC and the
columns show the R statistic, the TC#, the number of ESTs in that TC, the
number of ESTs found in libraries selected in group A, and the number
of ESTs found in libraries selected in group B (Fig. 1.6.11).
The results illustrated in Figure 1.6.11 were obtained by selecting tissue
type “aerial, root, whole plant” for Group A, and “aleuron layer” for
group B.
Significant differential expression is identified using the “R statistic”
(Stekel et al., 2000); a large R for a TC indicates that there is a
significant bias toward one or more libraries in that TC.
A–B contains all TCs with more than one EST in A and zero or one
EST in B.
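The two quantities mentioned in these notes can be illustrated with a short Python sketch. It assumes the log-likelihood form of the R statistic published by Stekel et al. (2000), R = Σ x_i ln[x_i / (N_i f)], where x_i is the number of ESTs from library (or library group) i in the TC, N_i is the total number of ESTs in that library, and f = Σ x_i / Σ N_i. The function names and the input layout are hypothetical, and the Web site's own calculation may differ in detail.

```python
import math
from typing import Dict, List, Sequence, Tuple


def r_statistic(est_counts: Sequence[int], library_sizes: Sequence[int]) -> float:
    """R statistic for one TC across several libraries or library groups.

    est_counts[i]    -- ESTs from library i that fall in this TC
    library_sizes[i] -- total number of ESTs sequenced from library i
    """
    if len(est_counts) != len(library_sizes):
        raise ValueError("one EST count is needed per library")
    total_ests = sum(library_sizes)
    if total_ests <= 0:
        raise ValueError("library sizes must sum to a positive number")
    # Expected frequency of this TC if expression were equal in every library.
    freq = sum(est_counts) / total_ests
    r = 0.0
    for x, n in zip(est_counts, library_sizes):
        if x > 0:  # a zero count contributes nothing to the sum
            r += x * math.log(x / (n * freq))
    return r


def a_minus_b(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """TCs with more than one EST in group A and zero or one EST in group B."""
    return sorted(tc for tc, (in_a, in_b) in counts.items() if in_a > 1 and in_b <= 1)


# Hypothetical counts: (ESTs in group A, ESTs in group B) for each TC.
example = {"TC1": (4, 0), "TC2": (3, 2), "TC3": (1, 0), "TC4": (2, 1)}
print(a_minus_b(example))
print(round(r_statistic([10, 0], [1000, 1000]), 3))
```

A TC whose ESTs are spread evenly across equally sized libraries gives R = 0, whereas a TC found in only one of the libraries gives a larger value.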
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1. Starting from the appropriate Gene Index home page (e.g., Fig. 1.6.2), select the
“Metabolic Pathways” link under the “Functional Annotation and Analysis”
heading to produce a graphical representation of a number of metabolic pathways.
BASIC PROTOCOL 2: USING THE GENOMIC MAPS WITH THE DFCI GENE INDICES
Completed or draft genome sequences are now available for a number of eukaryotic species,
including Anopheles gambiae, Bos taurus, Caenorhabditis elegans, Danio rerio, Drosophila
melanogaster, Homo sapiens, Gallus gallus, Macaca mulatta, Mus musculus, Pan
troglodytes, Rattus norvegicus, Arabidopsis thaliana, Oryza sativa (rice BACs from the
approximately localized within relevant genomes using MegaBLAST or BLAT, with final
alignments performed using gap2, which incorporates splicing rules and is optimized for
transcript-to-genome alignments. Mapping information is stored in a relational database and
used to create user-friendly Web displays. Table 1.6.1 lists the genomes currently
represented and the Gene Indices that are mapped to each genome.
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
1.6.1).
orthologous TCs, as well as an indexed list of TCs linked to disease-associated human genes
through the Online Mendelian Inheritance in Man (OMIM; UNIT 1.2) database. EGO is
based on the results of high-stringency pairwise sequence comparisons between the TCs and
singleton ETs from all TGI databases. Tentative Ortholog Groups (TOGs) are constructed
using a transitive, reflexive closure process based on the assumption of parsimony to
associate sequence-specific best hits, with the requirement that three sequences from
separate species must be represented. Some TCs may belong to multiple TOGs, although
TOGs containing significant overlap in their membership are merged. The result is that in
some instances paralogous sequences appear in the same TOG, particularly if a sequence
from a primitive eukaryote such as yeast is represented in the TOG. Each TOG is assigned a
unique accession number (TOG #) that can be used to reference the collection of sequences.
EGO has been a valuable tool for identifying orthologs of known genes as well as those
existing only as uncharacterized ESTs.
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
3.3 & 3.4), using TOG numbers, using gene names, or using TCs from any of the
species within EGO. The next page will be a list of orthologs. The title (Tentative
Ortholog xxxx) is a link to a more detailed report. A representative TOG report for
a putative transcription factor gene is shown in Figure 1.6.14A,B.
3. The Orthologs of Human Disease Genes link from the EGO home page allows
searches by OMIM Identifier, OMIM Locus ID, Gene name (such as CDK2,
cyclin-dependent kinase 2), GenBank Accession number, DFCI Accession Number (for
human only), or EGO Identifier.
4. TOG reports have three main parts as shown in Figure 1.6.14 (A,B). At the top is a
table listing the component TCs with putative annotation and links to the
component TC reports. There is also a graphic representing the connections
between the component sequences used for constructing the TOG. Below the TOG
is a table listing the results of all pair-wise searches contributing to the TOG, with
percent identity, match length, p-value, and asterisks marking reciprocal best hits.
sets, long oligo sets from Operon/Qiagen and Compugen/Sigma, and the Affymetrix
GeneChips for representative species.
Necessary Resources
Hardware—Computer with Internet access
Software—Web browser
Either the Rearray ID assigned by the clone set developer or the Affymetrix
Probe ID, as appropriate
UniGene IDs
Physical map location based on alignments of the DFCI TCs to the appropriate
draft genome sequence
Putative annotation
RESOURCERER page (Fig. 1.6.15), simply select a first resource as Data Set A
and the “Compare to another resource” radio button. On the next page, you will
select the second source—Data Set B. Also, select whether the comparisons should
be made through EGO (and the TGI), which returns valid comparisons either
within a single species or between species, or UniGene IDs, which is only valid for
comparisons within a single species. Finally, select the type of comparison:
whether the search should return those elements in common to both Data Set A and
Data Set B (Intersection), those unique to Data Set A (A unique), or those unique to
Data Set B (B unique).
Clicking the Get Table button returns a cross-reference table that contains
annotations similar to those shown in Figure 1.6.17, again with appropriate
links to other databases.
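The three comparison types offered by RESOURCERER correspond to ordinary set operations on a shared identifier. The Python sketch below illustrates this under the assumption that each resource has already been reduced to a mapping from array element ID to the identifier used for the comparison (a TOG number or a UniGene ID); the function name, the mode labels, and the example IDs are hypothetical.

```python
from typing import Dict, Set


def compare_data_sets(set_a: Dict[str, str], set_b: Dict[str, str], mode: str) -> Set[str]:
    """Compare two array resources through a shared identifier.

    set_a, set_b -- map each array element ID to the identifier used for
                    the comparison (for example a TOG number or UniGene ID)
    mode         -- "intersection", "a_unique", or "b_unique"
    """
    keys_a = set(set_a.values())
    keys_b = set(set_b.values())
    if mode == "intersection":
        return keys_a & keys_b  # identifiers represented on both resources
    if mode == "a_unique":
        return keys_a - keys_b  # identifiers found only in Data Set A
    if mode == "b_unique":
        return keys_b - keys_a  # identifiers found only in Data Set B
    raise ValueError(f"unknown comparison type: {mode}")


# Hypothetical element-to-identifier tables for two resources.
data_set_a = {"cloneA1": "TOG100", "cloneA2": "TOG200", "cloneA3": "TOG300"}
data_set_b = {"probeB1": "TOG200", "probeB2": "TOG400"}
print(compare_data_sets(data_set_a, data_set_b, "intersection"))
```

Comparing through TOG numbers allows the two resources to come from different species, whereas a comparison through UniGene IDs is meaningful only within a single species, as noted in the step above.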
Examining a TC Report
The TC sequences are the central elements in the DFCI Gene Index databases. The TCs are
assembled from EST and gene sequences and represent likely transcripts encoded within a
particular genome. In that sense, the TCs are distinct from clusters in other approaches such
as UniGene in that alternative splice forms and gene family members are more likely to be
represented by separate objects in our databases. This has some advantages and
disadvantages based on one’s application of interest. In principle, with a large-enough
collection of ESTs, sequences representing a wide variety of tissues, developmental stages,
and disease states, the Gene Index databases would reconstruct the entire transcriptome of a
particular organism.
Figure 1.6.18A,B,C shows a representative TC with the annotation and features provided in
each Gene Index. TCs are indexed by an accession of the form TCyyyy, where yyyy is a
number that is simply a sequential identifier assigned each time each database is rebuilt. For
each species-specific database, TC numbers are never reused. However, TC numbers are
tracked through subsequent builds so that users with a TC number from a previous release of
the database can get the current representation of that particular transcript.
TC reports can be accessed in a variety of ways, many of which are detailed below.
However, users can link directly to any TC report by entering a URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/tcreport.pl?species=xxx&tc=yyyy, where xxx is the
common species name from the first column of Table 1.6.2 (typed exactly as in the
table, but without spaces) and yyyy is the TC number of interest. If a TC number from a
previous build is used, the corresponding TC from the current build is provided. Note that
for human, TC is replaced by THC.
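A direct link to a TC report can be assembled programmatically. The Python sketch below follows the URL form given above and accepts either a bare number or an accession carrying a TC or THC prefix; the function name tc_report_url and the decision to strip the prefix before building the query are assumptions made for illustration.

```python
import re
from urllib.parse import urlencode

# Base path taken from the URL pattern quoted in the text.
TCREPORT_URL = "http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/tcreport.pl"


def tc_report_url(species: str, accession: str) -> str:
    """Build a TC report URL from a species name and a TC accession.

    species   -- common species name from Table 1.6.2; spaces are removed
    accession -- "TC12345", "THC12345", or just the number "12345"
    """
    name = "".join(species.split())
    if not name:
        raise ValueError("a species name from Table 1.6.2 is required")
    match = re.fullmatch(r"(?:TC|THC)?(\d+)", accession.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a recognizable TC accession: {accession!r}")
    # Only the numeric part (yyyy in the text) is placed in the query string.
    return f"{TCREPORT_URL}?{urlencode({'species': name, 'tc': match.group(1)})}"


print(tc_report_url("maize", "TC12345"))
```

Because TC numbers from earlier builds are tracked, a link built from an old accession should, according to the text, lead to the corresponding TC in the current build.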
3. Below the predicted ORFs is a map representing the individual sequences that
comprise the TC, showing their approximate position in the TC and their relative
lengths. Each sequence is represented by an arrow showing orientation, and paired
reads from the same clone are linked by a dotted line. Annotated mRNA sequences
are highlighted in pink. All sequences are numbered and indexed to a table of
linked identifiers, which immediately follows the map (Fig. 1.6.18A).
4. A table lists the individual sequences comprising the TC, indexed by numbers
appearing in the sequence map (Fig. 1.6.18B). Each row in the table represents a
particular EST or gene sequence and these are annotated with a source laboratory
(wherever possible), a sequence ID, a GenBank accession, clone name, the 5′
position in the TC, 3′ position in the TC, and source library annotation. Wherever
possible, these entries are linked to other databases or sources of information. The
sequence ID is linked to an EST report or ET report page at DFCI, the GenBank
accession is linked to a sequence record at NCBI, and the clone name, wherever
possible, is linked to a public clone repository. Immediately following the list is a
key to the clone source codes used, showing the laboratories from which the clone
5. Assembling the TCs can produce consensus sequences with arbitrary orientation.
Annotated information about the component sequences, including the
presence of mRNA sequences and the 3′ and 5′ orientation of the ESTs, is
used to identify the appropriate orientation of the TC, and the evidence
used for that determination is provided (Fig. 1.6.18B).
6. Alternative splice forms, identified through alignment of TCs within each TGI
database, can be found by clicking on the “Alternative Splice Forms” button (when
this information is available).
7. An expression summary, based on the libraries from which the ESTs were derived,
can be found by clicking on the Expression Summary button (Fig. 1.6.18C).
8. Putative gene identification is made using a variety of methods. First, TCs are
annotated using the names associated with any mRNA sequences they contain; this
is listed as the Putative ID for each TC. The consensus sequences are also searched
against a nonredundant protein database; the top five hits are listed and a controlled
vocabulary is used with these to assign a name to each (Fig. 1.6.18C).
9. The “GO annotation” lists assignments based on the Gene Ontology project’s
classifications (http://www.geneontology.org; UNIT 7.2). TCs are searched against
SwissProt and SwissProt-to-GO tables to provide conservative assignments based
on the level of sequence homology (Fig. 1.6.18C).
10. Potential orthologs are identified through the EGO database. A detailed description
of the EGO database is provided in Basic Protocol 3.
COMMENTARY
Background Information
The goal of any genome project is the identification and functional characterization of the
entire catalog of the genes encoded within a particular genome. Although genome
sequencing projects in human, mouse, Arabidopsis, and other eukaryotic species have
generated a wealth of data, identification of the genes encoded in the sequence and
assignment of function to these remains a significant challenge. Nowhere is that more
apparent than in the two completed drafts of the human genome (International Human
Genome Sequencing Consortium, 2001; Venter et al., 2001), where an independent analysis
of the competing annotations has found that many of the gene predictions, other than for
previously known genes, are disjoint (Hogenesch et al., 2001). Indeed, it is becoming
increasingly clear that the completion of a genome sequence is only a starting point and that
significant additional analysis is required before one can declare its annotation, and the
genome itself, complete.
The sequencing of ESTs continues to supply important insight into the transcribed genes in a
wide variety of species and has become a widely used approach to gene discovery and the
analysis of gene expression. ESTs are the most extensive available survey of the transcribed
portion of the eukaryotic genomes; there are currently more than 10,000,000 ESTs in
GenBank, nearly 45% of which are human and 75% of which represent higher mammals
(human, mouse, rat, cattle, and pig;
http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary_html). For many species, ESTs remain the primary source of gene
sequence data and provide a basic survey of gene expression in various tissues, as well as in
various developmental and disease states. ESTs have also proven their value in genome
annotation as they provide experimental evidence for the presence of the genes, their
Obtaining data from the DFCI Gene Index databases through FTP—As an
alternative to using the Web site, flat-file versions of all of the DFCI databases are available
through FTP links on each Gene Index home page and the EGO page; RESOURCERER flat
files can be downloaded through the Web site. The Gene Index download files include:
1. A multi FASTA file with TC sequences (annotation in the defline) and singleton
sequences.
2. A FASTA file, containing the complete set of TC sequences for that species with
TC identifiers from previous builds in the definition line.
3. A tab-delimited file containing the TC identifiers and the ESTs that comprise them.
The FASTA files can be used to create local BLAST databases, or used for other
purposes. The file that includes TC numbers and the list of ESTs can be used for
linking ESTs to the TCs that contain them.
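As an illustration of linking ESTs to TCs, the following Python sketch builds an EST-to-TC lookup from the tab-delimited download. The exact column layout of that file is not described here, so the sketch assumes one TC per line, with the TC identifier in the first field and the component EST identifiers in the remaining tab-separated fields; the function name and the example identifiers are hypothetical, and the parsing should be adjusted to the actual file.

```python
from typing import Dict, Iterable


def est_to_tc_index(lines: Iterable[str]) -> Dict[str, str]:
    """Build an EST -> TC lookup from tab-delimited text.

    Assumed layout (not confirmed by the source): the first field is the TC
    identifier and every following field is one component EST identifier.
    """
    index: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        tc_id = fields[0]
        for est_id in fields[1:]:
            if est_id:
                index[est_id] = tc_id
    return index


# Hypothetical file content; a real file would be read with open(path).
example_lines = ["TC100\tEST_a\tEST_b\n", "TC200\tEST_c\n", "\n"]
lookup = est_to_tc_index(example_lines)
print(lookup["EST_b"])
```

Any iterable of lines can be passed in, so an open file handle works as well as the in-memory list used in the example.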
Assembly of the Gene Index databases—The DFCI Gene Indices are assembled
independently for each species using a “divide and conquer” approach in which ESTs are
first placed in clusters based on sequence similarity and then assembled on a cluster-by-
cluster basis to produce Tentative Consensus (TC) sequences (Liang et al., 2000;
Quackenbush et al., 2001). A schematic overview of this process is shown in Figure 1.6.19
and a software implementation of the clustering and assembly tools used, TGICL, is freely
available (Pertea et al., 2003; http://compbio.dfci.harvard.edu/tgi/software/). TGICL is an
open-source pipeline for analysis of large EST and mRNA databases, in which sequences
are first clustered based on pairwise sequence similarity and then assembled to produce the
TC sequences.
Briefly, ESTs and coding gene sequences are first downloaded from dbEST and parsed from
GenBank records. The annotated CDS features in GenBank records are assigned NP (for
nucleotide-protein) identifiers to provide a unique accession for each coding DNA sequence;
some GenBank records have multiple annotated coding features. Sequences are trimmed to
remove vector, poly(A/T) tails, adaptor sequences, and contaminating bacterial sequences.
Clustering begins by indexing a multi-FASTA-formatted sequence database and performing
all-versus-all pairwise similarity searches. The authors use mgblast, a modified version of
the megablast program (Zhang et al., 2000), for this purpose. The mgblast program differs
from the original megablast program in that it produces a simple tab-delimited output, uses
specific output filtering options such as minimum overlap length and identity, and allows the
use of a dynamic offset within the database when performing incremental searches of
portions (slices) of the database against itself. Each line in the mgblast output represents
one identified overlap between two sequences in the database. The search results are sorted
in order of decreasing pairwise alignment score. The sequence overlaps are filtered using
user-defined criteria: the minimum overlap length (default 40 base pairs), the minimum
percent identity for the overlap (default 95%), and the maximum mismatched overhang
allowed around the overlap (dynamically adjusted for long sequences and long overlaps; the
default value starts at 30 nucleotides). Based on the results of these similarity searches,
sequences are grouped into clusters using a transitive closure approach and a graph
representation in which the sequences are the graph nodes and the alignments represent
edges (Pertea et al., 2003); the resulting clusters represent the connected subgraphs within
the dataset.
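The filtering and transitive-closure steps described above can be sketched in a few lines of Python. This is an illustration only, not the TGICL implementation: it uses a simple union-find structure to collect connected subgraphs, applies the default thresholds quoted in the text as fixed values (the dynamic adjustment of the overhang limit is not reproduced), and the Overlap record and function name are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class Overlap(NamedTuple):
    seq_a: str
    seq_b: str
    length: int      # overlap length in base pairs
    identity: float  # percent identity within the overlap
    overhang: int    # longest mismatched overhang flanking the overlap


def cluster_sequences(
    seq_ids: Iterable[str],
    overlaps: Iterable[Overlap],
    min_length: int = 40,
    min_identity: float = 95.0,
    max_overhang: int = 30,
) -> List[List[str]]:
    """Group sequences into clusters (connected subgraphs) by transitive closure."""
    parent: Dict[str, str] = {s: s for s in seq_ids}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for ov in overlaps:
        # Keep only overlaps (graph edges) that pass all three filters.
        if ov.length >= min_length and ov.identity >= min_identity and ov.overhang <= max_overhang:
            root_a, root_b = find(ov.seq_a), find(ov.seq_b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for s in list(parent):
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical overlaps: e3-e4 is too short (35 bp) and is filtered out.
hits = [
    Overlap("e1", "e2", 120, 98.0, 5),
    Overlap("e2", "e3", 60, 96.5, 10),
    Overlap("e3", "e4", 35, 99.0, 0),
]
print(cluster_sequences(["e1", "e2", "e3", "e4"], hits))
```

In the example, e1 and e3 share no direct overlap but fall into the same cluster through e2, which is the transitive closure behavior described in the text; e4 remains a singleton.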
This clustering stage is an important step if one then wants to assemble the expressed
sequences to reconstruct the transcripts they represent. Most sequence-assembly programs
were developed for genomic applications and face particular difficulty in dealing with the
challenges presented by ESTs, including extremely deep and uneven coverage from diverse
biological sources, low-quality sequences often without quality scores, relatively frequent
chimerism, and a moderately high rate of vector and adapter contamination. Further, while
many DNA sequence assembly programs assemble contigs from large numbers of
sequences, they can easily be overwhelmed by a very large unpartitioned dataset and
produce incorrect chimeric assemblies (Liang et al., 2000).
Makalowski and Boguski (1998) conducted what was at the time the most comprehensive
survey of eukaryotic orthologs available. Their dataset contained 1880 rodent-human
ortholog pairs and 470 sequences shared by all three species. Their analysis of both the
coding and noncoding regions indicated not only that the coding regions are highly conserved
in mammals at both the DNA and protein levels but also, more surprisingly, that the flanking 5′ and 3′
noncoding regions are extremely well conserved, and that the evolutionary distances
estimated for the 5′ and 3′ UTRs are similar and generally indistinguishable from those for
synonymous coding sites. This suggested to the authors of this unit that EST sequences,
derived primarily from the 3′ UTR, could be used to identify orthologs in closely related
species. Based on this observation, and the fact that the TC sequences within the DFCI Gene
Index databases represented the most comprehensive survey of eukaryotic gene sequences
available at the time, the authors began construction of the Eukaryotic Gene Ortholog (EGO;
Lee et al., 2002) database in 1999. EGO has allowed identification and cataloging of more
than 86,630 tentative orthologous groups in eukaryotes and it provides a tool for cross-
referencing other genomic resources, including commonly used resources for DNA
microarrays (Tsai et al., 2001).
identifying corresponding array elements between species and platforms. To address these
issues, the authors of this unit developed RESOURCERER
(http://compbio.dfci.harvard.edu/tgi/cgi-bin/magic/p1.pl), a utility designed to provide annotation
for and comparisons between widely used microarray platforms. RESOURCERER provides
information for the most widely used mammalian microarray gene resources, including the
Research Genetics Sequence Verified Human cDNA clone set, the BMAP and NIA mouse
clone sets, the DFCI Rat Gene Index cDNA collection, human and mouse 70-mer oligo sets
from Operon, and the Affymetrix Human, Mouse, and Rat GeneChip sets.
Literature Cited
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS,
Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE,
Ringwald M, Rubin GM, Sherlock G. Gene ontology: Tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet. 2000; 25:25–29. [PubMed: 10802651]
Boguski MS, Schuler GD. Establishing a human transcript map. Nat Genet. 1995; 10:369–371.
[PubMed: 7670480]
Cariaso M, Folta P, Wagner M, Kuczmarski T, Lennon G. IMAGEne I: Clustering and ranking of
I.M.A.G.E. cDNA clones corresponding to known genes. Bioinformatics. 1999; 15:965–973.
[PubMed: 10745985]
Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W. STACK: Sequence Tag
Alignment and Consensus Knowledgebase. Nucleic Acids Res. 2001; 29:234–238. [PubMed:
11125101]
Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970; 19:99–113.
[PubMed: 5449325]
Hatzigeorgiou AG, Fiziev P, Reczko M. DIANA-EST: A statistical analysis. Bioinformatics. 2001;
17:913–919. [PubMed: 11673235]
Hogenesch JB, Ching KA, Batalov S, Su AI, Walker JR, Zhou Y, Kay SA, Schultz PG, Cooke MP. A
comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes.
Cell. 2001; 106:413–415. [PubMed: 11534548]
Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. 1999; 9:868–877.
[PubMed: 10508846]
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
genome. Nature. 2001; 409:860–921. [PubMed: 11237011]
Iseli C, Jongeneel CV, Bucher P. ESTScan: A program for detecting, evaluating and reconstructing
potential coding regions in EST sequences. ISMB ’99 (Proceedings of the 7th International
Conference on Intelligent Systems for Molecular Biology). Menlo Park, Calif: AAAI Press; 1999.
p. 138–148.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;
28:27–30. [PubMed: 10592173]
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai T, Parvizi B, Cheung F, Antonescu V, White
J, Holt I, Liang F, Quackenbush J. Cross-referencing eukaryotic genomes: TIGR Orthologous
Gene Alignments (TOGA). Genome Res. 2002; 12:493–502. [PubMed: 11875039]
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J. An optimized protocol for
analysis of EST sequences. Nucleic Acids Res. 2000; 28:3657–3665. [PubMed: 10982889]
Makalowski W, Boguski MS. Evolutionary parameters of the transcribed mammalian genome: An
analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci USA. 1998;
95:9407–9412. [PubMed: 9689093]
Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F,
Parvizi B, Tsai J, Quackenbush J. TIGR Gene Indices clustering tools (TGICL): A software
system for fast clustering of large EST datasets. Bioinformatics. 2003; 19:651–652. [PubMed:
12651724]
Quackenbush J, Liang F, Holt I, Pertea G, Upton J. The TIGR gene indices: Reconstruction and
representation of expressed gene sequences. Nucleic Acids Res. 2000; 28:141–145. [PubMed:
10592205]
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White
J. The TIGR Gene Indices: Analysis of gene transcript sequences in highly sampled eukaryotic
species. Nucleic Acids Res. 2001; 29:159–164. [PubMed: 11125077]
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with
complementary DNA microarray. Science. 1995; 270:467–470. [PubMed: 7569999]
Schuler GD. Sequence mapping by electronic PCR. Genome Res. 1997; 7:541–550. [PubMed:
9149949]
Smith TP, Grosse WM, Freking BA, Roberts AJ, Stone RT, Casas E, Wray JE, White J, Cho J,
Fahrenkrug SC, Bennett GL, Heaton MP, Laegreid WW, Rohrer GA, Chitko-McKown CG, Pertea
G, Holt I, Karamycheva S, Liang F, Quackenbush J, Keele JW. Sequence evaluation of four
pooled-tissue normalized bovine cDNA libraries and construction of a gene index for cattle.
Genome Res. 2001; 11:626–630. [PubMed: 11282978]
Stekel DJ, Git Y, Falciani F. The comparison of gene expression from multiple cDNA libraries.
Genome Res. 2000; 10:2055–2061. [PubMed: 11116099]
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res. 1994; 22:4673–4680. [PubMed: 7984417]
Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F,
Quackenbush J. RESOURCERER: A database for annotating and linking microarray resources
within and across species. Genome Biol. 2001; 2:software0002.1–software0002.4. [PubMed:
16173164]
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science.
1995; 270:484–487. [PubMed: 7570003]
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA,
Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD,
Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson
C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M,
Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L,
Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh
J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R,
Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan
W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei
Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan
VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A,
Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H,
Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter
C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I,
Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets
R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C,
Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L,
Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L,
Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers
YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas
R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen
E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B,
Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato
S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J,
Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A,
Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K,
Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J,
Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J,
Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S,
Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T,
the Expressed Sequence Tags (ESTs) and the Expressed Transcripts (ETs) into Tentative
Consensus (TC) sequences. Singletons (sET and sEST) are ET/EST sequences that are not
incorporated into any of the TCs during assembly. TCs, sETs, and sESTs represent
potentially unique sequences in TGI. As of June 2003, there were 61 species represented by
a Gene Index database. Each line in the table provides information about a single database
and includes a common name, species name, gene index name and version, the total number
of TCs in the current release, and the number of singleton ETs and singleton ESTs. For
some of the Gene Indices, ESTs were pooled from dbEST for the genus, not a single species.
The table is broken into four groups representing animals (42 species), plants (47), fungi
(10) and protists (15).
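As an illustration of how one row of the table can be read, the sketch below models a row as a small Python record and sums the TC, singleton ET, and singleton EST counts to give the number of potentially unique sequences. The class name, the field names, and the counts in the usage line are hypothetical placeholders, not values taken from the table or from any release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneIndexRow:
    """One row of the summary table (field names are illustrative assumptions)."""
    common_name: str
    species: str
    index_name: str
    version: str
    tc_count: int
    singleton_et_count: int
    singleton_est_count: int

    @property
    def unique_sequences(self) -> int:
        # TCs, sETs, and sESTs each count as one potentially unique sequence.
        return self.tc_count + self.singleton_et_count + self.singleton_est_count


# Placeholder numbers for illustration only; they are not from any Gene Index.
example = GeneIndexRow("example organism", "Genus species", "XxGI", "1.0",
                       1000, 50, 2500)
print(example.unique_sequences)
```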
Table 1.6.2
Figure 1.6.1.
The DFCI Gene Index home page at http://compbio.dfci.harvard.edu/tgi/tgipage.html has
links to the 114 species-specific databases currently available. Other resources available
include the Eukaryotic Gene Ortholog (EGO) database, the RESOURCERER utility for
annotating and cross-referencing mammalian microarray resources, and maps of the TCs to
completed genome sequences.
Figure 1.6.2.
The home page for the Maize Gene Index.
Figure 1.6.3.
The BLAST search page allows users to query any of the DFCI Gene Index databases, as
well as the EGO and RESOURCERER databases, using protein or DNA sequences.
Figure 1.6.4.
The main search page for the Maize Gene Index allows users to search the database using a
variety of accession numbers, including DFCI TC number, a Transcript Identifier, GenBank
Figure 1.6.5.
Gene Ontology (GO) terms and Enzyme Commission (EC) identifiers are assigned to the
TCs to provide functional annotation and to provide links to metabolic pathway databases.
Figure 1.6.6.
The GO browser shows the hierarchy of functional assignments for TCs identified as
members of a particular functional class.
Figure 1.6.7.
For human, mouse, and rat, TCs are mapped to their respective genomes using the available
radiation hybrid maps.
Figure 1.6.8.
RH Mapping Data. A snippet of Mouse TCs containing markers mapped to chromosome 1.
Figure 1.6.9.
The expression summary page allows each Gene Index database to be explored using
information on the libraries from which the ESTs were derived.
Figure 1.6.10.
The Expression Search page allows the frequency of ESTs from various libraries to be
compared in order to identify differentially expressed genes based on the sources of libraries
from which the ESTs were derived.
Figure 1.6.11.
An example of a library-based expression comparison. The relative abundance of ESTs is
depicted using a hot/cold (red/blue) color map and significant differences between classes of
ESTs are denoted by the associated R statistic (Stekel et al., 2000). For the color version of
this figure go to http://www.currentprotocols.com/protocol/bi0106.
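The R statistic of Stekel et al. (2000) is a log-likelihood ratio that measures how far the EST counts for one gene depart from a uniform frequency across libraries. As it is commonly written, R = sum over libraries i of x_i * log(x_i / (N_i * f)), where x_i is the EST count for the gene in library i, N_i is the total number of ESTs in library i, and f is the pooled frequency (sum of x_i divided by sum of N_i). The following is a minimal sketch under that formulation; the function name and arguments are illustrative and this is not the code used to produce the figure.

```python
import math


def stekel_r(counts, library_sizes):
    """Log-likelihood ratio statistic R for one gene across several libraries.

    counts[i] is the number of ESTs for the gene (for example, one TC) in
    library i; library_sizes[i] is the total number of ESTs in library i.
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    total = sum(counts)
    if total == 0:
        return 0.0
    freq = total / sum(library_sizes)  # pooled frequency under the null hypothesis
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a zero count contributes nothing (x * log x tends to 0)
            r += x * math.log(x / (n * freq))
    return r
```

A gene whose counts are proportional to library size gives R = 0, and larger values of R indicate stronger evidence of differential representation among the libraries. Judging significance requires a threshold, which this sketch does not attempt to derive.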
Figure 1.6.12.
GBrowse. ESTs from the various plant Gene Index databases are aligned to the Arabidopsis
thaliana genome sequence.
Figure 1.6.13.
The home page for the Eukaryotic Gene Ortholog (EGO) database.
Figure 1.6.14.
A TOG alignment from the EGO database showing alignments of a possible transcription
factor from Atlantic salmon, C. posadasii, cattle, dog, Medicago, oilseed rape, and trout. (A)
A table listing all TC components of the group and their putative functions, followed by a
table of the BLAST results. (B) A snippet of the sequence alignments.
Figure 1.6.15.
The RESOURCERER home page allows users to select a variety of widely used microarray
resources for human, mouse, and rat for annotation or cross-platform and cross-species
comparisons. Users can also enter their own microarray platform for annotation by
providing GenBank accession numbers.
Figure 1.6.16.
Annotation for the Affymetrix HG U95Av2 provided by RESOURCERER includes Affymetrix
Probe IDs, clone names (when available), GenBank accessions, UniGene identifiers, DFCI
TC numbers for human identified through EGO, GO terms, annotated function, and physical
map location based on alignments of the DFCI THCs, with links to the appropriate
databases.
Figure 1.6.17.
RESOURCERER also allows microarray platforms to be compared. Here, annotations for
the Affymetrix HG U95Av2 and HG U95C human GeneChips are compared through EGO. Only
elements common to both datasets are shown (intersection). The annotation includes
Affymetrix Probe IDs, clone names when available, GenBank IDs with links to NCBI, and the
TGI TC numbers for human (THCs).
Figure 1.6.18.
A sample TC report for Aedes aegypti TC57832. (A) At the top of each record is a
FASTA-formatted sequence representing the consensus produced by the clustering and assembly
process. Immediately following that are predicted open reading frames and a graphical
representation of the EST and gene sequences that comprise the TC. (B) A table with
links to a variety of resources, including GenBank records and the source laboratory; it also
shows a prediction of the coding strand and the evidence used to support the assignment. (C)
Buttons provide links to expression summaries based on the libraries represented in each TC
assembly, SNPs identified in the TC, and predicted 70-mer oligos. Links to the top five results
of the searches against a protein database, GO term and EC number assignments, and links
to metabolic pathways in KEGG are also given.
Figure 1.6.19.
A schematic overview of the Gene Index Assembly process. For each species represented,
EST sequences are downloaded from the dbEST database at the NCBI
(http://www.ncbi.nlm.nih.gov/dbEST). Sequences are cleaned to remove contaminating vector,
adapter, mitochondrial, ribosomal, and other sequences wherever possible. Coding
sequences (annotated CDS regions) representing genes are parsed from GenBank records.
All EST and gene sequences are compared pairwise using megaBLAST and grouped based
on shared sequence similarity. Each cluster is then assembled at high stringency to produce
Tentative Consensus (TC) sequences, which are annotated by sequence similarity search
against a local copy of UNIPROT, and released through the DFCI Web site.
Table 1.6.1
Summary of the Gene Index Databases Mapped to Completed and Draft Genomes
Genome        Gene Index databases mapped
Mouse         HGI, MGI, RGI, BtGI, SsGI, DGI, CeGI, AtGI, ScGI
Rat           HGI, MGI, RGI, BtGI, SsGI, DGI, CeGI, AtGI, ScGI
Fly           DGI, HGI, CeGI, AtGI, ScGI
Worm          DGI, HGI, CeGI, AtGI, ScGI
Mosquito      AgGI, HGI, DGI, CeGI
Fugu          HGI, MGI, RGI, OlGI, XGI, ZGI
Arabidopsis   CGI, AtGI, LGI, StGI, GmGI, MtGI, McGI, OGI, ZmGI, TaGI, SbGI, HvGI
Yeast         ScGI, SpGI, CrGI, NcrGI, AnGI, DGI, HGI, CeGI, AtGI