NIH Public Access

Author Manuscript
Curr Protoc Bioinformatics. Author manuscript; available in PMC 2014 October 21.
Published in final edited form as:
Curr Protoc Bioinformatics. 2010 March; 0 1: Unit 1.6.1–Unit 1.6.36. doi:10.1002/0471250953.bi0106s29.

Using the DFCI Gene Index Databases for Biological Discovery


Corina Antonescu1, Valentin Antonescu1, Razvan Sultana1, and John Quackenbush1
1Dana-Farber Cancer Institute, Boston, Massachusetts

Abstract
The DFCI Gene Index Web pages provide access to analyses of ESTs and gene sequences for
nearly 114 species, as well as a number of resources derived from these. Each species-specific
database is presented using a common format with a home page. A variety of methods exist that
allow users to search each species-specific database. Methods implemented currently include
nucleotide or protein sequence queries using WU-BLAST, text-based searches using various
sequence identifiers, searches by gene, tissue and library name, and searches using functional
classes through Gene Ontology assignments. This protocol provides guidance for using the Gene
Index Databases to extract information.

Keywords
gene index database; gene index; databases; DFCI

INTRODUCTION
The DFCI Gene Index Web pages (http://compbio.dfci.harvard.edu/tgi/tgipage.html; Fig.
1.6.1) provide access to analyses of Expressed Sequence Tags (ESTs) and gene sequences
for over 114 species, as well as a number of resources derived from these. A summary of the
species currently represented can be found in the Appendix at the end of this unit; additional
species are regularly added to the collection based on the availability of EST data and user
requests. Each species-specific database is presented using a common format with a home
page similar to that shown in Figure 1.6.2. A variety of methods exist, listed immediately
below the heading “Search the Index by,” that allow users to search each species-specific
database. Methods implemented currently include searching of nucleotide or protein
sequences using WU-BLAST (see Basic Protocol 1), text-based searches using various
sequence identifiers (GenBank Accessions and Tentative Consensus (TC) identifiers),
searches by tissue and library names and gene names, and searches using functional classes
through Gene Ontology assignments (UNIT 7.2). In addition, a comprehensive annotation of
all ESTs in the database, based on the annotation of the TCs in which they are contained, is
provided.

The Eukaryotic Gene Ortholog database (EGO; see Basic Protocol 3), which uses DNA
sequence–based comparisons to identify tentative ortholog pairs by linking across the
various Gene Index databases, also provides a means of entry. In addition to providing for
sequence-, accession-, and gene name–based searches, the DFCI Gene Index is also cross-
referenced to the Online Mendelian Inheritance in Man (OMIM) database (UNIT 1.2),
allowing users to link to Tentative Ortholog Groups (TOGs), and from there to
representative sequences in the individual gene index databases. RESOURCERER, designed
to annotate and cross-reference mammalian orthologs, as well as the Genome Viewers, also
provide means of entry to the databases.

The Gene Index Databases are constructed within a species-specific framework, and users
should keep this in mind while using this resource. Although some general search utilities
exist (such as BLAST searches; see Basic Protocol 1), most searches begin with a selection
of a target species (see Alternate Protocols 1 to 5). Each species has a distinct home page
that can be reached through a URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=xxxx, where xxxx is the
appropriate code from the “Common name” column in Table 1.6.2 (see the Appendix at the end
of this unit). Within the Gene Index for
each species, the primary resources available are detailed reports for each of the component
sequences, including the assembled TCs and the individual ESTs, as well as expressed
transcripts (ETs), which are typically annotated CDS features in GenBank records. In most
of the following protocols, the Maize Gene Index will be used as an example; similar tools
and pages exist for the other databases, although the appropriate gene index name must be
substituted in the queries (see Table 1.6.2 for the full list).
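
The species-specific URL pattern described above can be sketched in a few lines of Python. This is only an illustration: the function name species_home_url is hypothetical, the species code must still be taken from Table 1.6.2, and the code does not check that the server is reachable.

```python
from urllib.parse import urlencode

# Base path copied from the URL pattern quoted in the text.
GIMAIN_URL = "http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl"


def species_home_url(common_name: str) -> str:
    """Build the home-page URL for one species-specific Gene Index.

    `common_name` is the code from the "Common name" column of Table 1.6.2
    (for example "maize"). The helper name is hypothetical.
    """
    code = common_name.strip()
    if not code:
        raise ValueError("a species code from Table 1.6.2 is required")
    # urlencode escapes any characters that are not URL-safe.
    return f"{GIMAIN_URL}?{urlencode({'gudb': code})}"


if __name__ == "__main__":
    print(species_home_url("maize"))
```

The same pattern applies to the other species; only the value of the gudb parameter changes.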

The completion of a number of eukaryotic genomes provides the opportunity to search the
Gene Index databases by their physical location. A list of available genomes can be found
by following the Genomic Maps link on the DFCI home page to the mapping page,
http://compbio.dfci.harvard.edu/tgi/map.html. A detailed guide to doing such searches is provided
below (see Basic Protocol 2).

TCs from one species can also be found through the mapping of possible orthologs. The
Eukaryotic Gene Ortholog (EGO) database catalogs tentative ortholog groups based on
shared DNA sequence using pairwise reciprocal best matches between species. Details on
using this resource are also included in the unit (see Basic Protocol 3).
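
To make the idea of pairwise reciprocal best matches concrete, the following Python sketch picks, for every sequence, its best-scoring match in each other species and keeps only the pairs that choose each other. It is a simplified illustration, not the EGO pipeline: the Hit record, the use of a single numeric score, and the toy identifiers are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record for one pairwise match; a higher score is better.
Hit = namedtuple("Hit", "query subject score")


def species_of(seq_id):
    """Identifiers are assumed to look like 'species|TCxxxxx'."""
    return seq_id.split("|", 1)[0]


def best_hits(hits):
    """Map (query, subject species) -> best-scoring subject in that species."""
    best = {}
    for hit in hits:
        target_species = species_of(hit.subject)
        if species_of(hit.query) == target_species:
            continue  # within-species matches are not ortholog candidates
        key = (hit.query, target_species)
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    return {key: hit.subject for key, hit in best.items()}


def reciprocal_best_pairs(hits):
    """Return sorted pairs (a, b) where a and b are each other's best match."""
    best = best_hits(hits)
    pairs = set()
    for (query, _), subject in best.items():
        if best.get((subject, species_of(query))) == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)


if __name__ == "__main__":
    toy = [
        Hit("maize|TC1", "rice|TC9", 950),
        Hit("maize|TC1", "rice|TC8", 400),
        Hit("rice|TC9", "maize|TC1", 940),
        Hit("rice|TC8", "maize|TC1", 400),
    ]
    print(reciprocal_best_pairs(toy))
```

In the toy data, rice|TC8 matches maize|TC1, but maize|TC1 prefers rice|TC9, so only the maize|TC1 and rice|TC9 pair is reported.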

The protocols below provide examples of ways to use the Gene Index Databases to extract
and explore the information they provide. The examples are not meant to be exhaustive, but
rather illustrative. Users should note that new features and new species are continuously
being added and that updated versions of these databases are released every four months
(February 1, June 1, and October 1).

BASIC PROTOCOL 1: IDENTIFYING A TENTATIVE CONSENSUS (TC) REPRESENTING A SPECIFIC SEQUENCE WITH BLAST
If one has either nucleotide or amino acid sequences, WU-BLAST 2.0 can be used to search
the collection of TCs, singleton ESTs, and singleton ETs from each species.

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

Files—FASTA-formatted sequence (APPENDIX 1B)
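
Because the BLAST server accepts only one sequence per search (see step 4 of this protocol), it can help to check a FASTA file before uploading it. The Python sketch below is one possible check; the function name read_single_fasta is hypothetical, and it tests only the basic record structure, not whether the residues are valid.

```python
def read_single_fasta(text):
    """Return (header, sequence) for a FASTA string holding exactly one record.

    Raises ValueError if the text has no record, more than one record, or
    sequence lines that come before the first '>' header.
    """
    header = None
    seq_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one FASTA record found")
            header = line[1:].strip()
        else:
            if header is None:
                raise ValueError("sequence data before the first '>' header")
            seq_lines.append(line)
    if header is None or not seq_lines:
        raise ValueError("no complete FASTA record found")
    return header, "".join(seq_lines)


if __name__ == "__main__":
    print(read_single_fasta(">query1 example\nACGTAC\nGTTGCA\n"))
```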

1. Open the BLAST search page (Fig. 1.6.3) in the DFCI Gene Indices Web site by
one of the following methods.

a. Connect to the Gene Indices home page
(http://compbio.dfci.harvard.edu/tgi/tgipage.html) and select the BLAST
hyperlink from the top menu bar under the Gene Indices pull-down menu
(Fig. 1.6.1).

b. Click on the BLAST link under the “Sequence Similarity Search” heading
on the DFCI Maize Gene Index home page
(http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=maize; Fig.
1.6.2) or the corresponding home page for another species.

c. Directly enter the BLAST search URL
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/Blast/index.cgi.

2. From the Program pull-down menu, select the search program to run: BLASTN
(UNIT 3.3) for a nucleotide query sequence or TBLASTN (UNIT 3.4) for a protein
query, which will be searched against the six-frame translation of the appropriate
TGI nucleotide database.


A SAGE tag is a short nucleotide sequence (typically 10 or 14 bp) that has
been found within an mRNA through the construction and sequencing of a
Serial Analysis of Gene Expression (SAGE) library (Velculescu et al.,
1995).

SAGE10 and SAGE14, also included in the Program pull-down menu, are
BLASTN searches using parameters optimized to search SAGE tags 10
and 14 nucleotides in length, respectively.

3. From the Database pull-down menu, select an appropriate target database; more than
one database can be selected at a time by holding down the Control key
while clicking within the list.

4. Scroll down to the middle of the page. Enter an appropriate FASTA-formatted
sequence either by uploading a file containing a single sequence using the Browse
button or by pasting it directly into the text box.

The TGI BLAST server does not presently allow multiple sequences to be
searched simultaneously, although such a utility is under development.
Note that although there is no a priori limit on sequence length, some
browsers may time out during searches of long sequences.

5. Users may also select options other than the defaults for various parameters,
including Alignments (using the pull-down menu right below the Program pull-down
menu), and Matrix, Filter, Expect, Cutoff, Strand, Descriptions, Wordlength,
Echofilter, Graphical Overview, and Ignore Hypotheticals (using the pull-down
menus near the bottom of the page).

Descriptions for these options can be found by clicking on each button.
Further discussion of the parameters can be found in UNIT 3.3.

6. Users are also provided with the option of supplying an e-mail address where they
will be notified when the search is completed.

Although most searches are completed quickly, search time depends on
the sequence length and databases selected, as well as machine use. Search
results are held for 48 hr and then discarded.

7. Standard BLAST search results are returned with alignments. Hyperlinks have been
added to each of the identified target sequences. Target sequence names are
specified in one of three formats depending on their source:

For TC:

species|TCxxxxx; ’THC’ for human

For EST:

species|est name

For ET:

species|NP[ET]xxxxxx|GBnucleotide accession| GBprotein accession.

8. Click on the name of the target sequence to retrieve the corresponding TC report.
Review the selected TC report (see Guidelines for Understanding Results).
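
The three target-name formats listed in step 7 can be told apart programmatically. The Python sketch below is a minimal illustration under stated assumptions: a name with two fields whose identifier is TC or THC followed by digits is treated as a TC, a name with four fields as an ET, and any other two-field name as an EST. The function name and the example accessions are hypothetical.

```python
import re

# TC identifiers: "TC" plus digits, or "THC" plus digits for human.
TC_RE = re.compile(r"^(THC|TC)\d+$")


def parse_target_name(name):
    """Classify a BLAST target name as TC, ET, or EST (heuristic sketch)."""
    fields = [field.strip() for field in name.split("|")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"unrecognized target name: {name!r}")
    species, ident = fields[0], fields[1]
    if len(fields) == 2 and TC_RE.match(ident):
        return {"type": "TC", "species": species, "id": ident}
    if len(fields) == 4:
        # species|NP...|GenBank nucleotide accession|GenBank protein accession
        return {
            "type": "ET",
            "species": species,
            "id": ident,
            "gb_nucleotide": fields[2],
            "gb_protein": fields[3],
        }
    if len(fields) == 2:
        return {"type": "EST", "species": species, "id": ident}
    raise ValueError(f"unrecognized target name: {name!r}")


if __name__ == "__main__":
    # The identifiers below are made up for illustration.
    for example in ("maize|TC12345", "human|THC678", "maize|est_clone_01",
                    "maize|NP123456|X00001| CAA00001"):
        print(parse_target_name(example))
```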

ALTERNATE PROTOCOL 1: SEARCHING BY TENTATIVE CONSENSUS, EXPRESSED TRANSCRIPTS, EXPRESSED SEQUENCE TAG, OR GENBANK IDENTIFIER
The TCs within each gene index can be searched using a variety of accessioned identifiers
that users may obtain from sources such as publications or other database
searches. Identifiers that can be used include the TC identifiers, GenBank accession
numbers, EST IDs, and Expressed Transcripts (ETs/NPs).

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Starting from a species home page (e.g., Fig. 1.6.2), click on the “Identifiers or
Keywords” link to open the search page.

For each species, the appropriate URL is of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gireport.pl?gudb=xxxx, where
xxxx is the common name from Table 1.6.2. Figure 1.6.4 shows the search
page for maize.

2. For the identifier chosen, complete the appropriate entry on the form and click the
GO button. Be aware that each of the three types of identifier has a slightly
different specification:

For TC:

TC#, the TC identifier, can be either THCxxxxxx (for human) or TCxxxxxxx
(any other species), or just the numerical part of the TC number, xxxxxxx.

For ET:
GB# can be either the GenBank accession of a sequence containing an
annotated CDS, or the corresponding GenPept protein sequence accession.

NP#, the DFCI accession for each CDS feature parsed from GenBank records,
can be of the form NPxxxxx where the HT/ET designators are used to maintain
continuity with the DFCI qcGene database
(http://compbio.dfci.harvard.edu/tgi/qcGene.html).

For EST:

GB# is the GenBank accession of an EST sequence.


EST ID is the EST number in each dbEST record.

CLONE Name is a cDNA clone identifier, such as an IMAGE ID, associated
with a particular sequence.

3. For TC number searches, the standard TC report (see Guidelines for Understanding
Results) is returned. For ET and EST searches, the search provides a sequence
report page, with links to the relevant TC report if the ET or EST sequence is not a
singleton.

Unlike accessions in some other databases, the TGI TC numbers are
retired with each build and a new set of accessions is provided. However,
a significant effort has been made to track TC identifiers from one release
to the next, and the header line for each TC FASTA sequence contains the
history of that assembly. Because this information is stored in a relational
table, users can search the database using an “expired” TC number and get
the current incarnation.

4. To search keyword(s) in annotations, enter the name to be searched as keyword(s)
or a Boolean expression and hit the GO button. Keep in mind that gene name
searches can be inaccurate, because many genes have multiple names and aliases and
the gene names in the TGI databases are not curated.
When an exact name search does not yield the expected result, more
general terms related to the target or alternative names should be tried. As
trusted databases with curated gene names become available, these will be
used to update the annotation in TGI.

The search returns a table with information about the query sequence,
including links to the TC in which that sequence is contained.
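
The TC identifier forms accepted in step 2 (THCxxxxxx for human, TCxxxxxxx for other species, or the bare number) can be normalized before a search. The Python sketch below shows one way to do this; the function name is hypothetical, and it assumes that the human Gene Index is referred to by the species code "human".

```python
import re

# Optional THC/TC prefix followed by the numerical part of the identifier.
TC_INPUT_RE = re.compile(r"\s*(THC|TC)?(\d+)\s*", flags=re.IGNORECASE)


def normalize_tc(identifier, species):
    """Return 'THC<digits>' for human and 'TC<digits>' for any other species.

    Accepts the full identifier or just its numerical part. The species
    code "human" is an assumption made for this example.
    """
    match = TC_INPUT_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a TC identifier: {identifier!r}")
    expected = "THC" if species.strip().lower() == "human" else "TC"
    given = (match.group(1) or "").upper()
    if given and given != expected:
        raise ValueError(
            f"prefix {given} does not match species {species!r}"
        )
    return expected + match.group(2)


if __name__ == "__main__":
    print(normalize_tc("12345", "maize"))
    print(normalize_tc("thc777", "human"))
```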

ALTERNATE PROTOCOL 2: SEARCHING BY GENE ONTOLOGY FUNCTIONAL CLASSIFICATION
Gene Ontology (GO; Ashburner et al., 2000; UNIT 7.2) terms provide classification for
proteins based on three classes: Molecular Function, Biological Process, and Cellular
Component. GO terms and Enzyme Commission (EC) numbers are assigned to TCs by
using BLASTX (UNIT 3.4) to compare the sequence to the SwissProt database and then
using a SwissProt-to-GO translation table provided by the GO Consortium
(http://www.geneontology.org). For inexact matches, the DFCI Gene Indices are conservative and
assign more general terms so as to avoid misclassification. It should be noted that because
GO is evolving, many genes and TCs have not as yet been assigned precise classifications.

The GO terms can be used within any species to find those TCs likely to have a specific
function or to be involved in particular processes.
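
The two-step assignment described above (best SwissProt match, then a SwissProt-to-GO translation table) can be sketched as a pair of dictionary lookups. The Python example below is an illustration only. The text does not say how a more general term is chosen for an inexact match, so the one-level parent lookup and the 90% identity threshold used here are assumptions, as are the function name and the placeholder accessions.

```python
def assign_go_terms(hits, swissprot_to_go, parents, min_identity=90.0):
    """Assign GO terms to TCs from their best SwissProt matches.

    hits:            TC id -> (SwissProt accession, percent identity)
    swissprot_to_go: SwissProt accession -> list of GO ids
    parents:         GO id -> a more general GO id (hypothetical table)
    min_identity:    arbitrary cutoff below which a match counts as inexact
    """
    assignments = {}
    for tc, (accession, identity) in hits.items():
        terms = swissprot_to_go.get(accession, [])
        if identity < min_identity:
            # Assumed rule: fall back to the more general parent term.
            terms = [parents.get(term, term) for term in terms]
        assignments[tc] = sorted(set(terms))
    return assignments


if __name__ == "__main__":
    # Placeholder accessions; GO:0003677 is DNA binding and GO:0003676 is
    # its more general parent, nucleic acid binding.
    hits = {"TC1": ("SP_ACC_1", 98.0), "TC2": ("SP_ACC_1", 60.0)}
    table = {"SP_ACC_1": ["GO:0003677"]}
    parents = {"GO:0003677": "GO:0003676"}
    print(assign_go_terms(hits, table, parents))
```

A TC whose best match has no entry in the translation table simply receives an empty list, which mirrors the remark that many TCs have not yet been assigned precise classifications.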

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Starting from a species home page (e.g., Fig. 1.6.2), click on the Gene Ontology
link under the “Functional Annotation and Analysis” heading to open the Gene
Ontology Assignments page.

Each of the species represented in the TGI has assigned GO terms (UNIT
7.2), and the GO assignments are summarized on a page accessible at a
URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/GObrowser.pl?species=Maize&gidir=zmgi
(see Fig. 1.6.5). This page lists the number of TCs in each class, and a bar
graph shows the fraction of all TCs and of those with GO assignments falling
into each class.
2. Clicking on the functional category of interest, such as “reproduction,” brings up a
GO browser (Fig. 1.6.6), which shows the subclasses that fall into that
category. Each line includes the current level, child ids, child GO terms, the
number of TCs at this level, and the number of subtree TCs. Clicking on underlined
entries in the last two columns brings up a list of TCs within that classification
along with EC numbers linked to the KEGG metabolic pathway database (Kanehisa
and Goto, 2000).

ALTERNATE PROTOCOL 3: SEARCHING BY RADIATION HYBRID MAP LOCATION (FOR HUMAN, MOUSE, AND RAT ONLY)
The TCs for human, mouse, and rat have been mapped to their corresponding genomes
using the radiation hybrid (RH) maps that are available. Although genome
sequence is rapidly becoming available for these species, the RH maps remain useful
because many of the markers have also been placed on linkage maps, and as such provide a
useful resource for candidate gene identification in genetic mapping studies. To produce the
mapped transcript set, the TCs are mapped to the appropriate genome using e-PCR (Schuler,
1997) and the marker data available from a variety of sources (see
http://compgen.rutgers.edu/EnhancedMaps/Default.aspx for a summary). The following method
will help find the TCs related to a specific map location.

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Open the URL http://compbio.dfci.harvard.edu/tgi/gi/xxxx/searching/rh_map.html,
where xxxx is hgi for human, mgi for mouse, or rgi for rat.

Figure 1.6.7 shows the RH page for mouse
(http://compbio.dfci.harvard.edu/tgi/gi/mgi/searching/rh_map.html).

2. Select the chromosome to view and set the number of records to be displayed on
each page.

3. Figure 1.6.8 shows the resulting table containing columns for TC#, Marker, 5′
marker position in TC, 3′ marker position in TC, Panel, Chromosome location, and
P-value (from the RH map).

Here, the 5′ and 3′ positions refer to where the mapped RH marker falls
within the mapped TC. Users will be most interested in examining the RH
map location, which provides relative coordinates on the chromosome.

ALTERNATE PROTOCOL 4: SEARCH GENE EXPRESSION BY LIBRARY ANNOTATION
TCs can be identified based on patterns of gene expression determined using the annotation
of the libraries from which the component ESTs were derived (Smith et al., 2001). It should
be noted that EST library information in dbEST is not curated, and as such may not be
correct or may be represented using nonstandard language. While an attempt has been made
to correct some inconsistencies in the representation of the tissues from which libraries were
derived, many remain.
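
The tissue-specific search in step 2a of this protocol amounts to computing, for each TC, the fraction of its ESTs that come from the chosen tissue and keeping the TCs above a cutoff. The Python sketch below illustrates that calculation on toy data; the function name and the input layout (one tissue label per EST) are assumptions, and the matching of tissue names is a simple case-insensitive comparison.

```python
def tissue_specific_tcs(tc_ests, tissue, min_percent):
    """Return rows (tc, n_from_tissue, n_total, fraction) for matching TCs.

    tc_ests:     TC id -> list of tissue labels, one label per component EST
    tissue:      tissue or organ of interest, for example "root"
    min_percent: a TC is kept when MORE than this percentage of its ESTs
                 come from the tissue, as in the "root" and 50% example
    """
    wanted = tissue.strip().lower()
    rows = []
    for tc, labels in sorted(tc_ests.items()):
        total = len(labels)
        if total == 0:
            continue  # a TC without ESTs cannot be scored
        from_tissue = sum(1 for label in labels if label.strip().lower() == wanted)
        fraction = from_tissue / total
        if fraction * 100 > min_percent:
            rows.append((tc, from_tissue, total, fraction))
    return rows


if __name__ == "__main__":
    toy = {
        "TC1": ["root", "root", "leaf"],
        "TC2": ["root", "leaf"],
        "TC3": ["leaf"],
    }
    for row in tissue_specific_tcs(toy, "root", 50):
        print(row)
```

Because library annotation is not curated, a real search would also need to reconcile the different spellings under which the same tissue can appear.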

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1a Access expression information through a URL of the form
http://compbio.dfci.harvard.edu/tgi/gi/xxxx/searching/xpress_search.html, where xxxx
is replaced by the gi_symbol (Table 1.6.2) representing the species of interest.

1b Alternatively, starting from a species home page (e.g., Fig. 1.6.2), click on the
“Libraries” link under the “Sequence Reports” heading. Figure 1.6.9 shows an
example from maize.

2a To identify TCs in a given tissue: In the top section of the page (see the “Search
for Tissue Specific Transcripts” section in Fig. 1.6.9), specify a tissue or organ
of interest and a minimum percentage for representation of ESTs from that
organ within a TC.

For example, specifying “root” and 50% will return all TCs in which
more than 50% of the ESTs are from root. Clicking on the Search
button returns a table formatted to include:

1st column: the TC number for each TC satisfying the specified
criteria, linked to a TC report.

2nd column: the number of ESTs from specified tissue or organ and the
total number of ESTs within that TC.

3rd column: the fractional representation of the specified tissue or
organ among ESTs in that TC.

4th column: the library catalog numbers (cat#s) corresponding to the
tissue or organ of interest with links to the appropriate library report.

5th column: the number of ESTs from each specific library within this
TC.

6th column: the number of ESTs from component libraries for all TCs.

7th column: the number of EST singletons from component libraries.

2b To identify TCs associated with a keyword: In the upper middle section of the
page (see the “Search cDNA Libraries by Keyword” section in Fig. 1.6.9), enter
one or more keywords.

A list of all libraries annotated with those terms is returned, with links
to the appropriate library reports.

2c To identify TCs associated with library identifiers: In the lower middle section
of the page (see the “Search cDNA Libraries by Library Identifier” section in
Fig. 1.6.9), enter the library identifier.

Users can also retrieve library reports by searching the Gene Index
databases using the appropriate library identifier parsed from GenBank
EST records. These are the “dbEST lib id” fields from GenBank, and it
should be noted that as these are not curated, some inconsistencies do
exist in the annotation. Users are provided with a list of all TCs linked
to the appropriate TC report containing one or more ESTs annotated as
coming from a particular library.

2d To compare TCs expressed in two different tissues or organisms: Compare
patterns of gene expression based on library annotation and identify TCs that are
statistically significantly differentially expressed in any one library relative to
others by clicking “Scan a list of TCs by Library Expression” (Fig. 1.6.9) at the
bottom of the page.

This produces a list of libraries from which the user can select those of
interest (Fig. 1.6.10).

Users should note that tissue designations come from the library
annotation provided in GenBank records, and, as such, the same tissue
may be represented by different tissue terms. Users can therefore select
multiple tissues for each of the two groups they wish to compare.
Clicking on the “Get Expression” button returns a graphical matrix
representation of expression in which each row represents a TC and the
columns represent the R stat, TC#, the number of ESTs in that TC, the
number of ESTs found in libraries selected in group A, and the number
of ESTs found in libraries selected in group B (Fig. 1.6.11).
The results illustrated in Figure 1.6.11 were obtained by selecting tissue
type “aerial, root, whole plant” for Group A, and “aleuron layer” for
group B.
Significant differential expression is identified using the “R statistic”
(Stekel et al., 2000); a large R for a TC indicates that there is a
significant bias toward one or more libraries in that TC.

A–B contains all TCs with more than one EST in A and zero or one
EST in B.
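
For readers who want to see how an R value of this kind can be computed, the Python sketch below assumes the log-likelihood-ratio form of the statistic described by Stekel et al. (2000): for one TC, R is the sum over libraries of x_i * ln(x_i / (N_i * f)), where x_i is the number of ESTs from library i in the TC, N_i is the total number of ESTs in library i, and f is the TC's overall frequency across all libraries. The function name, the choice of the natural logarithm, and the toy numbers are assumptions; the original paper should be consulted for the authoritative definition and for significance thresholds.

```python
import math


def r_statistic(counts, library_sizes):
    """Compute an R value for one TC across several libraries (sketch).

    counts:        ESTs from each library that fall in this TC (x_i)
    library_sizes: total ESTs sequenced from each library (N_i)
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(library_sizes)
    if total_count == 0 or total_size == 0:
        return 0.0
    # Overall frequency of the TC if expression were uniform across libraries.
    frequency = total_count / total_size
    r_value = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a library with no ESTs in the TC contributes nothing
            r_value += x * math.log(x / (n * frequency))
    return r_value


if __name__ == "__main__":
    print(r_statistic([5, 5], [1000, 1000]))    # evenly represented TC
    print(r_statistic([10, 0], [1000, 1000]))   # TC biased toward one library
```

A TC represented in proportion to library size gives a value near zero, and the value grows as the ESTs become concentrated in one or a few libraries.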

ALTERNATE PROTOCOL 5: SEARCHING BY METABOLIC PATHWAY


The Gene Index databases can also be searched by means of metabolic pathway maps.

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Starting from the appropriate Gene Index home page (e.g., Fig. 1.6.2), select the
“Metabolic Pathways” link under the “Functional Annotation and Analysis”
heading to produce a graphical representation of a number of metabolic pathways.

2. Select an appropriate pathway.

A list of TCs corresponding to elements in that pathway is returned. These
can be used to bring up TC reports corresponding to the individual
pathway elements.

BASIC PROTOCOL 2: USING THE GENOMIC MAPS WITH THE DFCI GENE INDICES
Completed or draft genome sequences are now available for a number of eukaryotic species,
including Anopheles gambiae, Bos taurus, Caenorhabditis elegans, Danio rerio, Drosophila
melanogaster, Homo sapiens, Gallus gallus, Macaca mulatta, Mus musculus, Pan
troglodytes, Rattus norvegicus, Arabidopsis thaliana, Oryza sativa (rice BACs from the
Japonica cultivar), Saccharomyces cerevisiae, and Schizosaccharomyces pombe. In addition,
alignments of rice TC sequences for both the Indica and Japonica cultivars are mapped to
the Indica contigs from the draft of that genome (Yu et al., 2002). For all maps, TCs are
approximately localized within relevant genomes using MegaBLAST or BLAT, with final
alignments performed using gap2, which incorporates splicing rules and is optimized for
transcript-to-genome alignments. Mapping information is stored in a relational database and
used to create user-friendly Web displays. Table 1.6.1 lists the genomes currently
represented and the Gene Indices that are mapped to each genome.

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Open the Genomic Maps page by one of the following methods.

a. Connect to the Gene Indices home page
(http://compbio.dfci.harvard.edu/tgi/tgipage.html) and select the “Genomic
Maps” hyperlink from the bar under the DFCI Gene Indices header (Fig.
1.6.1).

b. Directly enter the mapping page URL,
http://compbio.dfci.harvard.edu/tgi/map.html.

2. Select the genome to explore by clicking on the appropriate icon.

3. Select an individual chromosome or BAC, as appropriate, for which mapping
information is to be examined.

4. Examine the map.

A representative genome mapping display for Arabidopsis thaliana
chromosome 1 is shown in Figure 1.6.12. The display is divided into two
frames: the upper frame includes navigational and display tools, and the lower
shows a graphical representation of individual TC alignments
with the genome; putative exons are represented as colored boxes, introns
as dashed lines, and unmatched regions of the TC as open boxes. To aid in
navigating the display, individual species are distinguished by the color of
the mapped TCs. Wherever available, the putative annotation of the
genome is displayed at the top of the lower panel; in the case of human
and mouse, this is the current EnsEMBL annotation. Additional markers
may be added to these displays in the future, including genetic markers.
5. A region of the target chromosome can be selected either by clicking on the
approximate position in the upper left corner of the top panel, or by entering
approximate 3′ and 5′ coordinates. Placing the mouse over a TC in the lower panel
returns information about that TC in the upper panel. At the bottom of the upper
panel, the putative annotation is displayed, and on the right-hand side details of the
alignment of each putative exon are provided.
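
The exon and intron layout shown in the map display can be modeled very simply: once the putative exon blocks of a TC-to-genome alignment are known, the introns are the gaps between consecutive blocks. The Python sketch below illustrates this under stated assumptions; it is not how the Genome Viewer is implemented, the function name is hypothetical, and coordinates are taken to be 1-based and inclusive.

```python
def introns_from_exons(exons):
    """Derive intron intervals from the exon blocks of one alignment.

    exons: list of (start, end) genomic coordinates, assumed 1-based and
           inclusive, in any order.
    Returns a list of (start, end) intervals lying between adjacent exons.
    """
    for start, end in exons:
        if start > end:
            raise ValueError(f"exon start {start} is after its end {end}")
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("exon blocks overlap")
        if next_start - prev_end > 1:
            # The gap between two blocks is drawn as an intron.
            introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    print(introns_from_exons([(100, 200), (301, 400), (520, 600)]))
```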

BASIC PROTOCOL 3: USING EGO TO IDENTIFY ORTHOLOGOUS GROUPS


The Eukaryotic Gene Ortholog (EGO) database provides links between putatively
orthologous TCs, as well as an indexed list of TCs linked to disease-associated human genes
through the Online Mendelian Inheritance in Man (OMIM; UNIT 1.2) database. EGO is
based on the results of high-stringency pairwise sequence comparisons between the TCs and
singleton ETs from all TGI databases. Tentative Ortholog Groups (TOGs) are constructed
using a transitive, reflexive closure process based on the assumption of parsimony to
associate sequence-specific best hits, with the requirement that three sequences from
separate species must be represented. Some TCs may belong to multiple TOGs, although
TOGs containing significant overlap in their membership are merged. The result is that in
some instances paralogous sequences appear in the same TOG, particularly if a sequence
from a primitive eukaryote such as yeast is represented in the TOG. Each TOG is assigned a
unique accession number (TOG #) that can be used to reference the collection of sequences.
EGO has been a valuable tool for identifying orthologs of known genes as well as those
existing only as uncharacterized ESTs.
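
A rough feel for how best-hit pairs can be grouped into TOGs is given by the Python sketch below. It links sequences that are connected through a chain of best-hit pairs and keeps only the groups that contain sequences from at least three species, as required above. This is a simplification and an assumption on several counts: plain connected components stand in for the closure process, each sequence ends up in exactly one group, and the function name and identifiers are hypothetical.

```python
def species_of(seq_id):
    """Identifiers are assumed to look like 'species|TCxxxxx'."""
    return seq_id.split("|", 1)[0]


def tentative_ortholog_groups(pairs, min_species=3):
    """Group sequences linked by best-hit pairs; keep multi-species groups.

    pairs:       iterable of (seq_a, seq_b) best-hit pairs
    min_species: minimum number of distinct species required in a group
    """
    parent = {}

    def find(item):
        # Union-find lookup with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for seq_a, seq_b in pairs:
        root_a, root_b = find(seq_a), find(seq_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for seq in list(parent):
        groups.setdefault(find(seq), set()).add(seq)

    return sorted(
        sorted(members)
        for members in groups.values()
        if len({species_of(seq) for seq in members}) >= min_species
    )


if __name__ == "__main__":
    toy_pairs = [
        ("human|THC1", "mouse|TC2"),
        ("mouse|TC2", "rat|TC3"),
        ("maize|TC4", "rice|TC5"),
    ]
    print(tentative_ortholog_groups(toy_pairs))
```

In the toy data the human, mouse, and rat sequences form one group, while the maize and rice pair is dropped because it spans only two species.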

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Open the EGO page by one of the following methods.

a. Connect to the Gene Indices home page
(http://compbio.dfci.harvard.edu/tgi/tgipage.html) and select the EGO hyperlink
from the bar under the DFCI Gene Indices header (Fig. 1.6.1).

b. Directly enter the EGO URL, http://compbio.dfci.harvard.edu/tgi/ego/.

The main EGO page is returned (Fig. 1.6.13). On the EGO main
page, there are links to two search functions, Search the Ortholog
Database and Orthologs of Human Disease Genes.
2. Clicking on the Search the Ortholog Database button brings up a page that allows
searches to be done using nucleotide or protein searches through BLAST (UNITS
3.3 & 3.4), using TOG numbers, using gene names, or using TCs from any of the
species within EGO. The next page will be a list of orthologs. The title (Tentative
Ortholog xxxx) is a link to a more detailed report. A representative TOG report for
a putative transcription factor gene is shown in Figure 1.6.14A,B.

3. The Orthologs of Human Disease Genes link from the EGO home page allows
searches by OMIM Identifier, OMIM Locus ID, Gene name (such as CDK2, cyclin-dependent
kinase 2), GenBank Accession number, DFCI Accession Number (for
human only), or EGO Identifier.

4. TOG reports have three main parts as shown in Figure 1.6.14 (A,B). At the top is a
table listing the component TCs with putative annotation and links to the
component TC reports. There is also a graphic representing the connections
between the component sequences used for constructing the TOG. Below the TOG
is a table listing the results of all pair-wise searches contributing to the TOG, with
percent identity, match length, p-value, and asterisks marking reciprocal best hits.
At the bottom of each TOG report is a ClustalW alignment showing the
relationship between the aligned DNA sequences; this alignment can also be viewed
using JalView (http://www.jalview.org/).

BASIC PROTOCOL 4: USING RESOURCERER


RESOURCERER is a microarray resource annotation and cross-reference database—i.e., a
resource for microarray experiments that both provides annotation for widely used platforms
and makes it possible to compare gene expression from experiments in one species with
expression patterns discerned in the same or another species. Annotation for any microarray
platform or clone set included in RESOURCERER is provided through the appropriate Gene
Index database. Comparisons between different resources for one species are provided
through TGI, and comparisons across species are derived from EGO. At present, only
human, mouse, rat, zebrafish, Xenopus, cattle, C. elegans, and rice are represented in
RESOURCERER, but other species will be added as standard platforms come into
widespread use. Microarray resources represented in RESOURCERER include cDNA clone
sets, long oligo sets from Operon/Qiagen and Compugen/Sigma, and the Affymetrix
GeneChips for representative species.

Necessary Resources
Hardware—Computer with Internet access

Software—Web browser

1. Open the RESOURCERER page by one of the following methods.

a. Connect to the Gene Indices home page (http://compbio.dfci.harvard.edu/tgi/)
and select the RESOURCERER hyperlink from the bar under the
DFCI Gene Indices header (Fig. 1.6.1).

b. Directly enter the RESOURCERER URL,
http://compbio.dfci.harvard.edu/tgi/cgi-bin/magic/r1.pl.

The main RESOURCERER page is returned (Fig. 1.6.15) with
summary instructions on its use and links to a more extensive
README.
2. To obtain annotation for a single microarray resource already in the database, select
the resource using the drop-down menu, then click Submit. On the next page, select the
annotation fields of interest. Clicking the Get Table button returns annotations
similar to those shown in Figure 1.6.16, including:

The clone name associated with each element, if available

Either the Rearray ID assigned by the clone set developer or the Affymetrix
Probe ID, as appropriate

A representative GenBank Accession number

UniGene IDs

LocusLink IDs

Physical map location based on alignments of the DFCI TCs to the appropriate
draft genome sequence

The TC number for the appropriate species

TC numbers from the other mammalian species

Assigned GO Terms based on the TC assignment

Putative annotation

For mouse, a corresponding Mouse Genome Informatics (MGI) database
accession.

Where appropriate, elements in the table are hot-linked to an appropriate
database, including the NetAffx database for Affymetrix probe_ids.

3. To compare the elements represented in two microarray resources, on the main
RESOURCERER page (Fig. 1.6.15), select a first resource as Data Set A
and the “Compare to another resource” radio button. On the next page,
select the second resource (Data Set B). Also select whether the comparisons should
be made through EGO (and the TGI), which returns valid comparisons either
within a single species or between species, or through UniGene IDs, which is valid only for
comparisons within a single species. Finally, select the type of comparison:
whether the search should return those elements common to both Data Set A and
Data Set B (Intersection), those unique to Data Set A (A unique), or those unique to
Data Set B (B unique).

Clicking the Get Table button returns a cross-reference table that contains
annotations similar to those shown in Figure 1.6.17, again with appropriate
links to other databases.
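
The three comparison types in step 3 amount to set operations on a shared cross-reference key. The following Python sketch illustrates that logic only; it is not RESOURCERER's own code. The function names, the dictionary layout (array element ID mapped to a TOG or UniGene identifier), and the choice to skip unmapped elements are all assumptions made for illustration.

```python
from collections import defaultdict


def index_by_key(resource):
    """Group element IDs by cross-reference key, skipping unmapped elements (key is None)."""
    grouped = defaultdict(list)
    for element_id, key in resource.items():
        if key is not None:
            grouped[key].append(element_id)
    return grouped


def compare_resources(resource_a, resource_b, mode):
    """Compare two resources through a shared key (hypothetical layout: element ID -> key).

    mode is one of "intersection", "a_unique", or "b_unique".
    """
    a, b = index_by_key(resource_a), index_by_key(resource_b)
    if mode == "intersection":
        # Keys present in both data sets, with the elements from each side.
        return {k: (sorted(a[k]), sorted(b[k])) for k in a.keys() & b.keys()}
    if mode == "a_unique":
        return {k: sorted(a[k]) for k in a.keys() - b.keys()}
    if mode == "b_unique":
        return {k: sorted(b[k]) for k in b.keys() - a.keys()}
    raise ValueError(f"unknown comparison mode: {mode}")


# Made-up identifiers, for illustration only.
data_set_a = {"A1": "TOG1", "A2": "TOG2", "A3": None}
data_set_b = {"B1": "TOG1", "B2": "TOG3"}
shared = compare_resources(data_set_a, data_set_b, "intersection")
```

Keying the comparison on the cross-reference identifier rather than on the array element IDs is what lets two platforms with unrelated probe names be compared at all.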

GUIDELINES FOR UNDERSTANDING RESULTS


Examining a TC Report
The TC sequences are the central elements in the DFCI Gene Index databases. The TCs are
assembled from EST and gene sequences and represent likely transcripts encoded within a
particular genome. In that sense, the TCs are distinct from clusters in other approaches such
as UniGene in that alternative splice forms and gene family members are more likely to be
represented by separate objects in our databases. This has some advantages and
disadvantages based on one’s application of interest. In principle, with a large enough
collection of ESTs representing a wide variety of tissues, developmental stages,
and disease states, the Gene Index databases would reconstruct the entire transcriptome of a
particular organism.

Figure 1.6.18A,B,C shows a representative TC with the annotation and features provided in
each Gene Index. TCs are indexed by an accession of the form TCyyyy, where yyyy is a
number that is simply a sequential identifier assigned each time each database is rebuilt. For
each species-specific database, TC numbers are never reused. However, TC numbers are
tracked through subsequent builds so that users with a TC number from a previous release of
the database can get the current representation of that particular transcript.

TC reports can be accessed in a variety of ways, many of which are detailed below.
However, users can link directly to any TC report by entering a URL of the form
http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/tcreport.pl?species=xxx&tc=yyyy, where xxx is the
common species name from the first column of Table 1.6.2 (typed exactly as in the table,
without spaces) and yyyy is the TC number of interest. If a TC number from a
previous build is used, the corresponding TC from the current build is provided. Note that
for human, TC is replaced by THC.
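
As an illustration of the URL form described above, the following Python sketch builds a TC report link from a species name and a TC number. The function name tc_report_url is hypothetical, and the species name "rice" and TC number are example values only. The sketch does not handle the human THC case, because the exact URL form for it is not given here.

```python
from urllib.parse import urlencode

# Base address taken from the URL form given in the text.
TC_REPORT_BASE = "http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/tcreport.pl"


def tc_report_url(species, tc_number):
    """Return a direct TC report URL (hypothetical helper).

    species: common species name as listed in the first column of Table 1.6.2;
             any spaces are removed, as the text requires.
    tc_number: the numeric part (yyyy) of a TCyyyy accession.
    """
    species_key = species.replace(" ", "")
    query = urlencode({"species": species_key, "tc": tc_number})
    return f"{TC_REPORT_BASE}?{query}"


# Example values only; substitute a species name from Table 1.6.2.
url = tc_report_url("rice", 12345)
```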

Each TC report contains the following features:

1. At the top of Figure 1.6.18A is a FASTA-formatted sequence representing the
consensus assembly of its component EST and gene sequences. The FASTA header
includes the current TC number assigned to that assembly, as well as previous TC
numbers associated with it; this allows users to track TC numbers through various
rebuilds of the database. Wherever possible, predicted polyadenylation signals are
identified and highlighted in red within the sequence.

2. Immediately below the FASTA sequence is a graphical representation of the TC
with putative open reading frames (ORFs) predicted using NCBI’s ORF Finder,
ESTScan, FRAMEFINDER, and DIANAEST (Iseli et al., 1999; Hatzigeorgiou et
al., 2001). ORF Finder scans each of the six potential reading frames looking for
ORFs; the remaining programs use a variety of approaches to identify and correct
reading frame errors and to select the most likely ORF for each TC. The bars
representing the ORFs are active; clicking on them takes the user to a page from
which they can explore the properties of the predicted protein-coding sequence
(Fig. 1.6.18A).

3. Below the predicted ORFs is a map representing the individual sequences that
comprise the TC, showing their approximate position in the TC and their relative
lengths. Each sequence is represented by an arrow showing orientation, and paired
reads from the same clone are linked by a dotted line. Annotated mRNA sequences
are highlighted in pink. All sequences are numbered and indexed to a table of
linked identifiers, which immediately follows the map (Fig. 1.6.18A).

4. A table lists the individual sequences comprising the TC, indexed by numbers
appearing in the sequence map (Fig. 1.6.18B). Each row in the table represents a
particular EST or gene sequence and these are annotated with a source laboratory
(wherever possible), a sequence ID, a GenBank accession, clone name, the 5′
position in the TC, 3′ position in the TC, and source library annotation. Wherever
possible, these entries are linked to other databases or sources of information. The
sequence ID is linked to an EST report or ET report page at DFCI, the GenBank
accession is linked to a sequence record at NCBI, and the clone name, wherever
possible, is linked to a public clone repository. Immediately following the list is a
key to the clone source codes used, showing the laboratories from which the clone
sequences were generated; all ET sequences are coded as ETG. Links to
laboratories contributing a significant number of ESTs from a particular species can
be found on the home page for each species-specific Gene Index.

5. Assembling the TCs can produce consensus sequences with arbitrary orientation.
Using annotated information about the component sequences, including the
presence of mRNA sequences and the 3′ and 5′ orientation of the ESTs, an
attempt is made to identify the appropriate orientation of the TC, and the evidence
used for that determination is provided (Fig. 1.6.18B).

6. Alternative splice forms, identified through alignment of TCs within each TGI
database, can be found by clicking on the “Alternative Splice Forms” button (when
this information is available).

7. An expression summary, based on the libraries from which the ESTs were derived,
can be found by clicking on the Expression Summary button (Fig. 1.6.18C).

8. Putative gene identification is made using a variety of methods. First, TCs are
annotated using the names associated with any mRNA sequences they contain; this
is listed as the Putative ID for each TC. The consensus sequences are also searched
against a nonredundant protein database; the top five hits are listed and a controlled
vocabulary is used with these to assign a name to each (Fig. 1.6.18C).

9. The “GO annotation” lists assignments based on the Gene Ontology project’s
classifications (http://www.geneontology.org; UNIT 7.2). TCs are searched against
SwissProt and SwissProt-to-GO tables to provide conservative assignments based
on the level of sequence homology (Fig. 1.6.18C).

10. Potential orthologs are identified through the EGO database. A detailed description
of the EGO database is provided in Basic Protocol 3.

11. TC sequences are also mapped to a variety of completed eukaryotic genomes. At
the bottom of each TC report is a “Maps to” section with links to alignments with
draft or completed genomic sequences from model organisms.

12. TC reports may also contain buttons providing links to single-nucleotide
polymorphisms (SNPs) identified in the TC sequence, as well as predicted 70-mer
oligos for microarray projects (Fig. 1.6.18C).
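
Item 2 above notes that ORF Finder scans each of the six potential reading frames. The following Python sketch shows what a six-frame scan involves in its simplest form. It is not the algorithm used by ORF Finder, ESTScan, FRAMEFINDER, or DIANAEST. It assumes ORFs run from an ATG to the first in-frame stop codon; the function names and the min_length parameter are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for ATG-to-stop ORFs in one frame; end is exclusive and includes the stop."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def six_frame_orfs(seq, min_length=30):
    """Scan three forward and three reverse frames (simplified sketch).

    Returns (strand, frame, start, end) tuples. Coordinates for the "-" strand
    are relative to the reverse-complemented sequence, not the input.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if end - start >= min_length:
                    results.append((strand, frame, start, end))
    return results


# Toy sequence with a single short ORF (ATG AAA TTT GGG TAA) in forward frame 2.
found = six_frame_orfs("CCATGAAATTTGGGTAACC", min_length=15)
```

Unlike this sketch, the EST-oriented programs named in item 2 also try to detect and correct frameshift errors, which a plain codon scan cannot do.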

COMMENTARY
Background Information
The goal of any genome project is the identification and functional characterization of the
entire catalog of the genes encoded within a particular genome. Although genome
sequencing projects in human, mouse, Arabidopsis, and other eukaryotic species have
generated a wealth of data, identification of the genes encoded in the sequence and
assignment of function to these remains a significant challenge. Nowhere is that more
apparent than in the two completed drafts of the human genome (International Human
Genome Sequencing Consortium, 2001; Venter et al., 2001), where an independent analysis
of the competing annotations has found that many of the gene predictions, other than for
previously known genes, are disjoint (Hogenesch et al., 2001). Indeed, it is becoming
increasingly clear that the completion of a genome sequence is only a starting point and that
significant additional analysis is required before one can declare its annotation, and the
genome itself, complete.

The sequencing of ESTs continues to supply important insight into the transcribed genes in a
wide variety of species and has become a widely used approach to gene discovery and the
analysis of gene expression. ESTs are the most extensive available survey of the transcribed
portion of eukaryotic genomes; there are currently more than 10,000,000 ESTs in
GenBank, nearly 45% of which are human and 75% of which represent higher mammals
(human, mouse, rat, cattle, and pig; http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html).
For many species, ESTs remain the primary source of gene
sequence data and provide a basic survey of gene expression in various tissues, as well as in
various developmental and disease states. ESTs have also proven their value in genome
annotation as they provide experimental evidence for the presence of the genes, their
genomic structure, and patterns of expression.

However, analysis of ESTs presents a number of challenges, as each sequence typically
represents only a partial gene sequence and EST projects generally produce very large
numbers of redundant sequences. The DFCI Gene Indices (TGI; Quackenbush et al., 2001;
http://compbio.dfci.harvard.edu/tgi/publications/NAR_GeneIndex2001.pdf) attempt to overcome
these limitations by first clustering, then assembling ESTs to reconstruct the original gene
transcripts (mRNAs) as high-fidelity, virtual transcripts. While there are many other projects
that cluster ESTs, including UniGene (Boguski and Schuler, 1995) and IMAGEne (Cariaso
et al., 1999), and others that assemble EST clusters such as STACK (Christoffels et al.,
2001) and DoTS (http://www.allgenes.org), the DFCI Gene Indices have distinguished
themselves by producing high-fidelity EST assemblies for over 60 species (see Table 1.6.2).
The indices provide annotation and other ancillary information about the genes, their
structure, genomic localization (Quackenbush et al., 2000, 2001), and potential orthologs
and paralogs (Lee et al., 2002), and serve as a resource for comparative sequence analysis
(Tsai et al., 2001).

Obtaining data from the DFCI Gene Index databases through FTP—As an
alternative to using the Web site, flat-file versions of all of the DFCI databases are available
through FTP links on each Gene Index home page and the EGO page; RESOURCERER flat
files can be downloaded through the Web site. The Gene Index download files include:

1. A multi-FASTA file with TC sequences (annotation in the defline) and singleton
sequences.

2. A FASTA file, containing the complete set of TC sequences for that species with
TC identifiers from previous builds in the definition line.

3. A tab-delimited file containing the TC identifiers and the ESTs that comprise them.

The FASTA files can be used to create local BLAST databases, or used for other
purposes. The file that includes TC numbers and the list of ESTs can be used for
linking ESTs to the TCs that contain them.

4. A multi-FASTA file indexed by TC; information includes GO ID, GO Term, E.C.
Number, GO category.

5. Oligomer data is available for some gene indices.
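
The tab-delimited file in item 3 can be used to link ESTs to the TCs that contain them. The following Python sketch shows one way to build that lookup. The column layout is an assumption (TC identifier in the first column, component EST identifiers in the remaining columns) and should be checked against the downloaded file. The function name and the identifiers in the example are hypothetical.

```python
import csv
import io


def est_to_tc_map(handle):
    """Build an EST -> TC lookup from a tab-delimited file handle.

    Assumed layout (not confirmed by the text): one TC per line, with the TC
    identifier first and its component EST identifiers in the following columns.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        tc_id, est_ids = row[0], row[1:]
        for est_id in est_ids:
            if est_id:
                mapping[est_id] = tc_id
    return mapping


# In-memory stand-in for a downloaded file; identifiers are made up.
example = io.StringIO("TC100\tAA000001\tAA000002\nTC101\tAA000003\n")
lookup = est_to_tc_map(example)
```

For a real download, the same function can be called with a handle from open(path, newline="").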

Assembly of the Gene Index databases—The DFCI Gene Indices are assembled
independently for each species using a “divide and conquer” approach in which ESTs are
first placed in clusters based on sequence similarity and then assembled on a cluster-by-
cluster basis to produce Tentative Consensus (TC) sequences (Liang et al., 2000;
Quackenbush et al., 2001). A schematic overview of this process is shown in Figure 1.6.19
and a software implementation of the clustering and assembly tools used, TGICL, is freely
available (Pertea et al., 2003; http://compbio.dfci.harvard.edu/tgi/software/). TGICL is an
open-source pipeline for analysis of large EST and mRNA databases, in which sequences
are first clustered based on pairwise sequence similarity and then assembled to produce the
TC sequences.

Briefly, ESTs and coding gene sequences are first downloaded from dbEST and parsed from
GenBank records. The annotated CDS features in GenBank records are assigned NP (for
nucleotide-protein) identifiers to provide a unique accession for each coding DNA sequence;
some GenBank records have multiple annotated coding features. Sequences are trimmed to
remove vector, poly(A/T) tails, adaptor sequences, and contaminating bacterial sequences.
Clustering begins by indexing a multi-FASTA-formatted sequence database and performing
all-versus-all pairwise similarity searches. The authors use mgblast, a modified version of
the megablast program (Zhang et al., 2000), for this purpose. The mgblast program differs
from the original megablast program in that it produces a simple tab-delimited output, uses
specific output filtering options such as minimum overlap length and identity, and allows the
use of a dynamic offset within the database when performing incremental searches of
portions (slices) of the database against itself. Each line in the mgblast output represents
one identified overlap between two sequences in the database. The search results are sorted
in order by decreasing pairwise alignment score. The sequence overlaps are filtered using
user-defined criteria: the minimum overlap length (default 40 base-pairs), the minimum
percent identity for the overlap (default 95%), and the maximum mismatched overhang
allowed around the overlap (dynamically adjusted for long sequences and long overlaps; the
default value starts at 30 nucleotides). Based on the results of these similarity searches,
sequences are grouped into clusters using a transitive closure approach and a graph
representation in which the sequences are the graph nodes and the alignments represent
edges (Pertea et al., 2003); the resulting clusters represent the connected subgraphs within
the dataset.
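
The filtering and transitive-closure steps described above can be sketched in Python as follows. This is an illustration only, not the TGICL implementation. The tuple layout for an overlap is an assumption, the overhang limit is held at its starting default rather than adjusted dynamically, and a union-find structure is used here as one standard way to compute connected subgraphs.

```python
from collections import defaultdict

MIN_OVERLAP = 40      # default minimum overlap length (base pairs), from the text
MIN_IDENTITY = 95.0   # default minimum percent identity, from the text
MAX_OVERHANG = 30     # starting default for mismatched overhang; dynamic adjustment not modeled


def passes_filter(overlap_len, pct_identity, overhang):
    """Apply the three overlap criteria with their default thresholds."""
    return (overlap_len >= MIN_OVERLAP
            and pct_identity >= MIN_IDENTITY
            and overhang <= MAX_OVERHANG)


def cluster_sequences(sequence_ids, overlaps):
    """Group sequences into connected components (transitive closure).

    overlaps: iterable of (seq_a, seq_b, overlap_len, pct_identity, overhang);
    this layout is hypothetical and is not the mgblast output format.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, length, identity, overhang in overlaps:
        if passes_filter(length, identity, overhang):
            parent[find(a)] = find(b)  # each accepted overlap is an edge joining two nodes

    groups = defaultdict(set)
    for s in sequence_ids:
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())


# Made-up overlaps: e3/e4 is rejected because the overlap is shorter than 40 bp.
ids = ["e1", "e2", "e3", "e4"]
hits = [("e1", "e2", 120, 98.0, 5), ("e2", "e3", 60, 96.5, 10), ("e3", "e4", 35, 99.0, 0)]
clusters = cluster_sequences(ids, hits)
```

Note that e1 and e3 land in the same cluster without overlapping each other directly; that is the transitive part of the closure.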

This clustering stage is an important step if one then wants to assemble the expressed
sequences to reconstruct the transcripts they represent. Most sequence-assembly programs
were developed for genomic applications and face particular difficulty in dealing with the
challenges presented by ESTs, including extremely deep and uneven coverage from diverse
biological sources, low-quality sequences often without quality scores, relatively frequent
chimerism, and a moderately high rate of vector and adapter contamination. Further, while
many DNA sequence assembly programs assemble contigs from large numbers of
sequences, they can easily be overwhelmed by a very large unpartitioned dataset and
produce incorrect chimeric assemblies (Liang et al., 2000).

A systematic analysis of the performance of various sequence-assembly programs (Liang et
al., 2000) led the authors of this unit to select the Paracel Transcript Assembler (PTA), an
improved version of CAP3 (Huang and Madan, 1999), to independently assemble each
cluster. The assembly process produces a collection of Tentative Consensus sequences
(TCs) and a set of unassembled singletons. The TCs are annotated in preparation for release
on the DFCI Gene Index Web site. First, TCs are searched against a variety of DNA and
protein databases and high-scoring hits are used to provide putative functional annotation
using a controlled vocabulary. Hits to SwissProt records are used to assign Gene Ontology
(GO) terms and Enzyme Commission (EC) Numbers using a SwissProt to GO translation
table provided by the GO consortium (http://www.geneontology.org; UNIT 7.2). Open
reading frames in each sequence are assigned using NCBI’s ORF Finder, ESTScan,
FRAMEFINDER, and DIANAEST (Iseli et al., 1999; Hatzigeorgiou et al., 2001). The
orientation of each TC is determined using a consensus-based approach that uses the
orientation and identity of its component sequences. Additional information and annotation
for each sequence are provided through links to the EGO database (see below), to completed
genomic sequences, to other maps where available, and to other appropriate annotation
databases including the Mouse Genome Database at The Jackson Laboratory
(http://www.jax.org). The TGI home page is shown in Figure 1.6.1 and a representative TC report
in Figure 1.6.18A,B,C.

Evaluation of orthologous genes—Cross-referencing the available genomic data has a
number of important applications, including the identification of homologous genes in
eukaryotes. Gene homologs can be separated into two classes, orthologs and paralogs (Fitch,
1970). Orthologs are genes that are related by direct evolutionary descent while paralogs are
homologous genes that are the result of a duplication event within the same lineage. The
identification of orthologs is particularly important since these genes should play similar
developmental or physiological roles and should therefore share conserved functional and
regulatory domains. Further, the study of these genes in one organism can provide insight
into their function in others.

Makalowski and Boguski (1998) conducted what was at the time the most comprehensive
survey of eukaryotic orthologs available. Their dataset contained 1880 rodent-human
ortholog pairs and 470 sequences shared by all three species. Their analysis of both the
coding and noncoding regions indicated that not only are the coding regions highly
conserved in mammals at both the DNA and protein levels, but, more surprisingly, that the
flanking 5′ and 3′ noncoding regions are extremely well conserved and that the evolutionary
distances estimated for the 5′ and 3′ UTRs are similar and generally indistinguishable from
those for synonymous coding sites. This suggested to the authors of this unit that EST sequences,
derived primarily from the 3′ UTR, could be used to identify orthologs in closely related
species. Based on this observation, and the fact that the TC sequences within the DFCI Gene
Index databases represented the most comprehensive survey of eukaryotic gene sequences
available at the time, the authors began construction of the Eukaryotic Gene Ortholog (EGO;
Lee et al., 2002) database in 1999. EGO has allowed identification and cataloging of more
than 86,630 tentative orthologous groups in eukaryotes and it provides a tool for
cross-referencing other genomic resources, including commonly used resources for DNA
microarrays (Tsai et al., 2001).

Identification of Tentative Ortholog Groups (TOGs)—Tentative Consensus
sequences (TCs) and the singleton Expressed Transcripts (sETs) from each of the DFCI
Gene Indices are concatenated into a single multi-FASTA database, which is partitioned and
used in all-versus-all pairwise searches using mgblast. Matches scoring better than a
maximum e-value of 10^-10 are recorded. Reciprocal best hits, defined as pairs of sequences
from separate species that independently identify each other as a best match in their
respective species, are identified, and a transitive closure process using these pairs and
requiring sequences in three or more species is used to identify tentative ortholog groups (TOGs).
Multiple alignments of each of the TOG sequences are performed using ClustalW
(Thompson et al., 1994; UNIT 2.3) and are displayed at http://compbio.dfci.harvard.edu/tgi/ego/
with links to the individual TC reports; alignments can also be viewed using JalView
(http://www.jalview.org/). The individual sequences in EGO can be searched by
BLAST (UNITS 3.3 & 3.4) and all of the orthologous genes are cross-referenced to the
Online Mendelian Inheritance in Man (OMIM; UNIT 1.2) database. A representative TOG
is shown in Figure 1.6.14A,B.
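
The reciprocal best hit step can be illustrated with a short Python sketch. It is not the EGO implementation and covers only the pairing step, not the transitive closure or the three-species requirement. The hit tuple layout and function name are assumptions, and "best" is taken here to mean the lowest e-value below the 10^-10 cutoff.

```python
def reciprocal_best_hits(hits, max_evalue=1e-10):
    """Return the set of reciprocal best hit pairs between species.

    hits: iterable of (query, query_species, subject, subject_species, evalue);
    this layout is hypothetical and is not the mgblast output format.
    """
    best = {}     # (query, subject_species) -> (subject, evalue)
    species = {}  # sequence ID -> species
    for query, q_sp, subject, s_sp, evalue in hits:
        species[query] = q_sp
        species[subject] = s_sp
        # Keep only cross-species matches scoring better than the cutoff.
        if q_sp == s_sp or evalue >= max_evalue:
            continue
        key = (query, s_sp)
        if key not in best or evalue < best[key][1]:
            best[key] = (subject, evalue)

    pairs = set()
    for (query, _), (subject, _) in best.items():
        # The pair is reciprocal if the subject's best hit in the query's species is the query.
        back = best.get((subject, species[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


# Made-up hits: hs1 and mm1 choose each other; mm2 chooses hs1 but is not chosen back.
example_hits = [
    ("hs1", "human", "mm1", "mouse", 1e-50),
    ("mm1", "mouse", "hs1", "human", 1e-48),
    ("hs1", "human", "mm2", "mouse", 1e-20),
    ("mm2", "mouse", "hs1", "human", 1e-20),
    ("hs2", "human", "mm1", "mouse", 1e-5),
]
rbh = reciprocal_best_hits(example_hits)
```

Requiring the match in both directions is what separates a likely ortholog (hs1 and mm1) from a weaker paralog-like match (mm2), which finds hs1 but is not its best hit in return.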

Annotation of mammalian microarray resources—DNA microarray analysis
(Schena et al., 1995) has emerged as one of the most widely used techniques for assessment
of gene expression on a genomic scale, allowing tens of thousands of genes to be assayed in
a single experiment. However, the widespread use of this technique has resulted in a
proliferation of experimental platforms and reagents, making a comparison of results from
different experimental groups a significant challenge. An additional and possibly more
important need is the ability to make comparisons of gene expression patterns between
species. Analysis of expression in model organisms, particularly mouse and rat, has become
a fundamental tool for the study of human development and disease. Effective use of these
animal models with microarray assays requires the development of a convenient means of
identifying corresponding array elements between species and platforms. To address these
issues, the authors of this unit developed RESOURCERER
(http://compbio.dfci.harvard.edu/tgi/cgi-bin/magic/r1.pl), a utility designed to provide annotation
for and comparisons between widely used microarray platforms. RESOURCERER provides
information for the most widely used mammalian microarray gene resources, including the
Research Genetics Sequence Verified Human cDNA clone set, the BMAP and NIA mouse
clone sets, the DFCI Rat Gene Index cDNA collection, human and mouse 70-mer oligo sets
from Operon, and the Affymetrix Human, Mouse, and Rat GeneChip sets.

Literature Cited
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS,
Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE,
Ringwald M, Rubin GM, Sherlock G. Gene ontology: Tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet. 2000; 25:25–29. [PubMed: 10802651]
Boguski MS, Schuler GD. Establishing a human transcript map. Nat Genet. 1995; 10:369–371.
[PubMed: 7670480]
Cariaso M, Folta P, Wagner M, Kuczmarski T, Lennon G. IMAGEne I: Clustering and ranking of
I.M.A.G.E. cDNA clones corresponding to known genes. Bioinformatics. 1999; 15:965–973.
[PubMed: 10745985]
Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W. STACK: Sequence Tag
Alignment and Consensus Knowledgebase. Nucleic Acids Res. 2001; 29:234–238. [PubMed:
11125101]
Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970; 19:99–113.
[PubMed: 5449325]
Hatzigeorgiou AG, Fiziev P, Reczko M. DIANA-EST: A statistical analysis. Bioinformatics. 2001;
17:913–919. [PubMed: 11673235]
Hogenesch JB, Ching KA, Batalov S, Su AI, Walker JR, Zhou Y, Kay SA, Schultz PG, Cooke MP. A
comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes.
Cell. 2001; 106:413–415. [PubMed: 11534548]
Huang X, Madan A. CAP3: A DNA sequence assembly program. Genome Res. 1999; 9:868–877.
[PubMed: 10508846]
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
genome. Nature. 2001; 409:860–921. [PubMed: 11237011]
Iseli C, Jongeneel CV, Bucher P. ESTScan: A program for detecting, evaluating and reconstructing
potential coding regions in EST sequences. ISMB ’99 (Proceedings of the 7th International
Conference on Intelligent Systems for Molecular Biology). Menlo Park, Calif: AAAI Press; 1999.
p. 138–148.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;
28:27–30. [PubMed: 10592173]
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai T, Parvizi B, Cheung F, Antonescu V, White
J, Holt I, Liang F, Quackenbush J. Cross-referencing eukaryotic genomes: TIGR Orthologous
Gene Alignments (TOGA). Genome Res. 2002; 12:493–502. [PubMed: 11875039]
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J. An optimized protocol for
analysis of EST sequences. Nucleic Acids Res. 2000; 28:3657–3665. [PubMed: 10982889]
Makalowski W, Boguski MS. Evolutionary parameters of the transcribed mammalian genome: An
analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci USA. 1998;
95:9407–9412. [PubMed: 9689093]
Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F,
Parvizi B, Tsai J, Quackenbush J. TIGR Gene Indices clustering tools (TGICL): A software
system for fast clustering of large EST datasets. Bioinformatics. 2003; 19:651–652. [PubMed:
12651724]
Quackenbush J, Liang F, Holt I, Pertea G, Upton J. The TIGR gene indices: Reconstruction and
representation of expressed gene sequences. Nucleic Acids Res. 2000; 28:141–145. [PubMed:
10592205]
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White
J. The TIGR Gene Indices: Analysis of gene transcript sequences in highly sampled eukaryotic
species. Nucleic Acids Res. 2001; 29:159–164. [PubMed: 11125077]
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with
a complementary DNA microarray. Science. 1995; 270:467–470. [PubMed: 7569999]
Schuler GD. Sequence mapping by electronic PCR. Genome Res. 1997; 7:541–550. [PubMed:
9149949]
Smith TP, Grosse WM, Freking BA, Roberts AJ, Stone RT, Casas E, Wray JE, White J, Cho J,
Fahrenkrug SC, Bennett GL, Heaton MP, Laegreid WW, Rohrer GA, Chitko-McKown CG, Pertea
G, Holt I, Karamycheva S, Liang F, Quackenbush J, Keele JW. Sequence evaluation of four
pooled-tissue normalized bovine cDNA libraries and construction of a gene index for cattle.
Genome Res. 2001; 11:626–630. [PubMed: 11282978]
Stekel DJ, Git Y, Falciani F. The comparison of gene expression from multiple cDNA libraries.
Genome Res. 2000; 10:2055–2061. [PubMed: 11116099]
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res. 1994; 22:4673–4680. [PubMed: 7984417]
Tsai J, Sultana R, Lee Y, Pertea G, Karamycheva S, Antonescu V, Cho J, Parvizi B, Cheung F,
Quackenbush J. RESOURCERER: A database for annotating and linking microarray resources
within and across species. Genome Biol. 2001; 2:software0002.1–software0002.4. [PubMed:
16173164]
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science.
1995; 270:484–487. [PubMed: 7570003]
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA,
Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD,
Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson
C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M,
Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L,
Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh
J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R,
Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan
W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei
Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan
VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A,
Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H,
Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter
C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I,
Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets
R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C,
Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L,
Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L,
Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers
YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas
R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen
E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B,
Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato
S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J,
Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A,
Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K,
Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J,
Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J,
Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S,
Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T,
Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X. The sequence of
the human genome. Science. 2001; 291:1304–1351. [PubMed: 11181995]
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, Cao M, Liu J, Sun J,
Tang J, Chen Y, Huang X, Lin W, Ye C, Tong W, Cong L, Geng J, Han Y, Li L, Li W, Hu G,
Huang X, Li W, Li J, Liu Z, Li L, Liu J, Qi Q, Liu J, Li L, Li T, Wang X, Lu H, Wu T, Zhu M, Ni
P, Han H, Dong W, Ren X, Feng X, Cui P, Li X, Wang H, Xu X, Zhai W, Xu Z, Zhang J, He S,
Zhang J, Xu J, Zhang K, Zheng X, Dong J, Zeng W, Tao L, Ye J, Tan J, Ren X, Chen X, He J, Liu
D, Tian W, Tian C, Xia H, Bao Q, Li G, Gao H, Cao T, Wang J, Zhao W, Li P, Chen W, Wang X,
Zhang Y, Hu J, Wang J, Liu S, Yang J, Zhang G, Xiong Y, Li Z, Mao L, Zhou C, Zhu Z, Chen R,
Hao B, Zheng W, Chen S, Guo W, Li G, Liu S, Tao M, Wang J, Zhu L, Yuan L, Yang H. A draft
sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002; 296:79–92. [PubMed:
11935017]
Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J
Comput Biol. 2000; 7:203–214. [PubMed: 10890397]


APPENDIX: DFCI GENE INDICES


DFCI Gene Indices (Table 1.6.2) are a collection of species-based databases that assemble
Expressed Sequence Tags (ESTs) and Expressed Transcripts (ETs) into Tentative Consensus (TC)
sequences. Singletons (sETs and sESTs) are ET and EST sequences that were not incorporated
into any TC during assembly. Together, the TCs, sETs, and sESTs represent the potentially
unique sequences in the DFCI Gene Indices (TGI). As of the June 2009 release, 114 species
were represented by a Gene Index database. Each line in the table describes a single database
and gives its common name, species name, Gene Index name and release version, the total
number of TCs in the current release, and the numbers of singleton ETs and singleton ESTs.
For some of the Gene Indices, ESTs were pooled from dbEST for the genus rather than for a
single species. The table is divided into four groups: animals (42 species), plants (47),
protists (15), and fungi (10).
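
Because species names in Table 1.6.2 can contain spaces, a row is easiest to read from the right: the last five fields are always the Gene Index name, release, TCs, sETs, and sESTs. The following Python sketch illustrates this. The names `GeneIndexRow`, `parse_row`, and `unique_sequences` are hypothetical and not part of any DFCI tool, and the sketch assumes the common name is a single token (true for every row except "Scarlet bean", which would need special handling).

```python
from typing import NamedTuple


class GeneIndexRow(NamedTuple):
    """One line of Table 1.6.2 (hypothetical container, for illustration)."""
    common_name: str
    species_name: str
    gi_name: str
    release: str
    tcs: int
    singleton_ets: int
    singleton_ests: int


def parse_row(line: str) -> GeneIndexRow:
    """Split a whitespace-separated table row, reading fixed fields from the right."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}: {line!r}")
    # The last five fields are fixed; everything between the first token
    # and those five is the (possibly multi-word) species name.
    gi_name, release, tcs, s_et, s_est = fields[-5:]
    return GeneIndexRow(
        common_name=fields[0],
        species_name=" ".join(fields[1:-5]),
        gi_name=gi_name,
        release=release,
        tcs=int(tcs),
        singleton_ets=int(s_et),
        singleton_ests=int(s_est),
    )


def unique_sequences(row: GeneIndexRow) -> int:
    """TCs plus singletons: the potentially unique sequences in one Gene Index."""
    return row.tcs + row.singleton_ets + row.singleton_ests


human = parse_row("Human Homo sapiens HGI 17.0 328301 19585 736049")
print(human.gi_name, unique_sequences(human))
```

Summing TCs, sETs, and sESTs follows the definition given above of the potentially unique sequences in a Gene Index; it is not a count of genes.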

Table 1.6.2

Summary of DFCI Gene Indices (TGI), June 2009 Release

Common name Species name GI name Release TCs sETs sESTs


Animal (42 species)


A.aegypti Aedes aegypti AeGI 5.0 25627 110 14880
A.burtoni Astatotilapia burtoni AbGI 2.1 1284 51 6675
A.salmon Salmo salar AsGI 4.0 49630 369 40458
A.variegatum Amblyomma variegatum AvGI 2.0 490 1 1661
B.malayi Brugia malayi BmGI 5.1 2565 52 7477
B.microplus Boophilus microplus BmiGI 2.1 9851 39 4696
Bear Ursus americanus UaGI 4.1 4925 29 12719
Black_tick Ixodes scapularis IsGI 3.0 20932 23 17437
C.elegans Caenorhabditis elegans CeGI 9.0 17951 5035 7933
C.intestinalis Ciona intestinalis CinGI 5.0 31571 147 16349
Catfish Ictalurus punctatus Cfgi 7.0 5342 310 19908
Cattle Bos taurus BtGI 12.0 90392 491 110291
Chicken Gallus gallus GgGI 11.0 75408 860 112983
Cricket Laupala kohalensis LkGI 2.0 2562 0 6013
Dog Canis familiaris DogGI 7.0 32481 570 32481


Drosophila Drosophila melanogaster DGI 11.0 27100 1314 14124
Fathead_minnow Pimephales promelas PpGI 1.0 27048 0 29623
Frog Xenopus laevis XGI 10.1 56494 406 58231
Fugu Takifugu rubripes FGI 3.0 3961 550 7812
H.chilotes Haplochromis chilotes HchGI 1.1 2291 8 4140
H.red_tail_sheller Haplochromis sp red tail sheller HsGI 1.1 1942 0 4562
Honeybee Apis mellifera AMGI 5.0 12167 3202 9640
Human Homo sapiens HGI 17.0 328301 19585 736049
Hydra Hydra magnipapillata HmGI 1.0 15510 13 22276
Killifish Fundulus heteroclitus FhGI 4.0 9251 26 26933

Locust Locusta migratoria LomiGI 1.0 4355 19 7625


Macaca_cynomolgus Macaca fascicularis MfGI 1.0 12613 851 68561
Medaka Oryzias latipes OlGI 8.0 37198 230 30997
Mosquito Anopheles gambiae AgGI 9.0 22557 9603 18782
Mouse Mus musculus MGI 16.0 210249 10684 769704
O.volvulus Onchocerca volvulus OvGI 4.1 1205 30 3283
Pea_aphid Acyrthosiphon pisum AcpiGI 1.0 17251 8 17704
Pig Sus scrofa SsGI 13.0 104293 819 132636
R.appendiculatus Rhipicephalus appendiculatus RaGI 2.1 2642 24 4917
R.trout Oncorhynchus mykiss RtGI 7.0 40320 291 49408
Rat Rattus norvegicus RGI 14.0 76570 2497 105867
Red_flour_beetle Tribolium castaneum TrcaGI 1.0 8594 4514 15273
S.mansoni Schistosoma mansoni SmGI 7.0 19291 80 28026
Sheep Ovis aries OaGI 1.0 22305 311 28783
X.tropicalis Xenopus tropicalis XtGI 3.1 69590 87 81625
Zebra_finch Taeniopygia guttata TaguGI 1.0 14384 36 19443


Zebrafish Danio rerio ZGI 17.0 63667 829 85826
Plant (47 species)
A.cepa Allium cepa OnGI 2.0 4063 27 8155
Apple Malus x domestica MdGI 2.0 31789 38 26448
Aquilegia Aquilegia AqGI 2.1 13556 111 7278
Arabidopsis Arabidopsis thaliana AtGI 13.0 34155 8632 39039
Barley Hordeum vulgare HvGI 10.0 41206 172 39345
Bean Phaseolus vulgaris PhvGI 3.0 11940 142 9415
Beet Beta vulgaris BvGI 2.0 4784 132 12235
C.reinhardtii Chlamydomonas reinhardtii ChrGI 6.0 15554 119 26535
Clementine Citrus clementina CiclGI 2.0 32287 2 10229
Cocoa Theobroma cacao TcaGI 3.0 17424 24 31514
Cotton Gossypium CGI 10.0 50069 80 66367
Cotton_raimondii Gossypium raimondii GoraGI 1.0 9508 0 15383
Grape Vitis vinifera VvGI 6.0 33638 14825 30513
Ice_plant Mesembryanthemum crystallinum McGI 5.0 3627 66 6706


L.japonicus Lotus japonicus LjGI 5.0 21367 39 20996
Leafy_spurge Euphorbia esula EuesGI 1.0 10727 8 15761
Lettuce Lactuca sativa LsGI 3.0 12505 71 17309
Maize Zea mays ZmGI 19.0 112156 310 202621
Medicago Medicago truncatula MtGI 9.0 29273 11494 26696
Morning_glory Ipomoea nil IpniGI 1.0 11754 39 9721
Moss Physcomitrella patens subsp.patens PpspGI 2.0 30695 16670 19787
N.benthamiana Nicotiana benthamiana NbGI 3.0 5861 106 10160
Oilseed_rape Brassica napus BnGI 3.1 47634 59 42634
Orange Citrus sinensis CsGI 1.0 26081 26 72791

Peach Prunus persica PrpeGI 2.0 9633 42 16412


Pepper Capsicum annuum CaGI 4.0 14747 104 17568
Petunia Petunia hybrida PhGI 2.0 2230 42 6457
Pine Pinus PGI 7.0 34181 145 27538
Poplar Populus PplGI 4.0 49638 249 49764
Potato Solanum tuberosum StGI 12.0 31567 186 29619
Prickly_lettuce Lactuca serriola LaseGI 1.0 8047 0 13958
Rice Oryza sativa OsGI 17.0 77158 19426 85212
Robusta_coffee Coffea canephora CocaGI 1.0 7420 6 10206
Rye Secale cereale RyeGI 4.0 1471 78 4038
Scarlet bean Phaseolus coccineus PcGI 1.0 22518 1 50410
Sorghum Sorghum bicolor SbGI 9.0 23442 257 22326
Soybean Glycine max GmGI 14.0 70880 161 62508
Spruce Picea Sgi 3.0 42051 39 38404
Sugarcane Saccharum officinarum SoGI 2.2 40016 43 76529
Sunflower Helianthus annuus HaGI 6.0 20130 269 32717


Switchgrass Panicum virgatum PaviGI 1.0 52936 0 32286
T.versicolor Triphysaria versicolor TverGI 2.0 7165 3 5644
Tall_fescue Festuca arundinacea FaGI 2.0 6686 10 13241
Tobacco Nicotiana tabacum NtGI 5.0 37223 237 63781
Tomato Solanum lycopersicum LeGI 12.0 25764 201 20884
Triphysaria Triphysaria TriphGI 1.0 17442 0 17043
Wheat Triticum aestivum TaGI 11.0 91464 256 124732
Protist (15 species)
C.parvum Cryptosporidium parvum CpGI 5.1 3833 146 48
D.discoideum Dictyostelium discoideum DdGI 5.1 13819 520 3627
E.tenella Eimeria tenella EtGI 5.0 2992 201 5191
Leishmania Leishmania LshGI 6.0 16048 3702 5538
N.caninum Neospora caninum NcGI 5.1 2131 5 3878
P.berghei Plasmodium berghei PbGI 6.0 13433 202 10899
P.falciparum Plasmodium falciparum PfGI 9.0 8919 1773 4536
P.vivax Plasmodium vivax PvGI 2.1 4387 2839 2481


P.yoelii Plasmodium yoelii PyGI 5.1 7727 29 2770
S.neurona Sarcocystis neurona SnGI 6.0 1053 0 2596
T.brucei Trypanosoma brucei TbGI 5.1 5203 3930 1367
T.cruzi Trypanosoma cruzi TcGI 6.0 12319 186 2742
T.gondii Toxoplasma gondii TgGI 9.0 10184 197 15498
T.thermophila Tetrahymena thermophila TtGI 5.0 12363 15594 6709
T.vaginalis Trichomonas vaginalis TvGI 2.1 4740 24227 461
Fungi (10 species)
A.flavus Aspergillus flavus AfGI 5.0 4026 41 4070
A.nidulans Aspergillus nidulans AnGI 5.0 3662 6788 3106

C.posadasii Coccidioides posadasii CpoGI 2.1 6893 5522 2288


Cryptococcus Cryptococcus neoformans CrGI 8.0 8430 157 1447
F.verticillioides Fusarium verticillioides FvGI 8.0 8510 33 4807
M.grisea Magnaporthe grisea MgGI 6.0 13984 5964 10760
N.crassa Neurospora crassa NcrGI 4.1 10927 2092 1477
Potato_late_blight Phytophthora infestans PhinGI 1.0 12149 1 24321
S.cerevisiae Saccharomyces cerevisiae ScGI 4.0 4382 1443 197
S.pombe Schizosaccharomyces pombe SpGI 3.0 2449 2974 510

Figure 1.6.1.
The DFCI Gene Index home page at http://compbio.dfci.harvard.edu/tgi/tgipage.html has
links to the 114 species-specific databases currently available. Other resources available
include the Eukaryotic Gene Ortholog (EGO) database, the RESOURCERER utility for
annotating and cross-referencing mammalian microarray resources, and maps of the TCs to
completed genome sequences.

Figure 1.6.2.
The home page for the Maize Gene Index.

Figure 1.6.3.
The BLAST search page allows users to query any of the DFCI Gene Index databases, as
well as the EGO and RESOURCERER databases, using protein or DNA sequences.

Figure 1.6.4.
The main search page for the Maize Gene Index allows users to search the database using a
variety of accession numbers, including DFCI TC number, a Transcript Identifier, GenBank
Accessions, and clone identifiers.



Figure 1.6.5.
Gene Ontology (GO) terms and Enzyme Commission (EC) identifiers are assigned to the
TCs to provide functional annotation and to provide links to metabolic pathway databases.


Figure 1.6.6.
The GO browser shows the hierarchy of functional assignments for TCs identified as
members of a particular functional class.

Figure 1.6.7.
For human, mouse, and rat, TCs are mapped to their respective genomes using the available
radiation hybrid maps.

Figure 1.6.8.
RH Mapping Data. A snippet of Mouse TCs containing markers mapped to chromosome 1.

Figure 1.6.9.
The expression summary page allows each Gene Index database to be explored using
information on the libraries from which the ESTs were derived.

Figure 1.6.10.
The Expression Search page allows the frequency of ESTs from various libraries to be
compared in order to identify differentially expressed genes based on the sources of libraries
from which the ESTs were derived.

Figure 1.6.11.
An example of a library-based expression comparison. The relative abundance of ESTs is
depicted using a hot/cold (red/blue) color map and significant differences between classes of
ESTs are denoted by the associated R statistic (Stekel et al., 2000). For the color version of
this figure go to http://www.currentprotocols.com/protocol/bi0106.
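
To make the R statistic mentioned in this caption concrete, here is a minimal Python sketch. It assumes the log-likelihood-ratio form usually attributed to Stekel et al. (2000): for one gene, R is the sum over libraries of x_i * log(x_i / (N_i * f)), where x_i is the number of ESTs for that gene in library i, N_i is the total number of ESTs in library i, and f is the gene's overall frequency across all libraries. The function name `stekel_r` and the use of the natural logarithm are assumptions made for illustration; this is not the code behind the Expression Search page.

```python
import math
from typing import Sequence


def stekel_r(counts: Sequence[int], library_sizes: Sequence[int]) -> float:
    """Log-likelihood-ratio statistic for one gene across several EST libraries.

    counts[i] is the number of ESTs for the gene in library i and
    library_sizes[i] is the total number of ESTs in that library.
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    total_count = sum(counts)
    if total_count == 0:
        return 0.0  # gene never observed: no evidence either way
    # Overall frequency of the gene under the null hypothesis of equal expression.
    f = total_count / sum(library_sizes)
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a library with zero ESTs for the gene contributes nothing
            r += x * math.log(x / (n * f))
    return r


# Same proportion in both libraries: R is (numerically) zero.
print(stekel_r([10, 20], [1000, 2000]))
# All ESTs come from one of two equally sized libraries: R is large.
print(stekel_r([30, 0], [1000, 1000]))
```

Larger values of R indicate a stronger departure from uniform expression across libraries. Choosing a significance threshold is a separate step that this sketch does not cover.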

Figure 1.6.12.
Gbrowse. ESTs from the various plant Gene Index databases are aligned to the Arabidopsis
thaliana genome sequence.

Figure 1.6.13.
The home page for the Eukaryotic Gene Ortholog (EGO) database.

Figure 1.6.14.
A TOG alignment from the EGO database showing alignments of a possible transcription
factor from Atlantic salmon, C. posadasii, cattle, dog, Medicago, oilseed rape, and trout.
(A) A table listing all TC components of the group and their putative functions, followed
by a table of the BLAST results. (B) A snippet of the sequence alignments.

Figure 1.6.15.
The RESOURCERER home page allows users to select a variety of widely used microarray
resources for human, mouse, and rat for annotation or cross-platform and cross-species
comparisons. Users can also enter their own microarray platform for annotation by
providing GenBank accession numbers.

Figure 1.6.16.
Annotation for the Affymetrix HG U95Av2 provided by RESOURCERER includes Affymetrix
probe IDs, clone names (when available), GenBank accessions, UniGene identifiers, DFCI
TC numbers for human identified through EGO, GO terms, annotated function, and physical
map location based on alignments of the DFCI THCs, with links to the appropriate
databases.

Figure 1.6.17.
RESOURCERER also allows microarray platforms to be compared. Here, annotations for the
Affymetrix HG U95Av2 and HG U95C human GeneChips are compared through EGO. Only
elements common to both datasets are shown (intersection). The annotation includes
Affymetrix probe IDs, clone names when available, GenBank IDs with links to NCBI, and the
TGI TC numbers for human (THCs).

Figure 1.6.18.
A sample TC report for Aedes aegypti TC57832. (A) At the top of each record is a
FASTA-formatted sequence representing the consensus produced by the clustering and assembly
process. Immediately following that are the predicted open reading frames and a graphical
representation of the EST and gene sequences that comprise the TC. (B) A table with links
to a variety of resources, including GenBank records and the source laboratory; it also
shows a prediction of the coding strand and the evidence used to support the assignment. (C)
Buttons provide links to expression summaries based on the libraries represented in each TC
assembly, SNPs identified in the TC, and predicted 70-mer oligos. Links to the top five
results of the searches against a protein database, GO term and EC number assignments, and
links to metabolic pathways in KEGG are also given.


Figure 1.6.19.
A schematic overview of the Gene Index Assembly process. For each species represented,
EST sequences are downloaded from the dbEST database at the NCBI
(http://www.ncbi.nlm.nih.gov/dbEST). Sequences are cleaned to remove contaminating vector,
adapter, mitochondrial, ribosomal, and other sequences wherever possible. Coding
sequences (annotated CDS regions) representing genes are parsed from GenBank records.
All EST and gene sequences are compared pairwise using megaBLAST and grouped based
on shared sequence similarity. Each cluster is then assembled at high stringency to produce
Tentative Consensus (TC) sequences, which are annotated by sequence similarity search
against a local copy of UniProt, and released through the DFCI Web site.
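
The grouping step in this overview can be sketched as single-linkage clustering over pairwise hits. The Python example below is an illustration only. It assumes the megaBLAST hits have already been filtered down to pairs of sequence identifiers that pass whatever similarity criteria the pipeline applies (those criteria are not reproduced here), and it takes single linkage as one plausible reading of "grouped based on shared sequence similarity". The function name `cluster_sequences` and the identifiers are hypothetical. Sequences that end up alone correspond to the singletons (sETs and sESTs); multi-member groups are the clusters that would go on to assembly.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_sequences(
    ids: List[str], hits: Iterable[Tuple[str, str]]
) -> Tuple[List[List[str]], List[str]]:
    """Group sequences by transitive pairwise similarity using union-find.

    Returns (clusters, singletons): clusters have two or more members,
    singletons are sequences with no qualifying hit to any other sequence.
    """
    parent: Dict[str, str] = {seq_id: seq_id for seq_id in ids}

    def find(x: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[str, List[str]] = defaultdict(list)
    for seq_id in ids:
        groups[find(seq_id)].append(seq_id)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


# Hypothetical identifiers: three ESTs and one ET link up; two ESTs stay alone.
seq_ids = ["EST1", "EST2", "EST3", "ET1", "EST4", "EST5"]
pairwise_hits = [("EST1", "EST2"), ("EST2", "ET1"), ("ET1", "EST5")]
clusters, singletons = cluster_sequences(seq_ids, pairwise_hits)
print(clusters)
print(singletons)
```

Single linkage is simple but can chain unrelated sequences together through a shared repeat or chimeric EST. The high-stringency assembly described in the caption is the step that guards against such chaining.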

Table 1.6.1

Summary of the Gene Index Databases Mapped to Completed and Draft Genomes

Genome Gene indices mapped to that genome


Human HGI, MGI, RGI, BtGI, ScGI

Mouse HGI, MGI, RGI, BtGI, SsGI, DGI, CeGI, AtGI, ScGI
Rat HGI, MGI, RGI, BtGI, SsGI, DGI, CeGI, AtGI, ScGI
Fly DGI, HGI, CeGI, AtGI, ScGI
Worm DGI, HGI, CeGI, AtGI, ScGI
Mosquito AgGI, HGI, DGI, CeGI
Fugu HGI, MGI, RGI, OlGI, XGI, ZGI
Arabidopsis CGI, AtGI, LGI, StGI, GmGI, MtGI, McGI, OGI, ZmGI, TaGI, SbGI, HvGI
Yeast ScGI, SpGI, CrGI, NcrGI, AnGI, DGI, HGI, CeGI, AtGI
