Difference between Prokaryotic and Eukaryotic Cells
Though these two classes of cells are quite different, they do possess some common
characteristics. For instance, both possess cell membranes and ribosomes, but the similarities
end there. The complete list of differences between prokaryotic and eukaryotic cells is
summarized as follows:
Feature                     Prokaryotes                                   Eukaryotes
Type of cell                Always unicellular                            Unicellular and multicellular
Cell size                   0.2–2.0 μm in diameter                        10–100 μm in diameter
Cell wall                   Usually present; chemically complex           When present, chemically simple
Nucleus                     Absent; DNA lies in a nucleoid region         Present
Ribosomes                   Present; smaller (70S)                        Present; larger (80S)
DNA arrangement             Circular                                      Linear
Mitochondria                Absent                                        Present
Cytoplasm                   Present, but membrane-bound organelles absent Present, with membrane-bound organelles
Endoplasmic reticulum       Absent                                        Present
Plasmids                    Present                                       Very rarely found
Lysosomes and centrosomes   Absent                                        Present
Cell division               Binary fission                                Mitosis
Flagella                    Smaller in size                               Larger in size
Reproduction                Asexual                                       Both asexual and sexual
Examples                    Bacteria and Archaea                          Plant and animal cells
An organism’s genetic material, DNA (deoxyribonucleic acid), is a sequence of four kinds
of nucleotides whose order encodes information. The linear arrangement of DNA elements
and their partition into chromosomes is referred to as genome organisation.
“Genome organisation” can also refer to the arrangement of DNA sequences
inside the nucleus and the three-dimensional structure of chromosomes.
Eukaryotic genomes are linear and follow the Watson-Crick double-helix structural model.
They are contained within chromosomes, in which the DNA is wound around histone
proteins to form units known as nucleosomes. The protein-coding genes in eukaryotic
genomes are organised into exons and introns, which are the coding sequences and the
intervening (non-coding) sequences, respectively; the introns are removed from the RNA
transcript before it is translated.
Eukaryotic Genome Configuration
The eukaryotic genome configuration consists of protein-coding regions, gene regulatory
regions, gene-related sequences, and intergenic (extragenic) DNA, which comprises
low copy number and moderate or high copy number repetitive sequences. The configuration
is shown in the flowchart below.
Eukaryotic Genome Organisation
● Eukaryotic genomes have two characteristics that pose a significant
information-processing problem.
1. The typical multicellular eukaryotic cell has a substantially larger genome than a
prokaryotic cell.
2. Many genes can only be expressed in certain types of cells due to cell specialisation.
● In addition to the reported 35,000 genes, the human genome includes a huge amount of
DNA that does not direct the synthesis of RNA or protein.
● Eukaryotic DNA is intricately organised. It is not only associated with proteins,
forming the DNA-protein complex known as chromatin, but this chromatin is also
structured at a higher level than the DNA-protein complex in prokaryotes.
● Eukaryotic cells have a significantly higher concentration of DNA in their nuclei than
prokaryotic cells.
The C-value is the amount of DNA in the haploid genome of an organism. It varies
over a very wide range, with a general increase in C-value with complexity of organism
from prokaryotes to invertebrates, vertebrates, and plants. The C-value paradox is
basically this: how can we account for the amount of DNA in terms of known function?
Very similar organisms can show a large difference in C-values (e.g. amphibians). The
amount of genomic DNA in complex eukaryotes is much greater than the amount
needed to encode proteins. For example, mammals have 30,000 to 50,000 genes, but
their genome size (or C-value) is 3 × 10^9 bp. Dividing that genome size by an average
gene size of 3000 bp gives a “gene capacity” far larger than the gene count:
\[
\frac{3 \times 10^{9}\ \text{base pairs}}{3000\ \text{base pairs (average gene size)}} = 1 \times 10^{6}\ \text{(“gene capacity”)} \tag{4.5.1}
\]
Drosophila melanogaster has about 5000 mutable loci (~genes). If the average size
of an insect gene is 2000 bp, then
\[
\frac{1 \times 10^{8}\ \text{base pairs}}{2 \times 10^{3}\ \text{base pairs}} = 5 \times 10^{4} = 50{,}000\ \text{(“gene capacity”)} \tag{4.5.2}
\]
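The two gene-capacity calculations above can be checked with a few lines of Python. This is only a sketch: the function name gene_capacity and the variable names are made up for illustration, and the inputs are the genome sizes, average gene sizes, and gene counts quoted in the text.

```python
def gene_capacity(genome_size_bp: int, average_gene_size_bp: int) -> int:
    """Number of average-sized genes that would fit end to end in the genome."""
    return genome_size_bp // average_gene_size_bp


# Values quoted in the text (variable names are illustrative only).
mammal_capacity = gene_capacity(3 * 10**9, 3000)
fly_capacity = gene_capacity(1 * 10**8, 2 * 10**3)

# How many times larger the capacity is than the observed gene count.
mammal_excess = mammal_capacity / 50_000  # upper estimate of 50,000 mammalian genes
fly_excess = fly_capacity / 5_000         # about 5000 mutable loci in Drosophila

print(f"Mammal: capacity {mammal_capacity:,}, {mammal_excess:.0f}x the gene count")
print(f"Fly:    capacity {fly_capacity:,}, {fly_excess:.0f}x the gene count")
```

Under these figures both genomes could hold many more genes than they appear to contain, which is the numerical core of the C-value paradox.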
Genomes
Learning Objectives
1. Define “genome” as “the complete set of genes or genetic material present in a cell or
organism”
2. Contrast the size and organization of prokaryotic versus eukaryotic genomes
3. Explain why genome size does not predict organismal complexity or phylogeny, and vice
versa
4. Describe the content of the human and mammalian genomes
5. Describe the current and potential applications of massively parallel DNA sequencing
technology
One of the defining and essential features of life is genetic material. An
organism’s genome is the complete set of all genes and genetic material that is present
in that organism or individual cell. Often we think of genes in terms of protein-coding
genes, or genes that are transcribed into mRNAs and then translated into protein;
however, genomes consist of a lot more than just protein coding genes. In addition,
the features of prokaryotic and eukaryotic genomes differ in terms of both size and
content.
The image below shows the different ranges of genome sizes in different taxonomic
groups of life. Note that, in general, prokaryotic genomes are smaller than eukaryotic
genomes. However, eukaryotic genome sizes vary wildly and are not linked to
organismal “complexity.” Refer to this diagram as you read on about the differences
and similarities between prokaryotic and eukaryotic genomes.
Prokaryotic Genomes
▪ The genomes of Bacteria and Archaea are compact; essentially all of their DNA is
“functional” (contains genes or gene regulatory elements).
▪ The sizes of prokaryotic genomes range from about 1 million to 10 million base pairs of
DNA, usually in a single, circular chromosome.
▪ Genes in a biochemical pathway or signaling pathway are often clustered together and
arranged into operons, where they are transcribed as a single mRNA that is translated to
make all the proteins in the operon.
▪ The size of prokaryotic genomes is directly related to their metabolic capabilities – the
more genes, the more proteins and enzymes they make.
Eukaryotic Genomes
▪ The genome sizes of eukaryotes are tremendously variable, even within a taxonomic group
(the so-called C-value paradox).
▪ Eukaryotic genomes are divided into multiple linear chromosomes; each chromosome
contains a single linear duplex DNA molecule.
▪ Eukaryotic genes in a biochemical or signaling pathway are not organized into operons;
one mRNA makes one protein.
▪ Many eukaryotic genes (most human genes) are split; non-coding introns must be
removed and the exons spliced together to make a mature mRNA. Introns are
“intervening” sequences in genes that do not code for proteins. The image below shows a
zoomed-in region of a gene highlighting the alternating exons and introns.
▪ The multiple exons in a eukaryotic gene can be spliced in different ways to make multiple
mRNAs and multiple proteins from a single gene (alternative splicing).
▪ The majority of human genes can be spliced in two or more different ways. Therefore, the
actual number of human proteins far exceeds the number of protein-coding genes.
▪ Alternative splicing often results in “tissue-specific” versions of the same gene, where one
splice variant is present in, for example, cardiac muscle, while a different splice variant of
the same gene is present in skeletal muscle. The image below shows one hypothetical gene
with 3 different possible proteins depending on which exons are included in the final
mRNA.
One gene is transcribed, then spliced in different ways to produce mRNAs that encode related proteins from
different exon combinations. http://www.genome.gov/Images/EdKit/bio2j_large.gif
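The idea that one gene can yield several mRNAs can be sketched in Python by enumerating exon combinations. The gene, its exon names, and the choice of which exons are optional are all hypothetical, and the model is deliberately simplified: it assumes that the first and last exons are always kept and that every subset of the optional exons is possible, whereas real splicing is regulated and only some combinations occur.

```python
from itertools import combinations


def splice_variants(exons, optional):
    """Return every mRNA that keeps all constitutive exons and any subset
    of the optional ones, preserving the original exon order."""
    variants = []
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            variants.append([exon for exon in exons if exon not in skipped])
    return variants


# Hypothetical gene: exon1 and exon4 are always kept; exon2 and exon3 may be skipped.
exons = ["exon1", "exon2", "exon3", "exon4"]
variants = splice_variants(exons, optional=["exon2", "exon3"])

for mrna in variants:
    print("-".join(mrna))
```

With two optional exons this toy model gives four possible mRNAs from a single gene, which illustrates why the number of distinct proteins can exceed the number of protein-coding genes.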
What accounts for the variation in genome size?
There is no good correlation between the body size or complexity of an organism and
the size of its genome. Eukaryotic genomes sequenced thus far have between ~6,000
and ~30,000 protein-coding genes, or less than 10-fold variation in the number of
genes. The human genome has about 21,000 protein-coding genes (recently revised to
as few as ~19,000 genes). Therefore, the 10,000-fold variation in eukaryotic genome
size is due mostly to varying amounts of non-coding DNA.
Here is a quick comparison of the genome size and predicted gene number for a
sampling of eukaryotes:
Organism                                   Chromosome number (diploid)   Base pairs     Predicted number of genes
Saccharomyces cerevisiae (budding yeast)   16                            1.25 × 10^7    6,275 (~5,800 functional)
Drosophila melanogaster (fruit fly)        8                             1.65 × 10^8    13,600
Caenorhabditis elegans (nematode worm)     6                             1.0 × 10^8     ~19,000
Canis familiaris (dog)                     78                            2.4 × 10^9     ~19,000
Homo sapiens (human)                       46                            3.3 × 10^9     ~19,000
Mus musculus (mouse)                       40                            3.4 × 10^9     ~20,000
Oryza sativa (rice)                        24                            4.66 × 10^8    ~37,000
It’s very interesting to note that humans have about the same number of genes as the
microscopic nematode worm, C. elegans, and fewer genes than rice.
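The comparison can be made concrete by computing gene density (genes per million base pairs) from the table values. The sketch below uses only the base-pair and gene counts listed above; the helper name genes_per_megabase and the abbreviated organism labels are illustrative choices.

```python
def genes_per_megabase(genome_size_bp: float, gene_count: int) -> float:
    """Predicted genes per million base pairs of genome."""
    return gene_count / (genome_size_bp / 1e6)


# (organism, genome size in base pairs, predicted number of genes), from the table.
genomes = [
    ("S. cerevisiae", 1.25e7, 6275),
    ("D. melanogaster", 1.65e8, 13600),
    ("C. elegans", 1.0e8, 19000),
    ("C. familiaris", 2.4e9, 19000),
    ("H. sapiens", 3.3e9, 19000),
    ("M. musculus", 3.4e9, 20000),
    ("O. sativa", 4.66e8, 37000),
]

densities = {name: genes_per_megabase(bp, genes) for name, bp, genes in genomes}

# Fold variation across the sample: gene number versus genome size.
gene_fold = max(g for _, _, g in genomes) / min(g for _, _, g in genomes)
size_fold = max(bp for _, bp, _ in genomes) / min(bp for _, bp, _ in genomes)

for name, density in densities.items():
    print(f"{name:16s} {density:7.1f} genes per Mb")
print(f"Gene number varies {gene_fold:.1f}-fold; genome size varies {size_fold:.0f}-fold")
```

In this sample the gene count varies by less than 10-fold while genome size varies by more than 100-fold, so gene density falls sharply as genomes get larger.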
What’s in the human genome?
The content of the human genome, from Wikipedia
▪ Protein-coding (exon) DNA sequences comprise less than 2% of the human genome.
▪ Introns make up just over 1/4 of the human genome.
▪ Transposable elements and DNA derived from them make up about 1/2 of the human
genome. Transposable elements are essentially “parasitic” DNA that resides in a host
genome, taking up space in the genome but not contributing useful or functional sequences
to the genome. They are the DNA transposons, LTR retrotransposons, LINEs and SINEs.
▪ Because they are parasitic DNA elements, transposable elements are extremely valuable
for studying evolutionary relationships. If a transposable element “invades” an organism’s
genome, then it is likely to remain in that genome as the population evolves and when
speciation occurs. If the same transposable element is present in the same location in the
genomes of two different species, this is strong evidence that those two species share a
recent common ancestor who also had the transposable element in its genome.
▪ One family of SINEs, called the Alu element, is a 300-nucleotide sequence that is present
in over 1 million copies in human and chimpanzee genomes.
▪ Segmental duplications are relatively long (> 1 kb; kb = 1,000 bp) segments of DNA that
have become duplicated. These duplications create copies of genes that can mutate and
acquire new functions. Gene families (e.g., alpha- and beta-hemoglobin, myoglobin) arose
this way.
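To get a feel for these proportions, the fractions above can be converted into base pairs. The sketch below assumes the 3.3 × 10^9 bp human genome size quoted earlier in these notes and treats “less than 2%”, “just over 1/4”, and “about 1/2” as round numbers, so the results are rough estimates only; all variable names are illustrative.

```python
GENOME_BP = 3_300_000_000   # human genome size quoted earlier in these notes

# Approximate fractions stated in the list above (rounded for illustration).
fractions = {
    "protein-coding exons (upper bound)": 0.02,
    "introns": 0.25,
    "transposable elements": 0.50,
}
component_bp = {name: fraction * GENOME_BP for name, fraction in fractions.items()}

# Alu elements: a 300-nucleotide sequence present in over 1 million copies.
ALU_LENGTH_BP = 300
ALU_COPIES = 1_000_000      # lower bound; the text says "over 1 million"
alu_bp = ALU_LENGTH_BP * ALU_COPIES
alu_fraction = alu_bp / GENOME_BP

for name, bp in component_bp.items():
    print(f"{name}: about {bp:.2e} bp")
print(f"Alu elements: at least {alu_bp:.2e} bp, about {alu_fraction:.1%} of the genome")
```

Even at the lower bound of one million copies, this single SINE family accounts for several times more DNA than all protein-coding exons combined.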
Is the human genome 80% “junk” or 80% functional?
Data and papers published by the ENCODE project, a systematic survey of human
genome variation and activity, from chromatin modifications to transcription, claimed
that, contrary to previous belief, fully 80% of the human genome has at least some
biochemical activity, such as transcription (The ENCODE Project Consortium, 2012).
Indeed, many small RNAs called microRNAs (miRNAs), which have important
regulatory roles, are transcribed from intergenic regions. However, these miRNAs and
other regulatory RNAs comprise less than 1% of the human genome, and other studies
have indicated that only 10% of the genome appears to be subject to some evolutionary
constraint (review by Palazzo and Gregory, 2014).
About RefSeq
The Reference Sequence (RefSeq) collection provides a comprehensive, integrated,
non-redundant, well-annotated set of sequences, including genomic DNA, transcripts,
and proteins. RefSeq sequences form a foundation for medical, functional, and
diversity studies. They provide a stable reference for genome annotation, gene
identification and characterization, mutation and polymorphism analysis
(especially RefSeqGene records), expression studies, and comparative analyses.
RefSeq genomes are copies of selected assembled genomes available in GenBank.
RefSeq transcript and protein records are generated by several processes including:
● Computation
o Eukaryotic Genome Annotation Pipeline
o Prokaryotic Genome Annotation Pipeline
● Manual curation
● Propagation from annotated genomes that are submitted to members of
the International Nucleotide Sequence Database Collaboration (INSDC)
Scope
NCBI provides RefSeqs for taxonomically diverse organisms including archaea,
bacteria, eukaryotes, and viruses. Reference sequences are provided for genomes,
transcripts, and proteins. Some targeted loci projects are included in RefSeq,
including RefSeqGene, fungal ITS, and rRNA loci. New or updated records are added
to the collection as data become publicly available.
Data Access and Availability
RefSeq is accessible via BLAST, Entrez, and the NCBI FTP site (RefSeq releases,
and RefSeq Genomes). Information is also available in NCBI's Assembly, Genomes
and Gene resources, and for some organisms additional information is available in
NCBI's genome browser Genome Data Viewer. Special properties have been
defined to facilitate Entrez-based retrieval. See also: Entrez Query Hints
Distinguishing Features
The main features of the RefSeq collection include:
● non-redundancy
● explicitly linked nucleotide and protein sequences
● updates to reflect current knowledge of sequence data and biology
● data validation and format consistency
● distinct accession series (all accessions include an underscore '_' character)
● ongoing curation by NCBI staff and collaborators, with reviewed records
indicated
RefSeq Production Processes and Policy
RefSeq records are derived from publicly available sequence data; varying levels of
validation, additional annotation, and manual curation are applied to the RefSeq
record. NCBI Reference Sequences are provided through the separate processes
described below.
This page provides a brief overview of the RefSeq production processes. See also:
NCBI Handbook, RefSeq chapter; NCBI Handbook, Genome Annotation chapter;
RefSeq Prokaryotic Genomes; Eukaryotic genome annotation policy.
Collaboration
For some organisms, the annotated RefSeq records are provided by collaborating
groups. Depending on the organism, collaborations may be established at the whole-
genome level, or smaller collaborations may be established for gene families.
Whole-genome collaborations include records for Saccharomyces
cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis
elegans. When such a collaboration is established, the primary sequence level review
is carried out by the collaborating group. Processing of annotated genome data
submitted by collaborations is semi-automated; data is provided by a collaborating
group and validated at NCBI to detect obvious errors (e.g., the annotated CDS location
is not capable of encoding the provided protein), and to apply the annotation in a more
uniform way. NCBI processing may integrate additional information such as
nomenclature or other descriptive data. Additional manual curation of these records is
not carried out by NCBI staff. NCBI may update the records to correct a general format
problem, but otherwise these records are only updated when the collaborating group
provides an update. Should errors be reported, then NCBI staff relays that information
to the collaborating group.
RefSeq records that are supplied by collaboration include an indication of the
submitting group on the record, as a direct submission Reference citation, in the
COMMENT block, or both. The RefSeq status (e.g., REVIEWED) is either indicated
by the collaborating group, or is inferred based on the supplied annotation.
Genome Assembly & Annotation Pipeline
NCBI provides annotation for some assembled genomic sequence data, including
human, mouse, rat, honey bee, chicken, and chimpanzee, among others. This pipeline is
automated and data is refreshed periodically. The model RefSeq records produced
from this pipeline have a distinguishing accession prefix (XM, XR, XP), are derived
from the genomic sequence, have varying levels of transcript or protein homology
support, and are not subject to further manual curation.
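The accession conventions mentioned above (every RefSeq accession contains an underscore, and model records from the annotation pipeline carry an XM, XR, or XP prefix) can be expressed as a small check. This is an illustrative sketch, not an NCBI tool: the function name is invented, the example accessions are made up, and the check relies only on the two rules stated in this text.

```python
# Prefixes that the text attributes to model records from the annotation pipeline.
MODEL_PREFIXES = {"XM", "XR", "XP"}


def describe_accession(accession: str) -> str:
    """Classify an accession string using only the rules stated in the text."""
    if "_" not in accession:
        return "not a RefSeq-style accession (no underscore)"
    prefix = accession.split("_", 1)[0]
    if prefix in MODEL_PREFIXES:
        return f"RefSeq model record (prefix {prefix})"
    return f"RefSeq-style accession (prefix {prefix})"


# Made-up accession strings, used purely as examples.
for accession in ["XM_000001.1", "NM_000001.1", "AB000001.1"]:
    print(accession, "->", describe_accession(accession))
```

A real pipeline would use the full prefix list published by NCBI rather than this two-rule approximation.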
Summary
NCBI’s Reference Sequence (RefSeq) database is a collection of taxonomically
diverse, non-redundant and richly annotated sequences representing naturally
occurring molecules of DNA, RNA, and protein. Included are sequences from
plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq is
constructed wholly from sequence data submitted to the International Nucleotide
Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq is a
synthesis of information integrated across multiple sources at a given time. RefSeqs
provide a foundation for uniting sequence data with genetic and functional information.
They are generated to provide reference standards for multiple purposes ranging from
genome annotation to reporting locations of sequence variation in medical records.
The RefSeq collection is available without restriction and can be retrieved in several
different ways, such as by searching or by available links in NCBI resources, including
PubMed, Nucleotide, Protein, Gene, and Map Viewer, searching with a sequence via
BLAST, and downloading from the RefSeq FTP site.
GENCODE
The goal of the GENCODE project is to identify and classify all gene
features in the human and mouse genomes with high accuracy based on biological
evidence, and to release these annotations for the benefit of biomedical research and
genome interpretation.
Background
The National Human Genome Research Institute (NHGRI) launched a public research
consortium named ENCODE, the Encyclopedia Of DNA Elements, in September
2003, to carry out a project to identify all functional elements in the human genome
sequence. After a successful pilot phase on 1% of the genome, the scale-up to the
entire genome is now underway. The Wellcome Sanger Institute was awarded a
grant to carry out a scale-up of the GENCODE project for integrated annotation of
gene features.
Having been involved in successfully delivering the definitive annotation of functional
elements in the human genome, the GENCODE group were awarded a second
grant in 2013 in order to continue their human genome annotation work and expand
GENCODE to include annotation of the mouse genome. A third grant was awarded in
2017 for the continued improvement of the annotation of the human and mouse
genomes.
The GENCODE gene sets are used by the entire ENCODE consortium and by many
other projects (e.g., Genotype-Tissue Expression (GTEx), The Cancer Genome Atlas
(TCGA), International Cancer Genome Consortium (ICGC), NIH Roadmap
Epigenomics Mapping Consortium, Blueprint Epigenome Project, Exome Aggregation
Consortium (ExAC), Genome Aggregation Database (gnomAD), 1000 Genomes
Project and the Human Cell Atlas (HCA)) as reference gene sets.
Current GENCODE Goals
The aims of the current GENCODE phase running from 2017 to 2021 are:
● To continue to improve the coverage and accuracy of the GENCODE human and
mouse gene sets by enhancing and extending the annotation of all evidence-based
gene features in the human genome at a high accuracy, including protein-coding
loci with alternatively spliced variants, non-coding loci and pseudogenes.
The process to create this annotation involves manual curation, computational
analysis and targeted experimental approaches.
The human and mouse GENCODE resources will continue to be available to the
research community through regular releases of the Ensembl genome browser, and the
UCSC genome browser will continue to present the current release of the GENCODE
gene set.
Participants, PI & Co-PIs
● Paul Flicek (Lead PI), EMBL European Bioinformatics Institute, Cambridge, UK
● Roderic Guigo (PI), Centre de Regulació Genòmica (CRG), Barcelona, Catalonia,
Spain
● Manolis Kellis (PI), Massachusetts Institute of Technology (MIT), Boston, USA
● Mark Gerstein (PI), Yale University, New Haven, USA
● Benedict Paten (PI), University of California, Santa Cruz, California, USA
● Michael Tress, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
● Jyoti Choudhary, Institute of Cancer Research (ICR), London, UK
UCSC
The University of California Santa Cruz (UCSC) Genome Browser (genome.ucsc.edu)
is a popular Web-based tool for quickly displaying a requested portion of a genome at
any scale, accompanied by a series of aligned annotation “tracks”. The annotations—
generated by the UCSC Genome Bioinformatics Group and external collaborators—
display gene predictions, mRNA and expressed sequence tag alignments, simple
nucleotide polymorphisms, expression and regulatory data, phenotype and variation
data, and pairwise and multiple-species comparative genomics data. All information
relevant to a region is presented in one window, facilitating biological analysis and
interpretation. The database tables underlying the Genome Browser tracks can be
viewed, downloaded, and manipulated using another Web-based application, the
UCSC Table Browser. Users can upload data as custom annotation tracks in both
browsers for research or educational use. This unit describes how to use the Genome
Browser and Table Browser for genome analysis, download the underlying database
tables, and create and display custom annotation tracks.
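As a rough illustration of what a custom annotation track can look like, the sketch below builds a small track in BED format, one of the text formats the Genome Browser accepts for custom tracks: a "track" header line followed by one line per feature giving chromosome, start, end, and name. The function name, track name, and coordinates are all hypothetical, and the sketch covers only the first four BED columns.

```python
def bed_custom_track(name, description, features):
    """Build the text of a minimal BED custom track.

    features: iterable of (chrom, start, end, label) tuples. BED start
    positions are 0-based and end positions are exclusive.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical regions of interest; the coordinates are invented for illustration.
track_text = bed_custom_track(
    "demo",
    "Example regions",
    [("chr1", 1000, 2000, "regionA"), ("chr2", 500, 900, "regionB")],
)
print(track_text, end="")
```

The resulting text could be saved to a file or pasted into the browser's custom track form; the Genome Browser documentation remains the authority on the header attributes and optional columns.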