Final Doc Genomics
Candidate's Declaration
Certificate
Acknowledgement
1. Introduction
2. BASICS OF GENOMICS
CONCLUSION
REFERENCES
Chapter 1
1. Introduction
Genomics, the study of the complete set of genetic material within an organism, has emerged
as a cornerstone in modern biology, revolutionizing our understanding of life at the molecular
level. The unraveling of the human genome and advancements in genome sequencing
technologies have paved the way for unprecedented insights into genetic diversity, evolution,
and the molecular basis of diseases.
The field also includes studies of intragenomic (within the genome) phenomena such
as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one
trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the
genome.
As genomics continues to evolve, so does the need for sophisticated tools and methodologies
to interpret and harness the vast amount of genomic data generated. In this context, Artificial
Intelligence (AI) and Machine Learning (ML) have emerged as transformative technologies,
offering novel solutions to the challenges posed by the complexity and scale of genomic
information.
The implications of genomics in healthcare are profound. By understanding the genetic basis
of diseases, researchers and clinicians can identify genetic markers associated with various
conditions, enabling more accurate diagnosis and prognosis. Moreover, the advent of
precision medicine leverages genomic information to tailor treatments to an individual's
genetic makeup, increasing the efficacy and minimizing adverse effects.
A gene is the basic physical and functional unit of heredity. Genes are made up of DNA.
Some genes act as instructions to make molecules called proteins. However, many genes do
not code for proteins. In humans, genes vary in size from a few hundred DNA bases to more
than 2 million bases. An international research effort called the Human Genome Project,
which worked to determine the sequence of the human genome and identify the genes that it
contains, estimated that humans have between 20,000 and 25,000 genes.
DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other
organisms. Nearly every cell in a person’s body has the same DNA. Most DNA is located in
the cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be
found in the mitochondria (where it is called mitochondrial DNA or mtDNA). Mitochondria
are structures within cells that convert the energy from food into a form that cells can use.
Cells are the basic building blocks of all living things. The human body is composed of
trillions of cells. They provide structure for the body, take in nutrients from food, convert
those nutrients into energy, and carry out specialized functions. Cells also contain the body’s
hereditary material and can make copies of themselves.
Cells have many parts, each with a different function. Some of these parts, called organelles,
are specialized structures that perform certain tasks within the cell.
Genetics and genomics are two terms that are often incorrectly used interchangeably.
Genetics is the study of single genes and their role in the way traits or conditions are passed
from one generation to the next. Genomics is a term that describes the study of all parts of an
organism’s genes.
Genetics
Genetics is a scientific study of the effects that genes — which are units of heredity — have
on an individual. Genes hold information in the molecule DNA, which is a string of
chemicals called bases. The order, or sequence, of bases on the string determines the meaning
of a genetic message. The message contains instructions for making proteins, which, in turn,
direct cells and functions of the body. Humans have thousands of genes that are packaged
into 23 pairs of chromosomes.
Genomics
All of the genes of an organism taken together, plus all of the sequences and information
contained therein, are called the genome. The human genome consists of all of the thousands
of genes and the 23 chromosome pairs. Genomics includes study of how the genes within the
genome interact with each other and with the individual’s environment.
Researchers may conduct genetic or genomic tests. Genetic testing is when the researchers
investigate a single piece of genetic information for specific bits of DNA with a known
function. By investigating a single known entity, scientists may isolate the underlying causes
of the specific genetic variant in question. Genomic testing is broader, with no target.
Genomic testing involves investigating large sections of genetic material and information,
from which broad or specific conclusions may be drawn.
Some examples of genetic or inherited disorders include cystic fibrosis, Down syndrome,
hemophilia, Huntington’s disease, phenylketonuria (PKU) and sickle-cell disease.
Some disorders and complex diseases that have been studied in the field of genomics include
asthma, cancer, diabetes and heart disease. These diseases are caused by a combination of
genetic and environmental factors, rather than simply a single genetic defect. The study of
genomics has provided the medical community with new diagnostic tools and therapies for
these complex diseases.
Chapter 2
2. BASICS OF GENOMICS
Genomics, a field within molecular biology, involves the study of the entire set of genetic
material within an organism, encompassing its DNA, genes, and non-coding regions. The
term "genome" refers to the complete set of genetic instructions that an organism inherits
from its parents. Genomics extends beyond individual genes, aiming to understand the
structure, function, evolution, and interactions of all the genetic elements within a given
species.
The iconic double helical structure of DNA, discovered by James Watson and Francis Crick
in 1953, serves as the molecular backbone of genomics. Each DNA strand is composed of
nucleotides, with a sugar-phosphate backbone and nitrogenous bases projecting inward. The
specific sequence of these bases encodes genetic information. Adenine pairs with thymine,
forming a stable base pair, while guanine pairs with cytosine. This complementary base-
pairing ensures the faithful transmission of genetic information during processes like DNA
replication.
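The pairing rules above (adenine with thymine, guanine with cytosine) can be written down directly as a lookup table. The following is a minimal sketch in Python; the function names `complement` and `reverse_complement` are illustrative choices and do not come from any particular library.

```python
# Watson-Crick pairing rules described above: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Because each base determines its partner, applying `complement` twice returns the original strand, which is the property that makes faithful copying possible.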
Beyond its role as a genetic blueprint, DNA is a dynamic molecule involved in various
cellular processes. Genes, segments of DNA that code for proteins, contribute to the synthesis
of molecules essential for cell structure, function, and regulation. Non-coding regions of
DNA, once considered "junk DNA," are now recognized for their regulatory roles,
influencing gene expression and contributing to the complexity of cellular processes.
The evolution of genome sequencing techniques has been a transformative force in genomics
research. Early methods, such as Maxam-Gilbert sequencing and Sanger sequencing, were
groundbreaking but limited in their scalability and efficiency. The advent of Next-Generation
Sequencing (NGS) technologies in the 21st century marked a paradigm shift, enabling the
simultaneous sequencing of millions to billions of DNA fragments.
NGS platforms, including Illumina, Ion Torrent, and Pacific Biosciences, utilize various
sequencing-by-synthesis or sequencing-by-ligation approaches. These technologies offer
high-throughput capabilities, allowing researchers to sequence entire genomes quickly and
cost-effectively. The Human Genome Project, completed in 2003 using mainly Sanger
sequencing, predates these platforms; the reference genome it produced serves as a
cornerstone for the NGS-based genomic exploration that followed.
Genomes exhibit remarkable diversity within and between species, manifesting as genomic
variations. These variations can take the form
Functional genomics is the science that studies, on a genome-wide scale, the relationships
among the components of a biological system (genes, transcripts, proteins, metabolites, etc.)
and how these components work together to produce a given phenotype. The term "functional
genomics" took root in the scientific community when the first genome
sequencing projects emerged. These projects ultimately aimed to determine the complete genome
sequence of a given organism and to annotate functionally relevant features therein, such as
protein-coding and non-coding genes as well as DNA regulatory regions. The landmark
endeavour of this kind is the Human Genome Project (HGP), a worldwide collaborative project
launched in 1990 and officially completed in 2003 (International Human Genome
Sequencing Consortium [33]). However, the first completely sequenced genome from a
eukaryote, that of the budding yeast Saccharomyces cerevisiae, had already been released in 1996
[34] and provided material to start exploring the complex relationships between genes and
gene products at the genome scale. Indeed, a tentative definition of functional genomics was
first published in 1997 by Hieter and Boguski [35], who state at the beginning of their paper:
"An informal poll of colleagues indicates that the term [functional genomics] is widely used,
but has many different interpretations. There is even some sentiment that the term is
unnecessary and that it does nothing more than refer to biological research as a whole."
Beyond the static representation of the genome, functional genomics explores how genes and
their interactions contribute to biological functions. It involves deciphering the roles of
individual genes, understanding gene regulation, and uncovering the networks that govern
cellular processes. Systems biology, an interdisciplinary approach, integrates genomics data
Functional genomics employs tools such as gene expression profiling, RNA interference
(RNAi), and CRISPR-Cas9 gene editing to unveil the functions of genes and their products.
By dissecting the molecular mechanisms underlying cellular processes, functional genomics
provides valuable insights into the relationships between genotype and phenotype.
Understanding functional genomics and systems biology is pivotal for comprehending the
complexity of living organisms and elucidating the molecular basis of health and disease. As
genomics advances, these approaches contribute to a more nuanced understanding of the
interplay between genes and their functional outcomes.
Chapter 3
3. INTRODUCTION TO AI & ML
Artificial intelligence (AI) is the development of computer systems that are able to perform
tasks that normally require human intelligence. Advances in AI software and hardware,
especially deep learning algorithms and the graphics processing units (GPUs) that power their
training, have led to a recent and rapidly increasing interest in medical AI applications. In
clinical diagnostics, AI-based computer vision approaches are poised to revolutionize image-
based diagnostics, while other AI subtypes have begun to show similar promise in various
diagnostic modalities.
In some areas, such as clinical genomics, a specific type of AI algorithm known as deep
learning is used to process large and complex genomic datasets. In this review, we first
summarize the main classes of problems that AI systems are well suited to solve and describe
the clinical diagnostic tasks that benefit from these solutions. Next, we focus on emerging
methods for specific tasks in clinical genomics, including variant calling, genome annotation
and variant classification, and phenotype-to-genotype correspondence. Finally, we end with a
discussion on the future potential of AI in individualized medicine applications, especially for
risk prediction in common complex diseases, and the challenges, limitations, and biases that
must be carefully addressed for the successful deployment of AI in medical applications,
particularly those utilizing human genetics and genomics data.
system, for example, clinical images that have been labeled and interpreted by a human
expert. The AI system then learns to execute the interpretation task on new health data of the
same type, which in clinical diagnostics is often the identification or forecasting of a disease
state.
Machine learning (ML) and deep learning are fields of study frequently mentioned in the
context of AI. Both kinds of learning are subfields of AI. Machine learning is a process by
which machines can be given the capability to learn about a given dataset without being
explicitly programmed on what to learn.
Machines can usually learn in either a supervised or unsupervised manner. Under supervised
learning, scientists provide machines with separate training and test data sets. The training
data has defined categories (e.g., people with coronary heart disease and those without) that
the machine can use to infer hidden qualities of the data and distinguish the categories from
each other. It is then able to use this knowledge to work on the test data and make informed
predictions (e.g., which people in a population are likely to develop coronary heart disease).
In an unsupervised learning setting, machines recognize patterns in large datasets
without being given labeled categories, for example by grouping similar samples together.
Deep learning is a relatively modern technique used to implement machine
learning, and it can be applied in both supervised and unsupervised settings. A deep
learning algorithm takes a dataset and finds patterns and critical information
using artificial neural networks, computing systems loosely inspired by how a human
brain's neurons interact with each other. These networks learn to weigh
the importance of some data versus others through adjustable weights and bias terms.
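The supervised workflow described above (learn from training data with defined categories, then predict on separate test data) can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-centroid classifier is a deliberately simple stand-in for a real model, and the feature values (age, cholesterol) and labels are made-up toy numbers, not clinical data.

```python
from statistics import mean

def fit_centroids(samples, labels):
    """Compute one mean feature vector (centroid) per labeled category."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, sample):
    """Assign a sample to the category whose centroid is nearest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], sample))

# Toy training data (made up): [age, cholesterol] with a known category.
train_x = [[45, 180], [50, 190], [62, 260], [68, 280]]
train_y = ["no_chd", "no_chd", "chd", "chd"]

# Held-out test data that the model never sees during training.
test_x = [[48, 185], [65, 270]]
test_y = ["no_chd", "chd"]

centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Keeping the test set separate is the important design choice here: accuracy measured on the training data itself would overstate how well the model generalizes to new individuals.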
As of 2021, 20 years have passed since the landmark completion of the draft human genome
sequence. This milestone has led to the generation of an extraordinary amount of genomic
data. Estimates predict that genomics research will generate between 2 and 40 exabytes of
data within the next decade.
DNA sequencing and other biological techniques will continue to increase the number and
complexity of such data sets. This is why genomics researchers need AI/ML-based
computational tools that can handle, extract and interpret the valuable information hidden
within this large trove of data.
The integration of Artificial Intelligence (AI) and Machine Learning (ML) in genomics has
become increasingly important due to the complexity and vast amount of data involved in
understanding the genetic makeup of individuals and populations. Here are several reasons
why AI and ML are crucial in genomics:
2. Pattern Recognition:
The ability to identify subtle patterns is crucial for understanding the genetic
basis of diseases and developing targeted treatments.
4. Personalized Medicine:
By analyzing genomic data, AI can help identify the most effective treatments
and predict potential adverse reactions to specific drugs.
5. Variant Interpretation:
6. Drug Discovery:
Targeting specific genetic markers associated with diseases can lead to more
effective and personalized therapeutic interventions.
7. Functional Genomics:
8. Data Integration:
Although the use of AI/ML tools in genomics is still at an early stage, researchers have
already benefited from developing programs that assist in specific ways.
Using machine learning techniques to identify the primary kind of cancer from a
liquid biopsy.
Using deep learning to improve the function of gene editing tools such as CRISPR.
These are just a few ways by which AI/ML methods are helping predict and identify hidden
patterns in genomic data. Scientists are also using AI/ML to predict future variations in the
genomes of the influenza and SARS-CoV-2 viruses to assist public health efforts.
Chapter 4
Advancements in genomics have ushered in an era of unprecedented data generation, with the
capability to sequence entire genomes swiftly and cost-effectively. However, the sheer
volume and complexity of genomic data present challenges in analysis, interpretation, and
deriving meaningful insights. This is where the integration of Artificial Intelligence (AI) and
Machine Learning (ML) emerges as a transformative force in genomics research.
Artificial Intelligence (AI), a branch of computer science that aims to create intelligent
machines capable of learning and problem-solving, has found a fertile ground in genomics.
AI algorithms, particularly those based on deep learning, demonstrate remarkable capabilities
in pattern recognition and feature extraction, making them well-suited for handling intricate
genomic datasets.
AI is applied to tasks such as variant calling, where it can accurately identify genetic
variations from raw sequencing data, and genomic annotation, where it assists in interpreting
the functional significance of genetic variants. Additionally, AI plays a pivotal role in
predicting the three-dimensional structures of proteins, aiding in understanding their
functions and interactions.
In the last decades, ML has been widely used in many areas of "omics" sciences, especially
those characterized by the production of large amounts of data and/or complex mechanisms
governed by the synergic participation of different factors. Important applications include:
prediction of DNA regulatory regions; discovery of cell morphology and spatial organization;
Machine Learning (ML), a subset of AI, involves the development of algorithms that enable
computers to learn patterns and make predictions without explicit programming. In genomics,
ML algorithms contribute to a wide array of applications, including disease prediction, drug
discovery, and population genetics.
variations and disease susceptibility. These predictive models offer valuable insights into
personalized medicine, guiding clinicians in tailoring treatments based on individual genetic
profiles.
In contrast, negative feedback guides the ML model to avoid making the same
decision again in the future. Compared with supervised or unsupervised ML techniques,
reinforcement learning plays a minor part in precision medicine approaches because it
depends on direct feedback. Machine learning tasks are commonly grouped into three types: classification,
clustering, and regression. Supervised learning techniques include classification and
regression, whereas clustering is an unsupervised learning technique. Classification uses
labels and parameters to predict discrete, categorical response values, such as detecting
malignancy in biopsy samples. Clustering is used to segment data, for example, to
determine the prevalence of a disease in a given community as a result of pollution or chemical
spills. Regression predicts continuous numeric responses to discover administrative
trends, such as the time interval between a patient's discharge and readmission to the
hospital.
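Clustering, the unsupervised task in the paragraph above, can be illustrated with k-means. The text does not name a specific clustering algorithm, so k-means is an assumption made here for illustration, and the measurements are made-up toy values. The sketch groups one-dimensional values into two clusters without using any labels.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Cluster numbers into k >= 2 groups with plain k-means (Lloyd's algorithm)."""
    # Deterministic start: spread the initial centres evenly over the sorted data.
    ordered = sorted(values)
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins the group of its nearest centre.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Update step: move each centre to the mean of its group.
        # An empty group keeps its previous centre.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups

# Made-up measurements that visibly fall into a low and a high group.
centres, groups = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.5, 4.5])
print([round(c, 2) for c in centres])  # [1.0, 5.0]
print(groups)
```

Unlike classification, no category names are supplied; the algorithm only discovers that the data split into two groups, and interpreting those groups is left to the researcher.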
It also identifies phenotypes, decodes clinical statements from death certificates and
post-mortem reports of patients, identifies cardiovascular diseases, cancer, and symptoms related
to different diseases, predicts and intervenes on risk, and supports paneling and resourcing. In
precision medicine, the following algorithms are generally used: SVM, genetic
algorithm, hidden Markov model (HMM), linear regression, DA, decision tree, logistic regression, Naïve
Bayes, deep-learning models, random forest, and K-nearest neighbor (KNN).
SVM : SVMs classify and analyze symptoms to improve diagnostic accuracy.
Other contributions of SVM in precision medicine include identifying biomarkers
of neurological and psychological diseases and analyzing SNPs in studies of multiple
myeloma and breast cancer. SVM has also been used to analyze clinical, pathological, and
epidemiological data in work on breast and cervical cancer, and clinical, molecular,
and genomic data in work on oral cancer and the diagnosis of mental disease.
Logistic Regression : This algorithm can evaluate the potential risk of several
complex diseases such as breast cancer and tuberculosis. It also contributes to
assessing patient survival rates and identifying cardiovascular disease. By analyzing
prognostic factors, it can support the diagnosis of pulmonary thromboembolism (PTE) and
non-Hodgkin's lymphoma.
Random Forest : This algorithm has been widely employed in several parts of the
healthcare system. Its reported contributions include predicting the
metabolic pathways of individuals, predicting the results of a patient's encounter with a
psychiatrist, predicting mortality of ICU patients, classifying and diagnosing
Alzheimer's disease, monitoring medical wireless sensors, detecting knee
osteoarthritis, predicting healthcare costs, diagnosing mental illness, identifying
non-medical factors related to health, predicting the risk of emergency admission,
forecasting disease risks from clinical error data, finding factors associated with
the diagnosis of diabetic peripheral neuropathy, identifying patients who are ready to
be discharged from the ICU, detecting depression in Alzheimer's patients, diagnosing
sleep disorders, and estimating non-assumptive diverse treatment effects.
Naïve Bayes : This algorithm is used in distinct areas of medicine, such as
predicting risk by identifying Mucopolysaccharidosis type II, utilizing censored and
time-to-event data, classifying EHRs, shaping clinical diagnosis for decision support,
extracting genome-wide data to identify Alzheimer's disease, modeling decisions
related to cardiovascular disease, measuring the quality of healthcare services, and
constructing predictive models for asthma and for brain, prostate, and breast cancer.
KNN : KNN has been employed in various scientific domains, although it has only a
few uses in the healthcare system. Examples include preserving the confidentiality
of clinical prediction in the e-Health cloud, pattern classification for
breast cancer diagnosis, pancreatic cancer prediction using published literature,
modeling diagnostic performance, detection of gastric cancer, pattern classification
for health monitoring applications, medical dataset classification, and analysis of
EHR data.
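Of the algorithms listed above, KNN is the simplest to show in full: a new sample takes the majority label of its k closest training samples. The following is a minimal sketch using only the Python standard library; the two-feature measurements and the "benign"/"malignant" labels are made up for illustration, and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the labeled training points by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up two-feature measurements with a known class for each sample.
train_x = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_x, train_y, (1.1, 1.0)))  # benign
print(knn_predict(train_x, train_y, (4.0, 4.0)))  # malignant
```

KNN has no training phase at all; every prediction scans the stored examples, which is simple but becomes slow on very large datasets.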
Chapter 5
Sequencing simply means determining the exact order of the bases in a strand of DNA.
Because bases exist as pairs, and the identity of one of the bases in the pair determines the
other member of the pair, researchers do not have to report both bases of the pair.
In the most common type of sequencing used today, called sequencing by synthesis, DNA
polymerase (the enzyme in cells that synthesizes DNA) is used to generate a new strand of
DNA from a strand of interest. In the sequencing reaction, the enzyme incorporates into the
new DNA strand individual nucleotides that have been chemically tagged with a fluorescent
label. As this happens, the nucleotide is excited by a light source, and a fluorescent signal is
emitted and detected. The signal is different depending on which of the four nucleotides was
incorporated. This method can generate 'reads' of 125 nucleotides in a row and billions of
reads at a time.
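A quick calculation shows what these read counts mean in practice. Average sequencing depth (coverage) is the total number of sequenced bases divided by the genome size. The sketch below assumes one billion reads and a human genome of roughly 3.1 billion bases; both figures are illustrative assumptions rather than values from the text, and only the read length of 125 comes from the paragraph above.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is read: (reads x length) / genome size."""
    return num_reads * read_length / genome_size

# Assumed values: 1 billion reads of 125 bases against a ~3.1 billion base genome.
depth = mean_coverage(1_000_000_000, 125, 3_100_000_000)
print(round(depth, 1))  # 40.3
```

Under these assumptions each base is read about 40 times on average, although actual coverage varies from region to region.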
To assemble the sequence of all the bases in a large piece of DNA such as a gene, researchers
need to read the sequence of overlapping segments. This allows the longer sequence to be
assembled from shorter pieces, somewhat like putting together a linear jigsaw puzzle. In this
process, each base has to be read not just once, but at least several times in the overlapping
segments to ensure accuracy.
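The jigsaw-puzzle idea can be sketched as a greedy merge of overlapping reads. This is a toy illustration only: production assemblers use far more sophisticated graph-based methods and must cope with sequencing errors and repeats, which this sketch ignores. The function names and the three example reads are made up.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len: int = 3) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps remain; the pieces stay separate
        # Join the two reads, writing the shared bases only once.
        merged = reads[i] + reads[j][size:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return max(reads, key=len)

# Three made-up overlapping reads that tile a 13-base sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))  # ATGGCGTACGTTA
```

The `min_len` threshold reflects the point made above: a longer required overlap means each base must be covered by more than one read before two pieces are joined.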
Researchers can use DNA sequencing to search for genetic variations and/or mutations that
may play a role in the development or progression of a disease. The disease-causing change
may be as small as the substitution, deletion, or addition of a single base pair or as large as a
deletion of thousands of bases.
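The smallest kind of change mentioned above, a single-base substitution, can be found by comparing a sample sequence with a reference position by position. The sketch below assumes the two sequences are already aligned and of equal length, so it cannot detect insertions or deletions; those require a proper alignment step. The sequences and the helper name `find_substitutions` are illustrative.

```python
def find_substitutions(reference: str, sample: str):
    """List (position, reference_base, sample_base) for each mismatching base.

    Assumes both sequences are already aligned and equal in length, so only
    substitutions are reported. Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

print(find_substitutions("ATGGCGTACG", "ATGGAGTACG"))  # [(5, 'C', 'A')]
```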
AI and ML play a pivotal role in predicting disease risks based on genomic information. By
analyzing patterns in genomic data from both affected and healthy individuals, ML
algorithms can identify subtle correlations between genetic variations and disease
susceptibility. These predictive models assist in assessing an individual's likelihood of
developing certain diseases, allowing for proactive and personalized healthcare strategies.
Applications in disease prediction extend to various medical fields, including cancer risk
assessment, cardiovascular disease prediction, and neurodegenerative disorders. ML models,
trained on diverse genomic datasets, contribute to more accurate risk assessments and early
interventions.
Cancer Genomics
In the last decades, the rise of NGS techniques has revolutionized the medical approach to
cancer [177]. Genomics has become increasingly important in clinical study, prevention,
treatment and monitoring practices. Cancer genomics studies differences in DNA sequences
and gene expression between tumour and normal cells, with the aim to understand the
dynamics underlying the formation and spread of tumours at the genetic, metabolic, systemic
and environmental level. The Cancer Genome Atlas [178] project
collected multi-level NGS data for 33 different types of common tumours, an enormous data
resource made available to study tumour-specific as well as recurrent cancer mechanisms.
The availability and integration of large quantities of genomic, proteomic and epigenomic
information has allowed increasingly comprehensive representations of complex dynamics,
such as cancer formation[179], to be obtained. Indeed, integration of multiple omics data can
help overcome possible noise and/or bias of single data layers, thus improving the relevance
of extracted representative features. In this framework, data integration has been an active
field of research for ML and DL techniques applied to omics data, especially cancer
genomics [180,181] (see Section 4.3 for a more detailed discussion on data integration). In
particular, the introduction of autoencoders, such as denoising autoencoders, has allowed
robust representations of heterogeneous data to be provided, and extraction of highly
representative and predictive features to be more easily performed [182–184]. Indeed, AI
applications to cancer genomics can provide useful information for a rapid growth of
precision medicine and for disease prevention and monitoring. ML applications to mutation
detection and interpretation can help in identifying cancer-predisposing genes such as
BRCA1/2 and in predicting cancer risk [185,186]. AI performances in cancer genomics are
very promising.
In 2017, Way et al. developed an ML approach based on ensemble logistic regression, which
was trained on both mutation and transcriptomic profiles of glioblastoma from The Cancer
Genome Atlas, to predict genes that may exhibit synthetic lethality in cancer cells lacking the
neurofibromin 1 tumour suppressor gene. In 2019, Das et al. implemented DiscoverSL, a
multiparameter RF classifier trained on multi-omic cancer data from The Cancer Genome
Atlas [178] to predict and visualize synthetic lethality in cancers. In 2020, Wan et al.
developed EXP2SL, a semisupervised NN-based method, which was trained on a large
collection of cancer cell line expression signatures from the LINCS1000 Program [198], to
predict cancer cell-line specific synthetic lethal interactions.
therapeutic success. This targeted approach accelerates the initial stages of drug
discovery, enabling researchers to focus their efforts on biologically relevant targets,
ultimately increasing the chances of developing successful therapeutics.
CONCLUSION
In the dynamic intersection of Artificial Intelligence (AI), Machine Learning (ML), and
genomics, we find ourselves at the forefront of a scientific revolution that is reshaping the
landscape of biological research, healthcare, and drug discovery. This report has traversed the
fundamentals of genomics, explored the transformative applications of AI and ML in
genomics, and delved into the synergy that is propelling the fields forward.
AI and ML bring forth a new era in genomics, offering solutions to challenges that were once
insurmountable. The accuracy of variant calling, the precision of disease risk prediction, and
the efficiency of drug discovery exemplify the transformative power of these technologies.
The ability to analyze massive datasets, recognize intricate patterns, and predict outcomes has
propelled genomics into a realm where the translation of research findings into tangible
applications is more rapid and effective than ever before.
Looking ahead, the implications of this intersection extend beyond the confines of scientific
laboratories. In healthcare, the promise of personalized medicine is becoming a reality, where
treatments are tailored to individual genetic profiles. In drug discovery, the acceleration of
the development pipeline is bringing potential therapeutics to market faster and with higher
success rates. The impact on agriculture, environmental conservation, and our fundamental
understanding of life itself is profound.
As we stand at the crossroads of genomics and AI/ML, the journey forward holds great
promise and responsibility. The knowledge gleaned from this synergy has the potential to not
only transform the way we approach health and disease but also contribute to a deeper
understanding of the fundamental mechanisms that govern life. By fostering interdisciplinary
collaboration, embracing ethical considerations, and pushing the boundaries of innovation,
we embark on a path where the convergence of AI, ML, and genomics continues to shape a
future rich in scientific discovery and societal impact.
REFERENCES
https://en.wikipedia.org/wiki/Genomics
https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0689-8
https://medlineplus.gov/genetics/understanding/basics/gene/
https://www.genome.gov/about-nhgri/Director/genomics-landscape/jan-5-2023-artificial-intelligence-and-machine-learning-becoming-pervasive-at-nhgri-and-in-genomics
https://www.genome.gov/about-genomics/fact-sheets/A-Brief-Guide-to-Genomics
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9198206/