Beginner Tutorial: Computational Biomarker Discovery Using RNA-seq
1. Introduction to Biomarker Discovery
Biomarker discovery is the process of identifying biological molecules (genes, proteins, metabolites) that
indicate a particular disease state or therapeutic response. In this tutorial, you'll learn how to identify
candidate gene biomarkers from RNA-seq data using R. We'll work with a real dataset: GSE5364, a breast cancer
dataset from the Gene Expression Omnibus (GEO) database.
2. Requirements
You need to install R and RStudio. Then install the required R packages by running the following code:
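# Install BiocManager only if it is missing, then use it to install the Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "GEOquery", "pheatmap", "ggplot2",
                       "EnhancedVolcano", "org.Hs.eg.db", "clusterProfiler"))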
3. Loading RNA-seq Data
We'll download the breast cancer gene expression data from GEO:
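library(GEOquery)
library(Biobase)  # provides exprs() and pData()
# getGEO() returns a list of ExpressionSet objects, one per platform
gse <- getGEO("GSE5364", GSEMatrix = TRUE)
exprSet <- exprs(gse[[1]])      # expression matrix: rows are features, columns are samples
phenoData <- pData(gse[[1]])    # sample annotation, one row per sample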
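Before going further, it can help to check what kind of values were downloaded. The sketch below uses two small made-up matrices (toy_counts and toy_norm are illustrative, not part of the GEO data) and a hypothetical helper, looks_like_counts(), to test whether a matrix holds non-negative whole numbers, which is what count-based tools such as DESeq2 expect.

```r
# Hypothetical helper: TRUE if every value is a finite, non-negative whole number.
looks_like_counts <- function(m) {
  is.numeric(m) && all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

# Toy matrices for illustration only (3 genes x 2 samples).
toy_counts <- matrix(c(10, 0, 250, 12, 3, 300), nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_norm <- log2(toy_counts + 1)  # log-scale values, no longer whole numbers

looks_like_counts(toy_counts)  # raw counts pass the check
looks_like_counts(toy_norm)    # log-transformed values do not
```

On the real data, the equivalent calls would be looks_like_counts(exprSet) and dim(exprSet).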
4. Setting Labels
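We define the sample groups (Cancer vs. Normal) from the sample titles, assuming that normal samples have
"normal" in their title:
group <- ifelse(grepl("normal", phenoData$title, ignore.case = TRUE), "Normal", "Cancer")
# List "Normal" first so that it is the reference level
group <- factor(group, levels = c("Normal", "Cancer"))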
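Pattern matching on titles is easy to get wrong, so it is worth checking the result before using it. A minimal sketch on made-up titles (toy_titles is illustrative; the real titles come from phenoData$title), which also sets the level order explicitly so that Normal is the baseline:

```r
# Made-up sample titles for illustration only.
toy_titles <- c("Breast tumor 1", "Breast normal 1", "Breast Tumor 2", "Normal breast 2")

toy_group <- ifelse(grepl("normal", toy_titles, ignore.case = TRUE), "Normal", "Cancer")
# Put "Normal" first so it acts as the reference level in later comparisons.
toy_group <- factor(toy_group, levels = c("Normal", "Cancer"))

table(toy_group)    # number of samples per group
levels(toy_group)   # the reference level is listed first
```

If table(group) on the real data shows an empty or implausibly small group, the titles probably use a different wording and the pattern needs adjusting.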
5. Differential Expression Analysis
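We use DESeq2 to find differentially expressed genes. DESeq2 models raw, non-negative integer read counts,
so confirm that exprSet contains such counts first. If the matrix holds normalized or log-scale values
instead (as many GEO series matrices do), DESeqDataSetFromMatrix() will stop with an error: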
library(DESeq2)
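# Row names of colData should match the sample columns of the count matrix
colData <- data.frame(group, row.names = colnames(exprSet))
dds <- DESeqDataSetFromMatrix(countData = exprSet, colData = colData, design = ~ group)
dds <- DESeq(dds)
# Positive log2 fold changes mean higher expression in Cancer than in Normal
res <- results(dds, contrast = c("group", "Cancer", "Normal"))
head(res[order(res$pvalue), ])  # most significant genes first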
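The results table reports both a raw p-value (pvalue) and an adjusted p-value (padj); by default DESeq2 adjusts with the Benjamini-Hochberg (BH) method. The sketch below applies the same correction to a handful of made-up p-values with base R's p.adjust() to illustrate why padj is the safer column to filter on when thousands of genes are tested at once. The gene names and values are invented for illustration.

```r
# Made-up raw p-values for five hypothetical genes.
toy_p <- c(geneA = 0.001, geneB = 0.01, geneC = 0.04, geneD = 0.045, geneE = 0.5)

# Benjamini-Hochberg adjustment controls the false discovery rate.
toy_padj <- p.adjust(toy_p, method = "BH")

# Compare how many genes pass a 0.05 cutoff before and after adjustment.
sum(toy_p < 0.05)     # 4 genes pass on raw p-values
sum(toy_padj < 0.05)  # 2 genes pass after adjustment
```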
6. Volcano Plot Visualization
Use EnhancedVolcano to plot each gene's log2 fold change against its p-value and highlight the genes that
pass both cutoffs:
library(EnhancedVolcano)
EnhancedVolcano(res,
                lab = rownames(res),
                x = "log2FoldChange",
                y = "pvalue",
                pCutoff = 0.05,
                FCcutoff = 1)
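To make the two cutoffs concrete, here is a simplified base R sketch of the classification behind a volcano plot. It is not how EnhancedVolcano is implemented internally; toy_res and its values are made up, with column names chosen to mirror a DESeq2 results table.

```r
# Made-up results for illustration only.
toy_res <- data.frame(
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2, -0.4),
  pvalue         = c(1e-6, 1e-4, 0.2, 0.2, 1e-5),
  row.names      = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# A gene is highlighted only if it passes both cutoffs.
passes_p  <- toy_res$pvalue < 0.05
passes_fc <- abs(toy_res$log2FoldChange) > 1
toy_res$status <- ifelse(passes_p & passes_fc,
                         ifelse(toy_res$log2FoldChange > 0, "Up", "Down"),
                         "Not significant")

# The y-axis of a volcano plot is -log10(p-value).
toy_res$neg_log10_p <- -log10(toy_res$pvalue)
toy_res
```

Here geneD has a large fold change but a weak p-value, and geneE has a strong p-value but a small fold change, so neither is highlighted.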
7. GO Enrichment Analysis
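Convert the significant gene symbols to Entrez IDs and test which biological processes (BP) are
over-represented among them. This step assumes that the row names of res are gene symbols: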
library(clusterProfiler)
library(org.Hs.eg.db)
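# Keep genes with adjusted p-value < 0.05 and at least a two-fold change
sig_genes <- rownames(res[which(res$padj < 0.05 & abs(res$log2FoldChange) > 1), ])
# keytype = "SYMBOL" assumes the row names are gene symbols
entrez_ids <- mapIds(org.Hs.eg.db, keys = sig_genes, column = "ENTREZID", keytype = "SYMBOL",
                     multiVals = "first")
# na.omit() drops symbols that have no Entrez ID
go_results <- enrichGO(gene = na.omit(entrez_ids),
                       OrgDb = org.Hs.eg.db,
                       ont = "BP",
                       pAdjustMethod = "BH")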
barplot(go_results, showCategory = 10)
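enrichGO() performs an over-representation analysis: for each GO term, it asks whether the significant genes contain more members of that term than expected by chance, using a hypergeometric test. A minimal sketch of that calculation for a single term, with made-up numbers (N, K, n and k are illustrative, not taken from the dataset):

```r
# Made-up numbers for a single hypothetical GO term.
N <- 20000  # genes in the background (universe)
K <- 200    # background genes annotated to the term
n <- 100    # significant genes submitted to the test
k <- 10     # significant genes annotated to the term

# Expected overlap if the significant genes were a random draw.
expected <- n * K / N

# P(overlap >= k) under the hypergeometric distribution.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

expected
p_enrich
```

enrichGO() repeats this kind of test for every term and then adjusts the resulting p-values, here with the BH method.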
8. Conclusion
In this case study, we identified candidate gene biomarkers for breast cancer using DESeq2 and visualized
them with a volcano plot. We then explored their biological roles using GO enrichment. This workflow is
a common starting point for biomarker research; candidates found this way still need validation in
independent samples before they can inform clinical diagnostics.