This section provides a minimum introduction to R programming for handling genomic data and conducting GWAS analyses. R is a powerful statistical programming language that is widely used in bioinformatics, statistical genetics, and data analysis.
If you are a beginner with no background in programming, this tutorial will help you learn the fundamental concepts needed to work with genomic data in R.
!!! note "For beginners" This tutorial assumes no prior programming experience. We'll start with the basics and gradually build up to more complex concepts relevant to genomic data analysis.
!!! tip "Why R for genomics?" - Statistical power: R was designed for statistical analysis and has extensive statistical functions - Rich ecosystem: Thousands of packages for genomics (Bioconductor, CRAN) - Data visualization: Excellent plotting capabilities (ggplot2, base R graphics) - Reproducibility: R Markdown for creating reproducible reports - Community: Large bioinformatics and statistics community - GWAS tools: Many GWAS-specific packages available
On this page
[TOC]
R can be installed from CRAN (The Comprehensive R Archive Network) or using package managers.
!!! info "CRAN" https://cran.r-project.org/
!!! example "Check R version"
r > R.version.string [1] "R version 4.3.0 (2023-04-21)"
It is convenient to use conda to manage your R environment.
!!! example "Install R with conda"
bash conda install -c conda-forge r-base=4.x.x
Posit (formerly RStudio) is one of the most commonly used Integrated Development Environment (IDE) for R.
!!! info "Posit (RStudio)" https://posit.co/
RStudio provides:
- Syntax highlighting
- Integrated help system
- Package management
- R Markdown support
- Debugging tools
- Git integration
There are several ways to run R code:
- Interactive R (REPL): Type
Rin terminal - R scripts: Save code in
.Rfiles and run withRscript script.R - RStudio: Use the integrated console or run scripts
- R Markdown: Create reproducible reports with embedded R code
!!! example "Interactive R session" ```r $ R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
...
> print("Hello, World!")
[1] "Hello, World!"
> q() # or quit()
```
!!! example "R script"
Create a file hello.R:
r # hello.R print("Hello, World!")
Run it:
```bash
$ Rscript hello.R
[1] "Hello, World!"
```
Comments help document your code. Use # for single-line comments:
!!! example "Comments in R" ```r # This is a comment x <- 5 # This is also a comment
# Multi-line comments require # on each line
# This is line 1 of a comment
# This is line 2 of a comment
```
R has several atomic (basic) data types. The most common ones are:
R supports integers and floating-point numbers (numeric):
!!! example "Numbers" ```r # Numeric (floating-point) x <- 42 y <- 3.14159 height <- 175.5
# Integer (specify with L)
count <- 100L
# Basic arithmetic
sum_result <- 10 + 5 # 15
product <- 3 * 4 # 12
division <- 10 / 3 # 3.333...
power <- 2 ^ 3 # 8 (or 2 ** 3)
remainder <- 10 %% 3 # 1 (modulo)
integer_division <- 10 %/% 3 # 3
```
Strings are sequences of characters, enclosed in single or double quotes:
!!! example "Strings" ```r # String creation name <- "Alice" chromosome <- 'chr1' variant_id <- "rs123456"
# String operations
full_name <- paste(name, "Smith") # "Alice Smith"
full_name2 <- paste(name, "Smith", sep = "") # "AliceSmith"
repeated <- paste(rep("AT", 3), collapse = "") # "ATATAT"
length <- nchar(name) # 5
# String functions
text <- " hello world "
trimws(text) # "hello world"
toupper(text) # " HELLO WORLD "
tolower(text) # " hello world "
# String formatting (using sprintf or paste)
variant <- "rs123456"
p_value <- 0.0001
message <- paste("Variant", variant, "has p-value", p_value)
# "Variant rs123456 has p-value 0.0001"
# Or using sprintf
message2 <- sprintf("Variant %s has p-value %.4f", variant, p_value)
# "Variant rs123456 has p-value 0.0001"
```
!!! example "Working with genomic sequences" ```r # DNA sequence sequence <- "ATGCGATCG" sequence_length <- nchar(sequence) # 9
# Count GC content
gc_count <- sum(strsplit(sequence, "")[[1]] %in% c("G", "C"))
gc_content <- gc_count / sequence_length # 0.444...
# Using stringr package (if installed)
# library(stringr)
# gc_count <- str_count(sequence, "[GC]")
```
Logicals represent truth values: TRUE or FALSE (can be abbreviated as T or F):
!!! example "Logicals" ```r is_significant <- TRUE is_rare <- FALSE
# Logical operations
result1 <- TRUE & FALSE # FALSE (AND)
result2 <- TRUE | FALSE # TRUE (OR)
result3 <- !TRUE # FALSE (NOT)
# Comparisons return logicals
p_value <- 0.0001
is_significant <- p_value < 0.05 # TRUE
```
!!! tip "TRUE/FALSE vs T/F"
While T and F can be used as abbreviations, it's better practice to use TRUE and FALSE because T and F can be overwritten as variables.
Variables store values. In R, use <- or = for assignment (though <- is preferred):
!!! example "Variables" ```r # Variable assignment variant_id <- "rs123456" chromosome <- 1 position <- 12345 p_value <- 0.0001 is_significant <- p_value < 5e-8
# Variable names should be descriptive
sample_size <- 10000
allele_frequency <- 0.25
effect_size <- 0.15
# Check variable type
class(variant_id) # "character"
class(chromosome) # "numeric"
class(is_significant) # "logical"
```
!!! tip "Variable naming conventions"
- Use lowercase with dots or underscores: variant_id, p.value
- Be descriptive: chr is better than c, p_value is better than p
- Avoid R keywords: if, for, function, TRUE, FALSE, etc.
R supports various operators:
!!! info "Common operators"
| Operator | Description | Example |
|----------|-------------|---------|
| + | Addition | 3 + 5 → 8 |
| - | Subtraction | 10 - 3 → 7 |
| * | Multiplication | 4 * 5 → 20 |
| / | Division | 10 / 3 → 3.333... |
| ^ or ** | Exponentiation | 2 ^ 3 → 8 |
| %% | Modulo (remainder) | 10 %% 3 → 1 |
| %/% | Integer division | 10 %/% 3 → 3 |
| == | Equality | 5 == 5 → TRUE |
| != | Inequality | 5 != 3 → TRUE |
| < | Less than | 3 < 5 → TRUE |
| > | Greater than | 5 > 3 → TRUE |
| <= | Less than or equal | 3 <= 3 → TRUE |
| >= | Greater than or equal | 5 >= 3 → TRUE |
| & | Logical AND | TRUE & FALSE → FALSE |
| \| | Logical OR | TRUE \| FALSE → TRUE |
| ! | Logical NOT | !TRUE → FALSE |
!!! example "Operators in genomics context" ```r # Calculate odds ratio from beta beta <- 0.2 odds_ratio <- exp(beta) # e^beta
# Check if variant passes QC filters
maf <- 0.01
call_rate <- 0.98
hwe_p <- 0.001
passes_qc <- (maf > 0.01) & (call_rate > 0.95) & (hwe_p > 0.0001)
# Check genome-wide significance
p_value <- 1e-8
is_gws <- p_value < 5e-8 # TRUE
```
R provides several data structures for organizing data:
Vectors are one-dimensional arrays that can hold numeric, character, or logical data. All elements must be of the same type:
!!! example "Vectors" ```r # Create vectors using c() (combine) chromosomes <- c(1, 2, 3, 4, 5) variants <- c("rs123", "rs456", "rs789") p_values <- c(0.1, 0.05, 0.01, 0.001) logicals <- c(TRUE, FALSE, TRUE)
# Create sequences
numbers <- 1:10 # 1 2 3 4 5 6 7 8 9 10
even <- seq(2, 10, by = 2) # 2 4 6 8 10
# Vector operations
length(chromosomes) # 5
sum(p_values) # 0.161
mean(p_values) # 0.04025
min(p_values) # 0.001
max(p_values) # 0.1
# Element-wise operations
p_values * 2 # 0.2 0.1 0.02 0.002
p_values + 0.01 # 0.11 0.06 0.02 0.011
-log10(p_values) # 1.0 1.301 2.0 3.0
```
!!! example "Working with variant data in vectors" ```r # Store variant information variant_ids <- c("rs123456", "rs234567", "rs345678") chromosomes <- c(1, 1, 2) positions <- c(12345, 23456, 34567) p_values <- c(0.05, 1e-8, 0.001)
# Find significant variants
is_significant <- p_values < 5e-8
sig_variants <- variant_ids[is_significant]
# Count significant variants
sum(is_significant) # 1
```
Matrices are two-dimensional arrays with rows and columns. All elements must be of the same type:
!!! example "Matrices" ```r # Create a matrix mymatrix <- matrix(1:6, nrow = 2, ncol = 3) # [,1] [,2] [,3] # [1,] 1 3 5 # [2,] 2 4 6
# Matrix properties
ncol(mymatrix) # 3 (number of columns)
nrow(mymatrix) # 2 (number of rows)
dim(mymatrix) # 2 3 (dimensions)
length(mymatrix) # 6 (total elements)
# Matrix operations
t(mymatrix) # Transpose
mymatrix * 2 # Element-wise multiplication
mymatrix %*% t(mymatrix) # Matrix multiplication
```
!!! example "Genotype matrix" ```r # Create a simple genotype matrix (samples x variants) # 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate genotypes <- matrix(c(0, 1, 2, 1, 0, 1, 2, 1, 0), nrow = 3, ncol = 3) rownames(genotypes) <- c("Sample1", "Sample2", "Sample3") colnames(genotypes) <- c("rs123", "rs456", "rs789")
# Calculate allele frequencies
colMeans(genotypes) / 2 # Allele frequency for alternate allele
```
Lists are special vectors that can contain elements of different types, including other lists:
!!! example "Lists" ```r # Create a list mylist <- list(1, 0.02, "a", FALSE, c(1, 2, 3), matrix(1:6, nrow = 2, ncol = 3))
# Named list
variant_info <- list(
rsid = "rs123456",
chromosome = 1,
position = 12345,
p_value = 0.0001,
effect_size = 0.15
)
# Access elements
variant_info$rsid # "rs123456"
variant_info[["chromosome"]] # 1
variant_info[[1]] # "rs123456" (by position)
# Add elements
variant_info$maf <- 0.25
```
!!! example "Storing multiple variants" ```r # List of variant information variants <- list( rs123456 = list(chr = 1, pos = 12345, p = 1e-8, beta = 0.15), rs234567 = list(chr = 2, pos = 23456, p = 0.05, beta = 0.08) )
# Access nested list
variants$rs123456$chr # 1
variants$rs123456$p # 1e-8
```
Data frames are like tables (similar to Excel spreadsheets). They are lists of vectors of equal length, but each column can be a different type:
!!! example "Data Frames" ```r # Create a data frame df <- data.frame( rsid = c("rs123", "rs456", "rs789"), chromosome = c(1, 1, 2), position = c(12345, 23456, 34567), p_value = c(0.05, 1e-8, 0.001), stringsAsFactors = FALSE # Important: keep strings as character )
# View data frame
df
# rsid chromosome position p_value
# 1 rs123 1 12345 5.00e-02
# 2 rs456 1 23456 1.00e-08
# 3 rs789 2 34567 1.00e-03
# Data frame properties
nrow(df) # 3 (number of rows)
ncol(df) # 4 (number of columns)
dim(df) # 3 4
names(df) # "rsid" "chromosome" "position" "p_value"
str(df) # Structure of data frame
summary(df) # Summary statistics
```
!!! example "Working with GWAS summary statistics" ```r # Create a data frame with GWAS results gwas_results <- data.frame( rsid = c("rs123456", "rs234567", "rs345678"), chr = c(1, 1, 2), pos = c(12345, 23456, 34567), ref = c("A", "G", "T"), alt = c("G", "A", "C"), beta = c(0.15, 0.08, -0.12), se = c(0.03, 0.02, 0.04), p = c(1e-8, 0.05, 0.001), stringsAsFactors = FALSE )
# Add calculated columns
gwas_results$or <- exp(gwas_results$beta) # Odds ratio
gwas_results$mlog10p <- -log10(gwas_results$p) # -log10(p)
# Filter for significant variants
significant <- gwas_results[gwas_results$p < 5e-8, ]
```
Subsetting allows you to access specific elements from data structures:
!!! example "Vector subsetting" ```r myvector <- c(1, 2, 3, 4, 5)
myvector[1] # 1 (R uses 1-based indexing!)
myvector[1:3] # 1 2 3
myvector[c(1, 3, 5)] # 1 3 5
myvector[-1] # 2 3 4 5 (exclude first element)
myvector[-c(1, 2)] # 3 4 5
# Logical indexing
myvector[myvector > 3] # 4 5
myvector[myvector %% 2 == 0] # 2 4 (even numbers)
```
!!! example "Matrix subsetting" ```r mymatrix <- matrix(1:6, nrow = 2, ncol = 3) # [,1] [,2] [,3] # [1,] 1 3 5 # [2,] 2 4 6
mymatrix[1, 2] # 3 (row 1, column 2)
mymatrix[1, ] # 1 3 5 (entire first row)
mymatrix[, 2] # 3 4 (entire second column)
mymatrix[1:2, 2:3] # Submatrix
mymatrix[, -1] # All rows, exclude first column
```
!!! example "Data frame subsetting" ```r df <- data.frame( rsid = c("rs123", "rs456", "rs789"), chr = c(1, 1, 2), p = c(0.05, 1e-8, 0.001), stringsAsFactors = FALSE )
# Access columns
df$rsid # "rs123" "rs456" "rs789"
df[["rsid"]] # Same as above
df[, "rsid"] # Same as above
df["rsid"] # Returns data frame with one column
# Access rows
df[1, ] # First row
df[1:2, ] # First two rows
# Access specific cell
df[1, "rsid"] # "rs123"
df[1, 1] # "rs123"
# Multiple columns
df[, c("rsid", "p")]
# Logical subsetting (filtering)
df[df$p < 0.05, ] # Rows where p < 0.05
df[df$chr == 1, ] # Rows where chromosome is 1
df[df$p < 5e-8 & df$chr == 1, ] # Multiple conditions
```
!!! example "Filtering GWAS results" ```r # Filter for genome-wide significant variants gws <- df[df$p < 5e-8, ]
# Filter for chromosome 1
chr1_variants <- df[df$chr == 1, ]
# Filter with multiple conditions
filtered <- df[df$p < 0.05 & df$chr == 1, ]
# Order by p-value
df_ordered <- df[order(df$p), ]
# Top 10 most significant
top10 <- head(df_ordered, 10)
```
Control flow statements allow you to execute code conditionally or repeatedly.
Use if, else if, and else for conditional execution:
!!! example "If statements" ```r # Basic if statement p_value <- 0.0001
if (p_value < 5e-8) {
print("Genome-wide significant!")
} else if (p_value < 0.05) {
print("Nominally significant")
} else {
print("Not significant")
}
# Multiple conditions
maf <- 0.01
call_rate <- 0.98
if (maf > 0.01 & call_rate > 0.95) {
print("Passes QC")
} else {
print("Fails QC")
}
```
!!! example "ifelse() for vectorized conditionals"
r # ifelse() works on vectors p_values <- c(0.1, 0.05, 1e-8, 0.001) significance <- ifelse(p_values < 5e-8, "GWS", ifelse(p_values < 0.05, "Nominal", "NS")) # "NS" "Nominal" "GWS" "Nominal"
Loops allow you to repeat code. R has for and while loops:
!!! example "For loops" ```r # Iterate over a vector chromosomes <- c(1, 2, 3, 4, 5) for (chr in chromosomes) { print(paste("Processing chromosome", chr)) }
# Iterate with index
variants <- c("rs123", "rs456", "rs789")
for (i in 1:length(variants)) {
print(paste("Index", i, ":", variants[i]))
}
# Iterate over data frame rows (usually not recommended - use vectorized operations)
df <- data.frame(rsid = c("rs123", "rs456"), p = c(0.05, 1e-8))
for (i in 1:nrow(df)) {
if (df$p[i] < 5e-8) {
print(paste(df$rsid[i], "is significant"))
}
}
```
!!! example "While loops" ```r # While loop count <- 0 while (count < 5) { print(count) count <- count + 1 }
# Process until condition is met
p_value <- 1.0
iterations <- 0
while (p_value > 0.05 & iterations < 100) {
# Some calculation that updates p_value
p_value <- p_value * 0.9
iterations <- iterations + 1
}
```
!!! tip "Vectorization in R" R is designed for vectorized operations. Instead of loops, try to use vectorized functions:
```r
# Instead of this (slow):
result <- numeric(length(p_values))
for (i in 1:length(p_values)) {
result[i] <- -log10(p_values[i])
}
# Do this (fast):
result <- -log10(p_values)
```
Functions allow you to organize code into reusable blocks:
!!! example "Defining functions" ```r # Simple function greet <- function(name) { return(paste("Hello,", name, "!")) }
result <- greet("Alice")
print(result) # "Hello, Alice !"
# Function with multiple parameters
calculate_odds_ratio <- function(beta) {
return(exp(beta))
}
or_value <- calculate_odds_ratio(0.2)
print(or_value) # ~1.22
```
!!! example "Function with default parameters" ```r check_significance <- function(p_value, threshold = 5e-8) { return(p_value < threshold) }
# Use default threshold
is_sig1 <- check_significance(1e-8) # TRUE (uses 5e-8)
# Use custom threshold
is_sig2 <- check_significance(0.01, threshold = 0.05) # TRUE
```
!!! example "Genomics-related functions" ```r # Calculate GC content calculate_gc_content <- function(sequence) { sequence <- toupper(sequence) bases <- strsplit(sequence, "")[[1]] gc_count <- sum(bases %in% c("G", "C")) return(gc_count / nchar(sequence)) }
# Filter variants
filter_variants <- function(variants, min_maf = 0.01, max_p = 0.05) {
filtered <- variants[variants$maf >= min_maf & variants$p <= max_p, ]
return(filtered)
}
# Usage
seq <- "ATGCGATCG"
gc <- calculate_gc_content(seq)
print(paste("GC content:", round(gc * 100, 2), "%")) # "GC content: 44.44 %"
variants <- data.frame(
rsid = c("rs123", "rs456", "rs789"),
maf = c(0.05, 0.005, 0.02),
p = c(0.01, 0.03, 0.1),
stringsAsFactors = FALSE
)
filtered <- filter_variants(variants, min_maf = 0.01, max_p = 0.05)
print(filtered$rsid) # "rs123"
```
Reading from and writing to files is essential for working with genomic data:
!!! example "Reading files" ```r # Read tab-separated file data <- read.table("variants.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Read comma-separated file
data <- read.csv("variants.csv", stringsAsFactors = FALSE)
# Read with more control
data <- read.table("sumstats.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE,
na.strings = c("NA", ".", ""))
```
!!! example "Writing files" ```r # Write tab-separated file write.table(data, "output.txt", sep = "\t", row.names = FALSE, quote = FALSE)
# Write comma-separated file
write.csv(data, "output.csv", row.names = FALSE)
# Write with more control
write.table(data, "output.txt",
sep = "\t",
row.names = FALSE,
col.names = TRUE,
quote = FALSE)
```
!!! example "Processing GWAS summary statistics" ```r # Read sumstats file sumstats <- read.table("sumstats.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Filter for significant variants
significant <- sumstats[sumstats$P < 5e-8, ]
# Add calculated column
significant$MLOG10P <- -log10(significant$P)
# Write results
write.table(significant,
"significant_variants.txt",
sep = "\t",
row.names = FALSE,
quote = FALSE)
```
!!! tip "Reading large files"
For very large files, consider:
- data.table::fread() - Much faster for large files
- readr::read_tsv() - Part of tidyverse, faster than base R
- Reading in chunks if memory is limited
R's power comes from its extensive package ecosystem. For genomics, key packages include:
!!! example "Installing packages" ```r # Install from CRAN install.packages("ggplot2")
# Install from Bioconductor (for genomics)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
# Install from GitHub
# install.packages("devtools")
# devtools::install_github("user/package")
```
!!! example "Loading packages" ```r # Load a package library(ggplot2)
# Or use package::function() without loading
ggplot2::ggplot(data, aes(x = x, y = y))
```
!!! info "Essential R packages for genomics"
| Package | Purpose | Installation |
|---------|---------|--------------|
| data.table | Fast data manipulation | install.packages("data.table") |
| dplyr | Data manipulation (tidyverse) | install.packages("dplyr") |
| ggplot2 | Data visualization | install.packages("ggplot2") |
| Bioconductor | Genomics packages | See Bioconductor website |
| GenomicRanges | Genomic interval operations | Bioconductor |
| VariantAnnotation | VCF file handling | Bioconductor |
| rtracklayer | Import/export genomic data | Bioconductor |
!!! example "Using dplyr for data manipulation" ```r library(dplyr)
# Read data
sumstats <- read.table("sumstats.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Filter, mutate, and arrange
results <- sumstats %>%
filter(P < 5e-8) %>%
mutate(MLOG10P = -log10(P)) %>%
arrange(P) %>%
select(RSID, CHR, POS, P, MLOG10P)
# Group by chromosome and summarize
by_chr <- sumstats %>%
group_by(CHR) %>%
summarize(
n_variants = n(),
n_significant = sum(P < 5e-8),
mean_p = mean(P)
)
```
Here are some practical examples combining the concepts above:
!!! example "Parse and filter GWAS results" ```r # Function to process GWAS summary statistics process_gwas_results <- function(filename, p_threshold = 5e-8) { # Read file data <- read.table(filename, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Filter for significant variants
significant <- data[data$P < p_threshold, ]
# Calculate -log10(p)
significant$MLOG10P <- -log10(significant$P)
# Summary statistics
results <- list(
total = nrow(data),
significant = nrow(significant),
by_chromosome = table(significant$CHR)
)
return(list(data = significant, summary = results))
}
# Usage
gwas_output <- process_gwas_results("sumstats.txt")
print(gwas_output$summary)
```
!!! example "Calculate allele frequencies" ```r # Calculate minor allele frequency from genotype counts calculate_maf <- function(genotype_counts) { # genotype_counts should be a named vector: c(AA = 100, Aa = 50, aa = 10) total <- sum(genotype_counts) if (total == 0) return(0.0)
# Count alleles
a_count <- 2 * genotype_counts["AA"] + genotype_counts["Aa"]
total_alleles <- 2 * total
maf <- a_count / total_alleles
# Return the minor (less frequent) allele frequency
return(min(maf, 1 - maf))
}
# Example usage
counts <- c(AA = 100, Aa = 50, aa = 10)
maf <- calculate_maf(counts)
print(paste("Minor allele frequency:", round(maf, 3)))
```
!!! example "Manhattan plot data preparation" ```r # Prepare data for Manhattan plot prepare_manhattan_data <- function(sumstats) { # Add -log10(p) sumstats$MLOG10P <- -log10(sumstats$P)
# Calculate cumulative positions for plotting
sumstats <- sumstats[order(sumstats$CHR, sumstats$POS), ]
cumulative_pos <- 0
chr_breaks <- numeric()
for (chr in unique(sumstats$CHR)) {
chr_data <- sumstats[sumstats$CHR == chr, ]
max_pos <- max(chr_data$POS)
sumstats$cumulative_pos[sumstats$CHR == chr] <-
cumulative_pos + sumstats$POS[sumstats$CHR == chr]
cumulative_pos <- cumulative_pos + max_pos
chr_breaks <- c(chr_breaks, cumulative_pos)
}
return(sumstats)
}
```
R has extensive built-in statistical functions:
!!! example "Normal distribution functions" ```r # Probability density function dnorm(1.96) # 0.05844094
# Cumulative distribution function
pnorm(1.96) # 0.9750021
pnorm(1.96, lower.tail = FALSE) # 0.0249979
# Quantile function
qnorm(0.975) # 1.959964
# Generate random values
rnorm(10, mean = 0, sd = 1)
```
!!! example "Chi-square distribution functions" ```r # Probability density function dchisq(5, df = 3)
# Cumulative distribution function
pchisq(5, df = 3)
# Quantile function
qchisq(0.95, df = 3)
# Generate random values
rchisq(10, df = 3)
```
!!! example "Linear and logistic regression" ```r # Linear regression # y ~ x1 + x2 means: y is modeled as a function of x1 and x2 model <- lm(formula = phenotype ~ genotype + age + sex, data = mydata)
# View results
summary(model)
coefficients(model)
# Logistic regression
logistic_model <- glm(formula = disease ~ genotype + age + sex,
family = binomial(link = "logit"),
data = mydata)
summary(logistic_model)
```
!!! example "GWAS association test" ```r # Simple GWAS association test # Assuming you have phenotype and genotype data perform_gwas <- function(phenotype, genotype, covariates = NULL) { if (is.null(covariates)) { model <- lm(phenotype ~ genotype) } else { # Include covariates formula_str <- "phenotype ~ genotype" for (cov in names(covariates)) { formula_str <- paste(formula_str, "+", cov) } model <- lm(as.formula(formula_str), data = cbind(phenotype, genotype, covariates)) }
results <- summary(model)$coefficients
return(list(
beta = results["genotype", "Estimate"],
se = results["genotype", "Std. Error"],
p = results["genotype", "Pr(>|t|)"]
))
}
```
!!! tip "Code organization"
- Use meaningful variable names
- Add comments for complex logic
- Break code into functions
- Keep functions focused on one task
- Use consistent style (consider styler package)
!!! tip "Performance"
- Prefer vectorized operations over loops
- Use data.table or dplyr for large data frames
- Consider parallel processing for intensive computations
- Profile code to identify bottlenecks
!!! tip "Reproducibility"
- Set random seeds: set.seed(123)
- Use sessionInfo() to record package versions
- Consider using renv for package management
- Use R Markdown for reproducible reports
!!! tip "Error handling"
r # Use tryCatch for error handling result <- tryCatch({ read.table("file.txt", header = TRUE) }, error = function(e) { print(paste("Error reading file:", e$message)) return(NULL) })
Now that you understand R basics, you can:
- Learn tidyverse: Modern R packages for data manipulation (dplyr, tidyr, ggplot2)
- Explore Bioconductor: Genomics-specific packages and workflows
- Practice with real data: Work with actual GWAS summary statistics
- Learn visualization: ggplot2 for publication-quality plots
- Advanced topics: Object-oriented programming, package development, Shiny apps
- Official R documentation: https://www.r-project.org/
- R for Data Science (book by Hadley Wickham): https://r4ds.hadley.nz/
- Bioconductor: https://bioconductor.org/
- CRAN: https://cran.r-project.org/
- RStudio Cheat Sheets: https://posit.co/resources/cheatsheets/