The goal of quickr is to make your R code run quicker.
R is an extremely flexible and dynamic language, but that flexibility and dynamism can come at the expense of speed. This package lets you trade back some of that flexibility for speed, within the context of a single function.
The main exported function is quick(). Here is how you use it:
library(quickr)
convolve <- quick(function(a, b) {
  declare(type(a = double(NA)),
          type(b = double(NA)))
  ab <- double(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      ab[i+j-1] <- ab[i+j-1] + a[i] * b[j]
    }
  }
  ab
})

quick() returns a quicker R function. How much quicker? Let’s
benchmark it! For reference, we’ll also compare it to a
pure-C
implementation.
slow_convolve <- function(a, b) {
  declare(type(a = double(NA)),
          type(b = double(NA)))
  ab <- double(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      ab[i+j-1] <- ab[i+j-1] + a[i] * b[j]
    }
  }
  ab
}
library(quickr)
quick_convolve <- quick(slow_convolve)
convolve_c <- inline::cfunction(
  sig = c(a = "SEXP", b = "SEXP"), body = r"({
    int na, nb, nab;
    double *xa, *xb, *xab;
    SEXP ab;
    a = PROTECT(Rf_coerceVector(a, REALSXP));
    b = PROTECT(Rf_coerceVector(b, REALSXP));
    na = Rf_length(a); nb = Rf_length(b); nab = na + nb - 1;
    ab = PROTECT(Rf_allocVector(REALSXP, nab));
    xa = REAL(a); xb = REAL(b); xab = REAL(ab);
    for (int i = 0; i < nab; i++) xab[i] = 0.0;
    for (int i = 0; i < na; i++)
      for (int j = 0; j < nb; j++)
        xab[i + j] += xa[i] * xb[j];
    UNPROTECT(3);
    return ab;
  })")
a <- runif(100000); b <- runif(100)
timings <- bench::mark(
  r = slow_convolve(a, b),
  quickr = quick_convolve(a, b),
  c = convolve_c(a, b),
  min_time = 2
)
timings
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 r 481ms 481.18ms 2.08 782KB 8.31
#> 2 quickr 911µs 1.06ms 940. 782KB 18.2
#> 3 c 916µs 1.06ms 946. 782KB 17.9
plot(timings) + bench::scale_x_bench_time(base = NULL)

In the case of convolve(), quick() returns a function roughly 450 times quicker here (per the medians in the table above), with performance essentially matching the pure-C implementation.
quick() can accelerate any R function, with some restrictions:

- Function arguments must have their types and shapes declared using declare().
- Only atomic vectors, matrices, and arrays are currently supported: integer, double, logical, and complex.
- The return value must be an atomic array (e.g., not a list).
- Named variables must have consistent shapes throughout their lifetimes (see the sketch after the vocabulary list below).
- NA values are not supported.
- Only a subset of R’s vocabulary is currently supported:
#> [1] - : ! != (
#> [6] [ [<- [<<- { *
#> [11] / & && %*% %/%
#> [16] %% %o% ^ + <
#> [21] <- <<- <= = ==
#> [26] > >= | || Arg
#> [31] Conj Fortran Im Mod Re
#> [36] abs acos array as.double asin
#> [41] atan backsolve break c cat
#> [46] cbind ceiling character chol chol2inv
#> [51] cos crossprod declare diag dim
#> [56] double exp floor for forwardsolve
#> [61] if ifelse integer length log
#> [66] log10 logical matrix max min
#> [71] ncol next nrow numeric outer
#> [76] print prod qr.solve raw rbind
#> [81] repeat runif seq seq_along seq_len
#> [86] sin solve sqrt sum t
#> [91] tan tcrossprod which.max which.min while
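As an example of the consistent-shapes rule above, a variable may not change length during its lifetime. A hypothetical sketch (the exact compile-time diagnostic is an implementation detail):

# Not compilable: `acc` grows on each iteration
grow <- function(x) {
  declare(type(x = double(NA)))
  acc <- double(0)
  for (i in seq_along(x)) {
    acc <- c(acc, x[i] * 2)  # shape of `acc` changes across its lifetime
  }
  acc
}
# quick(grow)  # expected to fail; preallocate with `acc <- double(length(x))` instead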
Many of these restrictions are expected to be relaxed as the project matures. However, quickr is intended primarily for high-performance numerical computing, so features like polymorphic dispatch or support for complex or dynamic types are out of scope.
The shape and mode of all function arguments must be declared. Local and return variables may optionally also be declared.
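A minimal hypothetical sketch of optionally declaring a local (here, out, which is also the return value), using the same declare(type()) syntax as argument declarations:

double_it <- quick(function(x) {
  declare(type(x = double(n)))
  declare(type(out = double(n)))  # optional: local/return variable declaration
  out <- x * 2
  out
})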
declare(type()) also supports declaring size constraints, and size relationships between variables. Here are some examples of declare calls:
declare(type(x = double(NA))) # x is a 1-d double vector of any length
declare(type(x = double(10))) # x is a 1-d double vector of length 10
declare(type(x = double(1))) # x is a scalar double
declare(type(x = integer(2, 3))) # x is a 2-d integer matrix with dim (2, 3)
declare(type(x = integer(NA, 3))) # x is a 2-d integer matrix with dim (<any>, 3)
# x is a 4-d logical array with dim (<any>, 24, 24, 3)
declare(type(x = logical(NA, 24, 24, 3)))
# x and y are 1-d double vectors of any length
declare(type(x = double(NA)),
        type(y = double(NA)))
# x and y are 1-d double vectors of the same length
declare(
  type(x = double(n)),
  type(y = double(n)),
)
# x and y are 1-d double vectors, where length(y) == length(x) + 2
declare(type(x = double(n)),
        type(y = double(n+2)))

Tip: declare shapes as specifically as you can. quickr uses these size
constraints both for compile-time checking and for choosing more
efficient implementations. For example, if you know a matrix is square,
declare it as type(A = double(n, n)) (not double(n, k)). That can
allow quickr to use a faster code path for operations like
solve(A, b). If the compiler can’t prove a matrix is square, it may
need to fall back to a more general rectangular solver.
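For instance (a hypothetical sketch; which solver gets chosen is an internal detail):

# Declaring A as square lets the compiler assume nrow(A) == ncol(A)
solve_square <- quick(function(A, b) {
  declare(type(A = double(n, n)),  # square: both dimensions share the constraint `n`
          type(b = double(n)))
  solve(A, b)
})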
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states in a Hidden Markov Model (https://en.wikipedia.org/wiki/Viterbi_algorithm). Here, quick() makes viterbi() approximately 40 times faster.
slow_viterbi <- function(observations, states, initial_probs, transition_probs, emission_probs) {
  declare(
    type(observations = integer(num_steps)),
    type(states = integer(num_states)),
    type(initial_probs = double(num_states)),
    type(transition_probs = double(num_states, num_states)),
    type(emission_probs = double(num_states, num_obs)),
  )

  trellis <- matrix(0, nrow = length(states), ncol = length(observations))
  backpointer <- matrix(0L, nrow = length(states), ncol = length(observations))
  trellis[, 1] <- initial_probs * emission_probs[, observations[1]]

  for (step in 2:length(observations)) {
    for (current_state in 1:length(states)) {
      probabilities <- trellis[, step - 1] * transition_probs[, current_state]
      trellis[current_state, step] <- max(probabilities) * emission_probs[current_state, observations[step]]
      backpointer[current_state, step] <- which.max(probabilities)
    }
  }

  path <- integer(length(observations))
  path[length(observations)] <- which.max(trellis[, length(observations)])
  for (step in seq(length(observations) - 1, 1)) {
    path[step] <- backpointer[path[step + 1], step + 1]
  }

  out <- states[path]
  out
}
quick_viterbi <- quick(slow_viterbi)
set.seed(1234)
num_steps <- 16
num_states <- 8
num_obs <- 16
observations <- sample(1:num_obs, num_steps, replace = TRUE)
states <- 1:num_states
initial_probs <- runif(num_states)
initial_probs <- initial_probs / sum(initial_probs) # normalize to sum to 1
transition_probs <- matrix(runif(num_states * num_states), nrow = num_states)
transition_probs <- transition_probs / rowSums(transition_probs) # normalize rows
emission_probs <- matrix(runif(num_states * num_obs), nrow = num_states)
emission_probs <- emission_probs / rowSums(emission_probs) # normalize rows
timings <- bench::mark(
  slow_viterbi = slow_viterbi(observations, states, initial_probs,
                              transition_probs, emission_probs),
  quick_viterbi = quick_viterbi(observations, states, initial_probs,
                                transition_probs, emission_probs)
)
timings
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slow_viterbi 62.36µs 72.08µs 13291. 1.59KB 29.2
#> 2 quick_viterbi 1.68µs 1.89µs 522138. 0B 0
plot(timings)

Simulate how heat spreads over time across a 2D grid, using the finite difference method applied to the Heat Equation.
Note the use of local closures within the function passed to quick(). Here,
quick() returns a function roughly 12 times faster. The speedup
is relatively modest because the core operation of computing the
laplacian is already expressed as a vectorized operation. If we were
instead comparing against for loops that operate on individual array
elements, the speedup would be much more substantial. In general,
idiomatic vectorized R code is quite fast already!
diffuse_heat <- function(nx, ny, dx, dy, dt, k, steps) {
  declare(
    type(nx = integer(1)),
    type(ny = integer(1)),
    type(dx = integer(1)),
    type(dy = integer(1)),
    type(dt = double(1)),
    type(k = double(1)),
    type(steps = integer(1))
  )

  temp <- matrix(0, nx, ny)
  temp[nx %/% 2L, ny %/% 2L] <- 100

  apply_boundary_conditions <- function(temp) {
    temp[1, ] <- 0
    temp[nx, ] <- 0
    temp[, 1] <- 0
    temp[, ny] <- 0
    temp
  }

  update_temperature <- function(temp) {
    temp_new <- temp
    i <- 2:(nx - 1)
    j <- 2:(ny - 1)
    laplacian <-
      (temp[i + 1, j] - 2 * temp[i, j] + temp[i - 1, j]) / dx ^ 2 +
      (temp[i, j + 1] - 2 * temp[i, j] + temp[i, j - 1]) / dy ^ 2
    temp_new[i, j] <- temp[i, j] + k * dt * laplacian
    temp_new
  }

  for (step in seq_len(steps)) {
    temp <- temp |>
      apply_boundary_conditions() |>
      update_temperature()
  }

  temp
}
quick_diffuse_heat <- quick(diffuse_heat)
# Parameters
nx <- 100L # Grid size in x
ny <- 100L # Grid size in y
dx <- 1L # Grid spacing
dy <- 1L # Grid spacing
dt <- 0.01 # Time step
k <- 0.1 # Thermal diffusivity
steps <- 500L # Number of time steps
timings <- bench::mark(
  diffuse_heat = diffuse_heat(nx, ny, dx, dy, dt, k, steps),
  quick_diffuse_heat = quick_diffuse_heat(nx, ny, dx, dy, dt, k, steps)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
summary(timings, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 diffuse_heat 10.1 11.6 1 4893. Inf
#> 2 quick_diffuse_heat 1 1 13.4 1 NaN
plot(timings)

Here is quickr used to calculate a rolling mean. The CRAN package RcppRoll already provides a highly optimized rolling mean, which we include in the benchmarks for comparison.
slow_roll_mean <- function(x, weights, normalize = TRUE) {
  declare(
    type(x = double(NA)),
    type(weights = double(NA)),
    type(normalize = logical(1))
  )
  out <- double(length(x) - length(weights) + 1)
  n <- length(weights)

  if (normalize)
    weights <- weights / sum(weights) * length(weights)

  for (i in seq_along(out)) {
    out[i] <- sum(x[i:(i + n - 1)] * weights) / length(weights)
  }
  out
}
quick_roll_mean <- quick(slow_roll_mean)
x <- dnorm(seq(-3, 3, len = 100000))
weights <- dnorm(seq(-1, 1, len = 100))
timings <- bench::mark(
  r = slow_roll_mean(x, weights),
  rcpp = RcppRoll::roll_mean(x, weights = weights),
  quickr = quick_roll_mean(x, weights = weights)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
timings
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 r 68.34ms 78.79ms 9.64 124.24MB 21.2
#> 2 rcpp 5.92ms 6.29ms 155. 4.46MB 1.99
#> 3 quickr 2.15ms 2.26ms 438. 781.35KB 3.98
timings$expression <- factor(names(timings$expression), rev(names(timings$expression)))
plot(timings) + bench::scale_x_bench_time(base = NULL)

quickr supports a variety of matrix operations from base R, including
most of the common linear algebra functions like %*%, crossprod(),
tcrossprod(), solve(), and chol(). Performance is generally
similar to running the same code in R itself, but as you start to chain
multiple operations in a function, you may see speed-ups due to reduced
interpreter overhead. Actual performance depends on the BLAS/LAPACK
libraries linked to your R build (e.g. OpenBLAS, MKL, Accelerate).
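As a hypothetical sketch of such chaining (ridge_coef is illustrative, not from quickr itself; every function used appears in the supported vocabulary list above, though whether a given combination compiles may depend on the quickr version):

ridge_coef <- quick(function(X, y, lambda) {
  declare(type(X = double(n, k)),
          type(y = double(n)),
          type(lambda = double(1)))
  k <- ncol(X)
  XtX <- crossprod(X)                          # t(X) %*% X
  solve(XtX + lambda * diag(k), t(X) %*% y)    # (X'X + lambda*I)^-1 X'y
})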
To illustrate, here is a benchmark showing a fastLm-style
implementation comparing base R, quickr, and (for reference) a compiled
RcppArmadillo implementation. The fastLm function is adapted from the
README of the RcppArmadillo project.
fast_lm <- function(X, y) {
  declare(
    type(X = double(n, k)),
    type(y = double(n))
  )
  n <- nrow(X)
  k <- ncol(X)
  df_residual <- n - k

  coef <- qr.solve(X, y)   # fit model y ~ X
  res <- y - X %*% coef    # residuals
  s2 <- sum(res * res) / df_residual

  XtX <- crossprod(X)
  XtX_inv <- qr.solve(XtX, diag(k))
  std_err <- sqrt(s2 * diag(XtX_inv))

  list(
    coefficients = coef,
    stderr = std_err,
    df.residual = df_residual
  )
}
quick_fast_lm <- quick(fast_lm)
Rcpp::sourceCpp(
  code = '
#include <RcppArmadillo/Lighter>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
Rcpp::List rcpp_armadillo_fast_lm(const arma::mat& X, const arma::colvec& y) {
  int n = X.n_rows, k = X.n_cols;
  arma::colvec coef = arma::solve(X, y);      // fit model y ~ X
  arma::colvec res = y - X * coef;            // residuals
  double s2 = arma::dot(res, res) / (n - k);
  // std. errors of coefficients
  arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X) * X)));
  return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
                            Rcpp::Named("stderr") = std_err,
                            Rcpp::Named("df.residual") = n - k);
}')

set.seed(1)
beta <- c(0.5, 1.0, -2.0, 10, 5)
X <- cbind(1, matrix(rnorm(3 * 10^5), ncol = 4))
y <- as.vector(X %*% beta + rnorm(nrow(X), sd = 2))
timings <- bench::mark(
  R = fast_lm(X, y),
  quickr = quick_fast_lm(X, y),
  RcppArmadillo = rcpp_armadillo_fast_lm(X, y)
)
timings
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 R 1.28ms 1.82ms 545. 13.7MB 148.
#> 2 quickr 1.65ms 1.73ms 575. 0B 0
#> 3 RcppArmadillo 1.67ms 2.02ms 491. 0B 0
plot(timings)

Use declare(parallel()) to annotate the next for loop or sapply()
call for OpenMP parallelization. Parallel loops must be
order-independent: avoid shared-state updates or inter-iteration
dependencies. OpenMP adds overhead, so it can be slower for small
workloads, but substantially faster for larger ones.
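Here is a hedged sketch of annotating a for loop rather than an sapply() call (row_norms is illustrative; each iteration writes only its own element of out, so iterations are order-independent):

row_norms <- quick(function(x) {
  declare(type(x = double(NA, NA)))
  out <- double(nrow(x))
  declare(parallel())  # applies to the next `for` loop
  for (i in seq_len(nrow(x))) {
    out[i] <- sqrt(sum(x[i, ] * x[i, ]))
  }
  out
})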
Here is a concrete example using colSums(). At smaller sizes, the
quickr serial version is fastest (even faster than base colSums()). As
sizes grow, the two serial versions converge, and the parallel version
pulls ahead. However, the speedup is not linear with core count (e.g.,
with 12 cores, the speedup is closer to ~6x).
colSums_quick_parallel <- quick(function(x) {
  declare(type(x = double(NA, NA)))
  declare(parallel())
  sapply(seq_len(ncol(x)), \(j) sum(x[, j]))
})

colSums_quick <- quick(function(x) {
  declare(type(x = double(NA, NA)))
  sapply(seq_len(ncol(x)), \(j) sum(x[, j]))
})
r <- bench::press(
  n = 2^(4:14),
  {
    m <- array(runif(n * n), c(n, n))
    bench::mark(
      parallel = colSums_quick_parallel(m),
      quick = colSums_quick(m),
      base = colSums(m),
    )
  },
  .quiet = TRUE
)
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
r |>
  mutate(.before = 1,
         desc = attr(expression, "description")) |>
  select(desc, n, median) |>
  ggplot(aes(x = n, y = median, color = desc)) +
  geom_point() + geom_line() +
  scale_x_log10() + bench::scale_y_bench_time()

quickr does not set OpenMP thread counts. To control threads, set
OMP_NUM_THREADS (and optionally OMP_THREAD_LIMIT or OMP_DYNAMIC)
before calling a compiled function, e.g.
Sys.setenv(OMP_NUM_THREADS = "4").
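For instance, a sketch that ties the thread count to the machine (parallel::detectCores() ships with base R; set the variable before the compiled function first runs):

# Use at most half the detected cores for quickr's OpenMP regions
Sys.setenv(OMP_NUM_THREADS = as.character(max(1L, parallel::detectCores() %/% 2L)))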
When called in a package, quick() will pre-compile the quick functions
and place them in the ./src directory. Run devtools::load_all() or
quickr::compile_package() to ensure that the generated files in
./src and ./R are in sync with each other.
In a package, you must provide a function name to quick(). For
example:
my_fun <- quick(name = "my_fun", function(x) ....)
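For a fuller picture, here is a hypothetical sketch of a quick function defined in a package’s R/ directory (function and file names are illustrative):

# R/add-one.R
add_one <- quick(name = "add_one", function(x) {
  declare(type(x = double(NA)))
  x + 1
})

Running quickr::compile_package() then regenerates the corresponding files under ./src.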
install.packages("quickr")You can install the development version of quickr from GitHub with:
# install.packages("pak")
pak::pak("t-kalinowski/quickr")You will also need a C and Fortran compiler, preferably the same ones used to build R itself.
On macOS:
- Make sure Xcode command line tools and gfortran are installed, as described in https://mac.r-project.org/tools/. In Terminal, run:

  sudo xcode-select --install
  # curl -LO https://mac.r-project.org/tools/gfortran-12.2-universal.pkg  # R 4.4
  # sudo installer -pkg gfortran-12.2-universal.pkg -target /
  curl -LO https://mac.r-project.org/tools/gfortran-14.2-universal.pkg    # R 4.5
  sudo installer -pkg gfortran-14.2-universal.pkg -target /
- Optional: install flang-new via Homebrew (used by quickr on macOS when available):

  brew install flang
On Windows:
- Install the latest version of Rtools
On Linux:
- The “Install Required Dependencies” section here provides detailed instructions for installing R build tools on various Linux flavors.