tidyhte provides tidy semantics for the estimation of heterogeneous
treatment effects using Kennedy’s (2023) doubly-robust learner.
The package includes comprehensive automated tests and continuous integration across multiple platforms, ensuring reliability for production research use.
While heterogeneous treatment effect estimation has become increasingly
important in applied research, existing tools often require substantial
statistical expertise to implement correctly. Researchers must navigate
complex decisions about cross-validation, model selection, and valid
inference. tidyhte
addresses these challenges by:
- Implementing state-of-the-art doubly-robust methods with automatic cross-validation
- Using intuitive “recipe” semantics familiar to R users
- Scaling easily from single to multiple outcomes and moderators
- Providing built-in diagnostics for model quality
- Returning results in tidy formats for easy visualization
This makes modern HTE methods accessible to applied researchers who need to understand treatment effect variation but may not be experts in causal inference and machine learning.
tidyhte
is designed to support research across multiple domains where
understanding treatment effect variation is crucial:
- Clinical Trials: Identify patient subgroups that benefit most from medical treatments
- Policy Evaluation: Understand which populations are most affected by policy interventions
- Technology & A/B Testing: Optimize product features for different user segments
- Economics: Study heterogeneous effects of economic policies across demographics
- Education: Evaluate differential impacts of educational interventions
The best place to start learning how to use tidyhte is the set of
vignettes, which run through example analyses from start to finish:
vignette("experimental_analysis")
and
vignette("observational_analysis")
. There is also a writeup
summarizing the method and implementation in
vignette("methodological_details")
, which includes a quasi-real world
example using the Palmer Penguins dataset.
For a quick start with default settings, simply use basic_config()
:
library(tidyhte)
# Use all defaults - linear models for nuisance functions
results <- data %>%
attach_config(basic_config()) %>%
make_splits(user_id) %>%
produce_plugin_estimates(outcome, treatment, x1, x2, x3) %>%
construct_pseudo_outcomes(outcome, treatment) %>%
estimate_QoI(x1, x2)
When choosing machine learning algorithms for the ensemble, consider a progression like the following based on your subject-matter expertise:
- Start simple: "SL.glm" (linear models) for initial exploration
- Add interactions: "SL.glm.interaction" when you expect treatment effects to vary with covariates
- Regularization: "SL.glmnet" for higher-dimensional data (e.g. many interactions; note "SL.glmnet.interaction") or when overfitting is a concern
- Flexible models: "SL.ranger" (random forests) or "SL.xgboost" (gradient boosting) when relationships are complex
The SuperLearner ensemble will automatically weight these models. Start
with 2-3 algorithms and add complexity as needed. See
SuperLearner::listWrappers()
for all available options.
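As a sketch of this progression, a configuration might grow as follows, using the `basic_config()`, `add_propensity_score_model()`, and `add_outcome_model()` helpers shown elsewhere in this README (the specific learner mix is illustrative, not a recommendation):

```r
library(tidyhte)

# Stage 1: linear models only -- basic_config() starts the
# treatment and outcome ensembles with "SL.glm" by default.
cfg_simple <- basic_config()

# Stage 2: add regularization and a flexible learner once the
# simple fit is understood. SuperLearner weights each candidate
# by its cross-validated performance, so weak learners are
# down-weighted automatically.
cfg_flexible <- basic_config() %>%
  add_propensity_score_model("SL.glmnet") %>%
  add_outcome_model("SL.glmnet") %>%
  add_outcome_model("SL.ranger")
```

Either configuration can then be passed to `attach_config()` exactly as in the examples below.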
Install the released version of tidyhte from CRAN:
install.packages("tidyhte")
Or install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("ddimmery/tidyhte")
To set up a simple configuration, it’s straightforward to use the Recipe
API. For complete examples with data, see
vignette("experimental_analysis")
and
vignette("observational_analysis")
.
library(tidyhte)
library(dplyr)
hte_cfg <- basic_config() %>%
add_propensity_score_model("SL.glmnet") %>%
add_outcome_model("SL.glmnet") %>%
add_moderator("Stratified", x1, x2) %>%
add_moderator("KernelSmooth", x3) %>%
add_vimp(sample_splitting = FALSE)
The basic_config()
includes a number of defaults: it starts off the
SuperLearner ensembles for both treatment and outcome with linear models
("SL.glm"
).
data <- data %>%
attach_config(hte_cfg) %>%
make_splits(userid, .num_splits = 12) %>%
produce_plugin_estimates(
outcome_variable,
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(outcome_variable, treatment_variable)
results <- data %>%
estimate_QoI(covariate1, covariate2)
Getting information on estimated CATEs for a moderator not included previously would just require rerunning the final line:
results <- data %>%
estimate_QoI(covariate3)
Replicating this on a new outcome would be as simple as running the following, with no reconfiguration necessary.
results <- data %>%
attach_config(hte_cfg) %>%
produce_plugin_estimates(
second_outcome_variable,
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(second_outcome_variable, treatment_variable) %>%
estimate_QoI(covariate1, covariate2)
This makes it straightforward to chain together analyses across many outcomes:
library("foreach")
data <- data %>%
attach_config(hte_cfg) %>%
make_splits(userid, .num_splits = 12)
foreach(outcome_str = list_of_outcome_strs, .combine = "bind_rows") %do% {
data %>%
produce_plugin_estimates(
.data[[outcome_str]],
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(.data[[outcome_str]], treatment_variable) %>%
estimate_QoI(covariate1, covariate2) %>%
mutate(outcome = outcome_str)
}
The function estimate_QoI
returns results in a tibble format which
makes it easy to manipulate or plot results.
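For instance, standard dplyr and ggplot2 workflows apply directly to the returned tibble. The sketch below assumes columns named `estimand`, `term`, `value`, `estimate`, and `std_error` for illustration; check `names(results)` for the actual schema returned by your version of the package:

```r
library(dplyr)
library(ggplot2)

# Illustrative only: plot stratified CATE estimates for one moderator
# with 95% confidence intervals. Column names are assumptions.
results %>%
  filter(estimand == "MCATE", term == "covariate1") %>%
  ggplot(aes(x = value, y = estimate)) +
  geom_pointrange(aes(
    ymin = estimate - 1.96 * std_error,
    ymax = estimate + 1.96 * std_error
  )) +
  labs(x = "covariate1", y = "Estimated CATE")
```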
HTE estimation with cross-validation and ensemble machine learning can be computationally intensive. Plan accordingly for larger datasets or analyses with many outcomes.
Parallelism is not currently managed through the package directly, but
can be easily supported using a parallel backend with foreach
:
library(doParallel)
registerDoParallel(cores = 4)
foreach(outcome_str = outcome_list_str, .combine = "bind_rows") %dopar% {
data %>%
produce_plugin_estimates(.data[[outcome_str]], treatment, covariates) %>%
construct_pseudo_outcomes(.data[[outcome_str]], treatment) %>%
estimate_QoI(moderators)
}
There are two main ways to get help:
If you have a problem, feel free to open an issue on GitHub. Please try to provide a minimal reproducible example. If that isn’t possible, explain clearly and simply why, along with all of the relevant debugging steps you’ve already taken.
Support for the package will also be provided in the Experimentation
Community Discord.
You are welcome to come in and get support for your usage in the
tidyhte
channel. Keep in mind that everyone is volunteering their time
to help, so try to come prepared with the debugging steps you’ve already
taken.
Please note that the tidyhte project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.