In randomized experiments, the average treatment effect (ATE) on a post-period outcome is often estimated by a simple difference in means. That estimator is unbiased, but it can be noisy when units differ a lot in baseline “potential” outcomes. A common fix is to use pre-experiment information (a pre-period outcome $y_{\text{pre}}$ and/or covariates $z$) to remove predictable variation before comparing arms.
This repository simulates an ad-style RCT (Randomized Controlled Trial; treatment $w$; see `data/synthetic_ads.py`), then compares several ATE estimators that differ in how aggressively they exploit pre-experiment information. Simulation settings live in `config.py` (sample size, noise, number of replications).
Run `python3 run_comparison.py`. Plots and formatted tables are written under `output/`.
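We observe post-period outcome $y_{\text{post}}$ and a binary treatment indicator $w$. The estimand is the ATE on $y_{\text{post}}$, written $\tau$.
Every estimator below is built to target that same $\tau$.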
We write outcomes with a period subscript so “y” is not a single ambiguous symbol:
- $y_{\text{pre}}$: pre-period (pre-experiment) outcome for each unit, i.e. what you would have measured before treatment starts ($t < 0$).
- $y_{\text{post}}$: post-period outcome used for the experiment’s primary analysis and for $\tau$ ($t \ge 0$).
Covariates $z$ are used by the regression-style rows (`ancova_zx`, Double ML); CUPED-style rows use $y_{\text{pre}}$.
Noise (simulator only): the synthetic data generating process (DGP) in `data/synthetic_ads.py` builds a shared nonlinear signal that enters both $y_{\text{pre}}$ and $y_{\text{post}}$, then adds:

- $\epsilon_\text{pre}$, $\epsilon_\text{post}$: idiosyncratic noise in $y_{\text{pre}}$ and $y_{\text{post}}$ (scaled by `DGP_Y_PRE_NOISE_SCALE` and `DGP_Y_POST_NOISE_SCALE` in `config.py`). They are independent of each other and of $w$.

The treatment enters $y_{\text{post}}$ through the term `true_ate * w`. Details and functional form live in `AdsRCTDGP` in code, not in the table.
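The naive estimator uses only $y_{\text{post}}$ and $w$: it is the difference in sample means of $y_{\text{post}}$ between the treated and control arms, $\hat\tau_{\text{naive}}$.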
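CUPED (Controlled-experiment Using Pre-Experiment Data) uses a pre-period outcome that is correlated with $y_{\text{post}}$ to remove predictable variation from the post-period comparison.
Here “scalar” means each unit contributes one number, $y_{\text{pre},i}$.
Define an adjusted outcome:

$$\tilde y_i = y_{\text{post},i} - \theta\,(y_{\text{pre},i} - \mu)$$

and estimate ATE by a plain difference in means on $\tilde y$.
Under random assignment, $y_{\text{pre}}$ has the same distribution in both arms, so the subtracted term has the same expectation in treatment and control and the estimand is unchanged for any fixed $\theta$.
Let $V(\theta)$ be the sampling variance of that difference in means. When the variances and covariance of the outcomes are the same in both arms (as with a constant additive effect),

$$V(\theta) \propto \mathrm{Var}(y_{\text{post}}) - 2\theta\,\mathrm{Cov}(y_{\text{post}}, y_{\text{pre}}) + \theta^2\,\mathrm{Var}(y_{\text{pre}}).$$

This quadratic is minimized at

$$\theta^\star = \frac{\mathrm{Cov}(y_{\text{post}}, y_{\text{pre}})}{\mathrm{Var}(y_{\text{pre}})},$$

which is the pooled OLS slope of $y_{\text{post}}$ on $y_{\text{pre}}$; $\hat\theta$ denotes its sample analogue.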
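The scalar CUPED recipe above can be sketched in a few lines. This is a minimal illustration using only the standard library and a toy linear DGP; the helper names (`diff_in_means`, `cuped_scalar`) and the data-generating choices are assumptions for the example, not the repository's `AdsRCTDGP` or its estimator code.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def diff_in_means(y, w):
    """Between-arm contrast: treated mean minus control mean."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(treated) - mean(control)

def cuped_scalar(y_post, y_pre, w):
    """Pooled-slope CUPED: adjust y_post, then take a difference in means."""
    theta = cov(y_post, y_pre) / cov(y_pre, y_pre)  # pooled OLS slope
    mu = mean(y_pre)                                # common reference value
    y_adj = [yp - theta * (ypre - mu) for yp, ypre in zip(y_post, y_pre)]
    return diff_in_means(y_adj, w), theta

# Toy data (hypothetical DGP): a shared baseline drives both periods.
rng = random.Random(0)
n, true_ate = 2000, 0.5
w = [rng.randint(0, 1) for _ in range(n)]
base = [rng.gauss(0, 1) for _ in range(n)]
y_pre = [b + rng.gauss(0, 0.5) for b in base]
y_post = [b + true_ate * wi + rng.gauss(0, 0.5) for b, wi in zip(base, w)]

tau_naive = diff_in_means(y_post, w)
tau_cuped, theta_hat = cuped_scalar(y_post, y_pre, w)
print(f"naive={tau_naive:.4f}  cuped={tau_cuped:.4f}  theta={theta_hat:.3f}")
```

In this toy setup the population slope is $1/1.25 = 0.8$, so `theta_hat` should land near that value.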
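For a single realized RCT dataset, it is helpful to rewrite CUPED directly in terms of the naive point estimate and the observed pre-period shift between arms. Throughout this section, a contrast is a between-arm difference in sample means for a named outcome (e.g. $\Delta_{\text{pre}} = \bar y_{\text{pre},T} - \bar y_{\text{pre},C}$ for the pre-period outcome, with $T$ and $C$ denoting the treated and control arms).
With a common reference $\mu$ and a single slope $\hat\theta$ in both arms, $\mu$ cancels in the difference in means and the CUPED estimate is

$$\hat\tau = \hat\tau_{\text{naive}} - \hat\theta\,\Delta_{\text{pre}}.$$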
Interpretation:

- Under randomization, $\mathbb{E}[\Delta_{\text{pre}}]=0$, but in any finite sample $\Delta_{\text{pre}}$ is typically nonzero because treatment and control groups differ slightly in baseline composition.
- The naive ATE is driven by between-arm mean differences in $y_{\text{post}}$. Part of the sampling variability in that between-arm contrast is tied to the same finite-sample imbalance in between-arm pre means, $\Delta_{\text{pre}}$. When $y_{\text{pre}}$ is correlated with $y_{\text{post}}$, the between-arm post contrast and $\Delta_{\text{pre}}$ covary across repeated experiments. Subtracting $\hat\theta\Delta_{\text{pre}}$ removes part of the sampling noise in the naive contrast. Under random assignment, it does not change the estimand: the ATE on $y_{\text{post}}$.
- Single sample (one run). If, by chance, the treated group has a higher pre mean ($\Delta_{\text{pre}}>0$) and pre predicts post ($\hat\theta>0$), then part of the naive post-period contrast $\hat\tau_{\text{naive}}$ is attributable to baseline composition, not the treatment. The term $-\hat\theta \cdot \Delta_{\text{pre}}$ subtracts that component so the adjusted estimate is closer to the contrast you would expect if pre means were perfectly balanced in that draw. If $\Delta_{\text{pre}}\approx 0$, there is little imbalance to remove and CUPED is close to naive.
- Repeated sampling. Over hypothetical replications of the experiment, the variance of $\hat\tau$ is smaller than that of $\hat\tau_{\text{naive}}$ when the adjustment successfully removes the covarying part of sampling noise. In a single realized trial you obtain one CUPED point estimate; its standard error and confidence interval are then typically narrower than for naive, because the sampling distribution of $\hat\tau$ is tighter (analytically, or from a unit-level bootstrap that resamples entities and recomputes $\hat\theta$ each resample).
- If $\Delta_{\text{pre}}=0$ in a particular draw, the adjustment term is zero and CUPED coincides with naive for that draw; the variance gain is a repeated-sampling property, not a guarantee in every single finite sample.
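The repeated-sampling argument can be checked with a small simulation. The sketch below repeats a toy experiment many times and records the naive contrast, the pre-period contrast, and the CUPED estimate; the DGP and the function `one_experiment` are hypothetical stand-ins, not the repository's simulator. If the reasoning holds, the naive estimate should covary strongly with $\Delta_{\text{pre}}$ while the CUPED estimate should not.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr(xs, ys):
    return cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5

def contrast(y, w):
    """Treated mean minus control mean."""
    t = [yi for yi, wi in zip(y, w) if wi == 1]
    c = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(t) - mean(c)

def one_experiment(rng, n=400, true_ate=0.5):
    """One toy RCT: returns (naive estimate, pre contrast, CUPED estimate)."""
    w = [rng.randint(0, 1) for _ in range(n)]
    base = [rng.gauss(0, 1) for _ in range(n)]
    y_pre = [b + rng.gauss(0, 0.5) for b in base]
    y_post = [b + true_ate * wi + rng.gauss(0, 0.5) for b, wi in zip(base, w)]
    theta = cov(y_post, y_pre) / cov(y_pre, y_pre)  # re-estimated every run
    tau_naive = contrast(y_post, w)
    delta_pre = contrast(y_pre, w)
    return tau_naive, delta_pre, tau_naive - theta * delta_pre

rng = random.Random(1)
draws = [one_experiment(rng) for _ in range(300)]
naive, delta_pre, cuped = (list(col) for col in zip(*draws))

corr_naive_pre = corr(naive, delta_pre)  # large: naive inherits the imbalance
corr_cuped_pre = corr(cuped, delta_pre)  # near zero after adjustment
var_ratio = cov(cuped, cuped) / cov(naive, naive)
print(f"corr(naive, d_pre)={corr_naive_pre:.2f}  "
      f"corr(cuped, d_pre)={corr_cuped_pre:.2f}  var ratio={var_ratio:.2f}")
```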
The two variants differ only in the reference value $\mu$:

- `cuped_known_pre_mean`: $\mu$ is set to the configured “known” pre mean (`DGP_PRE_MEAN_TARGET` in `config.py`), aligned with how the simulator anchors $y_{\text{pre}}$.
- `cuped_one_theta`: $\mu$ is the sample mean of $y_{\text{pre}}$ (pooled single $\hat\theta$).
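When the same $\mu$ is subtracted in both arms, it cancels in the between-arm difference, so for a given $\hat\theta$ both variants return the same point estimate.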
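A quick numerical check of how the reference value $\mu$ behaves in a difference in means: the sketch below runs pooled-slope CUPED twice on the same toy data, once with a fixed "known" mean and once with the sample mean. The value `known_pre_mean = 0.0` is a hypothetical stand-in for `DGP_PRE_MEAN_TARGET`, and the DGP is illustrative only.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def contrast(y, w):
    t = [yi for yi, wi in zip(y, w) if wi == 1]
    c = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(t) - mean(c)

def cuped(y_post, y_pre, w, mu):
    """Pooled-slope CUPED with an explicit reference value mu."""
    theta = cov(y_post, y_pre) / cov(y_pre, y_pre)
    y_adj = [yp - theta * (ypre - mu) for yp, ypre in zip(y_post, y_pre)]
    return contrast(y_adj, w)

rng = random.Random(2)
n = 1000
w = [rng.randint(0, 1) for _ in range(n)]
base = [rng.gauss(0, 1) for _ in range(n)]
y_pre = [b + rng.gauss(0, 0.5) for b in base]
y_post = [b + 0.5 * wi + rng.gauss(0, 0.5) for b, wi in zip(base, w)]

known_pre_mean = 0.0  # hypothetical stand-in for a configured target mean
tau_known = cuped(y_post, y_pre, w, mu=known_pre_mean)
tau_sample = cuped(y_post, y_pre, w, mu=mean(y_pre))
print(tau_known, tau_sample)  # equal up to floating-point error
```

Changing $\mu$ shifts every adjusted outcome by the same constant, which a between-arm difference cannot see.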
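Instead of one pooled slope $\hat\theta$, `cuped_two_theta` estimates a separate slope in each arm, $\hat\theta_T$ and $\hat\theta_C$.
Per unit:

$$\tilde y_i = y_{\text{post},i} - \hat\theta_{w_i}\,(y_{\text{pre},i} - \mu), \qquad \hat\theta_{w_i} = \hat\theta_T \text{ if } w_i = 1, \ \hat\theta_C \text{ if } w_i = 0,$$

and $\hat\tau$ is the difference in means of $\tilde y$ between arms.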
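A minimal sketch of a per-arm-slope CUPED, assuming the adjusted outcome uses the arm-specific OLS slope and a common reference $\mu$ equal to the pooled sample mean of $y_{\text{pre}}$. The function name `cuped_two_theta` mirrors the row label, but the body is an illustration on a toy DGP, not the repository's implementation.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def arm(values, w, which):
    return [v for v, wi in zip(values, w) if wi == which]

def cuped_one_theta(y_post, y_pre, w):
    theta = cov(y_post, y_pre) / cov(y_pre, y_pre)
    mu = mean(y_pre)
    adj = [yp - theta * (x - mu) for yp, x in zip(y_post, y_pre)]
    return mean(arm(adj, w, 1)) - mean(arm(adj, w, 0))

def cuped_two_theta(y_post, y_pre, w):
    """Separate OLS slope per arm; mu is the pooled sample mean (assumption)."""
    mu = mean(y_pre)
    theta = {}
    for k in (0, 1):
        yk, xk = arm(y_post, w, k), arm(y_pre, w, k)
        theta[k] = cov(yk, xk) / cov(xk, xk)
    adj = [yp - theta[wi] * (x - mu) for yp, x, wi in zip(y_post, y_pre, w)]
    return mean(arm(adj, w, 1)) - mean(arm(adj, w, 0))

rng = random.Random(3)
n, true_ate = 2000, 0.5
w = [rng.randint(0, 1) for _ in range(n)]
base = [rng.gauss(0, 1) for _ in range(n)]
y_pre = [b + rng.gauss(0, 0.5) for b in base]
y_post = [b + true_ate * wi + rng.gauss(0, 0.5) for b, wi in zip(base, w)]

tau_one = cuped_one_theta(y_post, y_pre, w)
tau_two = cuped_two_theta(y_post, y_pre, w)
print(f"one theta={tau_one:.4f}  two thetas={tau_two:.4f}")
```

With a constant additive effect, as in this toy DGP, the two arm slopes estimate the same quantity, so the two estimators should nearly coincide.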
Instead of one scalar $y_{\text{pre}}$, `cuped_multi` builds several pre-period columns (see `make_multivariate_y_pre` in `data/synthetic_ads.py`), then runs the multivariate analogue: pooled linear adjustment using all pre columns (centered at default per-column means), then difference in means on the adjusted post outcome. Extra columns help when they carry non-overlapping predictive information about $y_{\text{post}}$.
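The multivariate adjustment can be illustrated with two pre-period columns, where the pooled coefficients solve a 2×2 system of normal equations. Everything here is a sketch under assumed names and a toy DGP in which each column predicts a different part of $y_{\text{post}}$; it does not reproduce `make_multivariate_y_pre`.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def contrast(y, w):
    t = [yi for yi, wi in zip(y, w) if wi == 1]
    c = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(t) - mean(c)

def adjust_scalar(y_post, pre):
    theta = cov(y_post, pre) / cov(pre, pre)
    m = mean(pre)
    return [y - theta * (x - m) for y, x in zip(y_post, pre)]

def adjust_two_columns(y_post, pre1, pre2):
    """Pooled OLS coefficients on two centered columns via Cramer's rule."""
    s11, s22, s12 = cov(pre1, pre1), cov(pre2, pre2), cov(pre1, pre2)
    c1, c2 = cov(y_post, pre1), cov(y_post, pre2)
    det = s11 * s22 - s12 ** 2
    t1 = (c1 * s22 - c2 * s12) / det
    t2 = (c2 * s11 - c1 * s12) / det
    m1, m2 = mean(pre1), mean(pre2)
    return [y - t1 * (a - m1) - t2 * (b - m2)
            for y, a, b in zip(y_post, pre1, pre2)]

rng = random.Random(4)
n, true_ate = 2000, 0.5
w = [rng.randint(0, 1) for _ in range(n)]
b1 = [rng.gauss(0, 1) for _ in range(n)]
b2 = [rng.gauss(0, 1) for _ in range(n)]
pre1 = [v + rng.gauss(0, 0.5) for v in b1]
pre2 = [v + rng.gauss(0, 0.5) for v in b2]
y_post = [u + v + true_ate * wi + rng.gauss(0, 0.5)
          for u, v, wi in zip(b1, b2, w)]

adj_scalar = adjust_scalar(y_post, pre1)
adj_multi = adjust_two_columns(y_post, pre1, pre2)
tau_multi = contrast(adj_multi, w)
print(f"tau_multi={tau_multi:.4f}  "
      f"var scalar-adj={cov(adj_scalar, adj_scalar):.3f}  "
      f"var multi-adj={cov(adj_multi, adj_multi):.3f}")
```

The second column removes variation the first cannot, which is the "non-overlapping predictive information" case; if the columns were near-duplicates, the gain over scalar CUPED would be small.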
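Fit an OLS regression of $y_{\text{post}}$ on the treatment indicator $w$ and the covariates $z$:

$$y_{\text{post},i} = \alpha + \tau\,w_i + \beta^\top z_i + \varepsilon_i$$

and read off $\hat\tau$ as the coefficient on $w$.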
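For a single covariate, the coefficient on $w$ in such a regression can be computed without a matrix library by the Frisch-Waugh-Lovell route: residualize both $y_{\text{post}}$ and $w$ on $z$, then regress one residual on the other. The sketch below assumes one scalar covariate and a toy linear DGP; the function names are illustrative and the repository's `ancova_zx` may use more columns.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def residualize(target, z):
    """Residuals from a simple OLS fit of target on z (with intercept)."""
    slope = cov(target, z) / cov(z, z)
    mt, mz = mean(target), mean(z)
    return [t - mt - slope * (zi - mz) for t, zi in zip(target, z)]

def ancova_tau(y_post, w, z):
    """OLS coefficient on w in y_post ~ 1 + w + z, via Frisch-Waugh-Lovell."""
    y_res = residualize(y_post, z)
    w_res = residualize([float(wi) for wi in w], z)
    return sum(a * b for a, b in zip(w_res, y_res)) / sum(a * a for a in w_res)

rng = random.Random(5)
n, true_ate = 2000, 0.5
w = [rng.randint(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
y_post = [2.0 * zi + true_ate * wi + rng.gauss(0, 0.5) for zi, wi in zip(z, w)]

tau_ancova = ancova_tau(y_post, w, z)
print(f"tau_ancova={tau_ancova:.4f}")
```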
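Double machine learning targets the same partially linear structure

$$y_{\text{post}} = \tau\,w + g(z) + \varepsilon$$

but treats $g(\cdot)$ as an unknown nuisance function that is learned from data rather than assumed linear (implemented with `doubleml`).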
- `dml_plr_gbdt`: flexible gradient boosting for the outcome nuisance $g(z)$, with a random forest for $\mathbb{E}[w \mid z]$ (under RCT the latter is nearly flat, but the estimator is written generically).
- `dml_plr_lasso`: both nuisances use `LassoCV`, a linear, sparse high-dimensional workhorse. It shines when $g(z)$ is well approximated linearly; when $g(z)$ is curved or interactive, GBDT-style learners often fit better and can yield sharper ATE estimates.
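To make the partialling-out idea concrete without depending on any ML library, here is a hand-rolled, cross-fitted sketch. It deliberately swaps in a piecewise-constant (binned-mean) learner for the GBDT, random forest, and Lasso learners named above, and it does not use the `doubleml` API; all function names and the toy DGP with $g(z) = z^2$ are assumptions for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def fit_binned_mean(z, target, lo=-2.0, hi=2.0, bins=20):
    """Stand-in nuisance learner: mean of target within equal-width z bins."""
    width = (hi - lo) / bins

    def bin_of(v):
        return min(bins - 1, max(0, int((v - lo) / width)))

    sums, counts = [0.0] * bins, [0] * bins
    for zi, ti in zip(z, target):
        sums[bin_of(zi)] += ti
        counts[bin_of(zi)] += 1
    fallback = mean(target)  # used for bins with no training points
    table = [s / c if c else fallback for s, c in zip(sums, counts)]
    return lambda v: table[bin_of(v)]

def dml_plr(y, w, z, folds=2):
    """Cross-fitted partialling-out estimate of tau in y = tau*w + g(z) + e."""
    n = len(y)
    y_res, w_res = [0.0] * n, [0.0] * n
    for k in range(folds):
        test = [i for i in range(n) if i % folds == k]
        train = [i for i in range(n) if i % folds != k]
        z_tr = [z[i] for i in train]
        ell = fit_binned_mean(z_tr, [y[i] for i in train])        # E[y | z]
        m = fit_binned_mean(z_tr, [float(w[i]) for i in train])   # E[w | z]
        for i in test:  # residuals use models fit on the other fold only
            y_res[i] = y[i] - ell(z[i])
            w_res[i] = w[i] - m(z[i])
    return sum(a * b for a, b in zip(w_res, y_res)) / sum(a * a for a in w_res)

rng = random.Random(6)
n, true_ate = 2000, 0.5
w = [rng.randint(0, 1) for _ in range(n)]
z = [rng.uniform(-2, 2) for _ in range(n)]
y_post = [zi ** 2 + true_ate * wi + rng.gauss(0, 0.5) for zi, wi in zip(z, w)]

tau_dml = dml_plr(y_post, w, z)
print(f"tau_dml={tau_dml:.4f}")
```

Because $z^2$ is uncorrelated with $z$ on a symmetric range, a purely linear adjustment on $z$ would remove almost none of this signal, which is the situation where a flexible learner helps.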
`run_comparison.py` draws many independent synthetic datasets. For each replication it computes every estimator's $\hat\tau$ and writes the draws to `output/mc_draws.txt`. `output/summary_table.txt` combines point summaries (bias, RMSE, variance, `var_vs_naive`) with 90% bootstrap percentile intervals over Monte Carlo rows (`MC_BOOTSTRAP_B`, `MC_BOOTSTRAP_SEED`, `MC_CI_LEVEL = 0.90` in `config.py`); rows are sorted by `var` (ascending). Figures: `violin_ate_estimates.png`, `relative_variance.png` (bars include bootstrap error bars), `forest_tau.png` (MC mean ATE with horizontal bootstrap intervals and a line at the true $\tau$; lowest `var` / best variance reduction at the top, same ordering idea as the summary table).
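Let $\tau$ be the true ATE (`DGP_TRUE_ATE`), constant across draws, and let $\hat\tau_r$ be a method's estimate in replication $r = 1, \dots, R$.

| Column | Meaning |
|---|---|
| bias | $\frac{1}{R}\sum_r \hat\tau_r - \tau$ |
| rmse | $\sqrt{\frac{1}{R}\sum_r (\hat\tau_r - \tau)^2}$ |
| var | Sample variance of $\hat\tau_r$ across replications |
| var_vs_naive | $\mathrm{Var}(\hat\tau) / \mathrm{Var}(\hat\tau_{\text{naive}})$; values below 1 mean lower variance than naive |
| tau_mean_90% | Monte Carlo mean of $\hat\tau_r$ with 90% bootstrap percentile interval |
| var_ratio_90% | Point var_vs_naive with 90% bootstrap percentile interval for the ratio on each resample (mean ± [lo, hi]). |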
Sorting and magnitudes change if you change noise scales, n, the number of MC replications, or the DGP. Bootstrap columns quantify Monte Carlo uncertainty in the reported summaries, not uncertainty from a single field experiment.
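The summary columns and the bootstrap over Monte Carlo rows can be sketched as follows. The draws here are synthetic placeholders rather than the contents of `output/mc_draws.txt`, and the helper names, the nearest-rank percentile rule, and the choice to resample method and naive rows jointly are assumptions about one reasonable way to do it.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def summarize(draws, true_tau):
    """bias, rmse and var for one method's Monte Carlo draws."""
    bias = mean(draws) - true_tau
    rmse = mean([(d - true_tau) ** 2 for d in draws]) ** 0.5
    return {"bias": bias, "rmse": rmse, "var": sample_var(draws)}

def percentile_interval(values, level=0.90):
    """Nearest-rank percentile interval (a simple convention)."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

def bootstrap_rows(method, naive, b=500, seed=0, level=0.90):
    """Resample replication rows with replacement, same rows for both columns."""
    rng = random.Random(seed)
    r = len(method)
    means, ratios = [], []
    for _ in range(b):
        idx = [rng.randrange(r) for _ in range(r)]
        m = [method[i] for i in idx]
        nv = [naive[i] for i in idx]
        means.append(mean(m))
        ratios.append(sample_var(m) / sample_var(nv))
    return percentile_interval(means, level), percentile_interval(ratios, level)

# Placeholder draws: the method shares part of the naive estimator's noise.
rng = random.Random(7)
true_tau, R = 0.10, 200
noise = [rng.gauss(0, 0.023) for _ in range(R)]
naive_draws = [true_tau + e for e in noise]
method_draws = [true_tau + 0.6 * e + rng.gauss(0, 0.005) for e in noise]

stats = summarize(method_draws, true_tau)
var_vs_naive = stats["var"] / sample_var(naive_draws)
tau_ci, ratio_ci = bootstrap_rows(method_draws, naive_draws)
print(stats, var_vs_naive, tau_ci, ratio_ci)
```

These intervals describe Monte Carlo uncertainty in the summaries, matching the caveat above; they are not confidence intervals for a single experiment.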
For the strength of pre/post alignment in a given run batch, see `output/rho_summary.txt`, which summarizes $\rho = \mathrm{Corr}(y_{\text{post}}, y_{\text{pre}})$ across replications.
This section summarizes the latest Monte Carlo outputs written under output/ by python3 run_comparison.py (see top of this file).
The synthetic RCT is constructed so that $y_{\text{pre}}$ is strongly correlated with $y_{\text{post}}$ (from `output/rho_summary.txt`):

- $\rho = \mathrm{Corr}(y_{\text{post}}, y_{\text{pre}})$: mean $\approx 0.684$, std $\approx 0.013$, min $\approx 0.648$, max $\approx 0.720$
That correlation is exactly what scalar CUPED exploits; multivariate CUPED and DML can additionally exploit other predictive directions (multiple pre columns or nonlinear structure in $z$).
Rows in output/summary_table.txt are sorted by sampling variance var (ascending). Here are the key columns (bias, RMSE, var, and variance ratio vs naive):
| Method | bias | rmse | var | var_vs_naive |
|---|---|---|---|---|
| dml_plr_gbdt | −0.00191 | 0.01334 | 0.00018 | 0.32776 |
| cuped_multi | −0.00074 | 0.01370 | 0.00019 | 0.35215 |
| cuped_known_pre_mean | −0.00091 | 0.01665 | 0.00028 | 0.52011 |
| cuped_one_theta | −0.00091 | 0.01665 | 0.00028 | 0.52011 |
| cuped_two_theta | −0.00088 | 0.01666 | 0.00028 | 0.52061 |
| ancova_zx | −0.00128 | 0.02155 | 0.00047 | 0.87074 |
| dml_plr_lasso | −0.00130 | 0.02174 | 0.00048 | 0.88613 |
| naive | −0.00084 | 0.02307 | 0.00054 | 1.00000 |
Interpretation:

- Variance reduction: every adjusted method has `var_vs_naive < 1`. The best precision in this run is `dml_plr_gbdt` (about 0.33× the naive variance), followed by `cuped_multi` (0.35×). Scalar CUPED variants cluster at 0.52×, and linear adjustment paths (`ancova_zx`, `dml_plr_lasso`) are only modestly better than naive (0.87–0.89×).
- Bias vs RMSE: biases are small relative to RMSE, so error is dominated by variance rather than systematic shift.
- Bootstrap intervals: the table also reports 90% bootstrap intervals over Monte Carlo rows for the MC mean ATE and for the variance ratio; these quantify Monte Carlo uncertainty in the reported summaries.
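The "bias vs RMSE" point can be quantified directly from the table: since mean squared error is RMSE squared, the fraction attributable to bias is $(\text{bias}/\text{rmse})^2$. The short script below applies that arithmetic to the bias and rmse columns copied from the summary table above.

```python
# (bias, rmse) per method, copied from the summary table above.
table = {
    "dml_plr_gbdt": (-0.00191, 0.01334),
    "cuped_multi": (-0.00074, 0.01370),
    "cuped_known_pre_mean": (-0.00091, 0.01665),
    "cuped_one_theta": (-0.00091, 0.01665),
    "cuped_two_theta": (-0.00088, 0.01666),
    "ancova_zx": (-0.00128, 0.02155),
    "dml_plr_lasso": (-0.00130, 0.02174),
    "naive": (-0.00084, 0.02307),
}

# Share of mean squared error explained by squared bias.
bias_share = {name: (bias / rmse) ** 2 for name, (bias, rmse) in table.items()}

for name, share in bias_share.items():
    print(f"{name:22s} bias^2 / mse = {share:.4f}")
```

Every share is a few percent at most, so the remaining error is variance.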
All figures are generated from the same MC draws and are ordered consistently by sampling variance.
- Relative variance vs naive (`output/relative_variance.png`): bars show `Var(τ̂)/Var(naive)` with 90% bootstrap intervals.
  - Ordering: methods run left → right by increasing sampling variance (same idea as the summary table); the dashed line at 1 is naive.
  - Takeaway: `dml_plr_gbdt` and `cuped_multi` show the largest variance reduction; linear-on-$z$ methods sit only modestly below 1 (about 0.87–0.89× naive in this run), with comparatively short error bars but ratios still far from the best few.
- Forest plot of MC mean ATE (`output/forest_tau.png`): points show MC mean $\hat\tau$ with 90% bootstrap intervals; the dashed vertical line is the true ATE.
  - Precision: narrower horizontal intervals at the top = lower `var` / more stable $\hat\tau$ across MC replications; `naive` is widest at the bottom.
  - Coverage: in this snapshot, every interval crosses the true $\tau$ line; differences are mostly about width (efficiency), not obvious systematic miss.
- Violin plot of ATE draws (`output/violin_ate_estimates.png`): full distribution of $\hat\tau$ across replications; dashed horizontal line is the true ATE.
  - Spread: violins get wider left → right as `var` increases; `dml_plr_gbdt` is the tightest cloud, `naive` the most dispersed.
  - Centering: box medians sit near the true ATE for all methods here; the plot emphasizes sampling noise more than large bias.
- With $\rho \approx 0.68$, CUPED delivers meaningful variance reduction; in this DGP, multivariate CUPED improves further over scalar CUPED.
- DML with a flexible learner (`dml_plr_gbdt`) is most efficient here, consistent with nonlinear outcome structure.
- Linear adjustments on $z$ (`ancova_zx`, `dml_plr_lasso`) are close to each other and only slightly better than naive in this configuration.
These rankings can change with config.py (sample size, noise scales, DGP details, number of MC replications).
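For scalar CUPED (one $\theta$), the variance ratio at the optimal slope is approximately

$$\frac{\mathrm{Var}(\hat\tau_{\text{cuped}})}{\mathrm{Var}(\hat\tau_{\text{naive}})} \approx 1 - \rho^2,$$

where $\rho = \mathrm{Corr}(y_{\text{post}}, y_{\text{pre}})$.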
In the current run outputs (from `output/mc_draws.txt` / `output/rho_summary.txt`):

- $\bar\rho \approx 0.684$ so $1 - \bar\rho^2 \approx 0.532$
- empirical $\mathrm{Var}(\hat\tau_{\text{cuped}}) / \mathrm{Var}(\hat\tau_{\text{naive}}) \approx 0.520$ for `cuped_one_theta`
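In practice the empirical variance ratio and $1 - \bar\rho^2$ agree only approximately, because $\hat\theta$ is estimated in each replication and both quantities carry Monte Carlo error.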
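The comparison between the empirical variance ratio and $1 - \bar\rho^2$ can be reproduced on toy data. The sketch below uses a hypothetical linear DGP (not the repository's simulator), computes the pooled correlation in each replication, and compares $1 - \bar\rho^2$ with the Monte Carlo variance ratio of pooled-slope CUPED versus naive.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def contrast(y, w):
    t = [yi for yi, wi in zip(y, w) if wi == 1]
    c = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(t) - mean(c)

def replicate(rng, n=400, true_ate=0.5):
    """One toy RCT: returns (naive, cuped, pooled corr of post and pre)."""
    w = [rng.randint(0, 1) for _ in range(n)]
    base = [rng.gauss(0, 1) for _ in range(n)]
    y_pre = [b + rng.gauss(0, 0.5) for b in base]
    y_post = [b + true_ate * wi + rng.gauss(0, 0.5) for b, wi in zip(base, w)]
    c_xy, v_x, v_y = cov(y_post, y_pre), cov(y_pre, y_pre), cov(y_post, y_post)
    theta = c_xy / v_x
    rho = c_xy / (v_x * v_y) ** 0.5
    tau_naive = contrast(y_post, w)
    tau_cuped = tau_naive - theta * contrast(y_pre, w)
    return tau_naive, tau_cuped, rho

rng = random.Random(8)
rows = [replicate(rng) for _ in range(300)]
naive, cuped, rhos = (list(col) for col in zip(*rows))

rho_bar = mean(rhos)
predicted_ratio = 1 - rho_bar ** 2
var_ratio = cov(cuped, cuped) / cov(naive, naive)
print(f"rho_bar={rho_bar:.3f}  1-rho^2={predicted_ratio:.3f}  "
      f"empirical ratio={var_ratio:.3f}")
```

In this toy DGP the pooled correlation also absorbs the variance that the treatment effect adds to $y_{\text{post}}$, so a small gap between the two numbers is expected even with many replications.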


