This repository trains and evaluates uplift / heterogeneous treatment effect models on a simulated digital-ads-style dataset. Each user has intent (Beta-distributed) and context (Gaussian), a binary treatment T, and a binary conversion Y (1 = converted). By default, treatment is assigned like an RCT (50/50, independent of features). Setting TREATMENT_PROB_SLOPE > 0 in config.py switches to biased assignment that mimics targeted outreach.
The simulator defines a known per-user treatment effect τ (see Simulation). That makes it possible to score models on observed outcomes (IPW-adjusted) and on oracle benchmarks that use the true τ.
- Compare meta-learners (T-, X-, DR-, R-Learner) that predict per-unit uplift against a random ranking baseline.
- Report ranking quality (Qini-style curves and excess AUC vs random), policy value in the top-scored slice, calibration of predictions against Hajek IPW effects, and agreement with true τ where available.
- Support reproducible holdout Monte Carlo evaluation (multiple train/test splits with stratified treatment).
Each user’s conversion probability is a baseline (without treatment) plus uplift when treated: P(Y=1 | X, T) = clip(μ(X) + T·τ(X), 0, 1), with Y ~ Bernoulli(P). Baseline conversion probability μ(X) depends on intent and context (OUTCOME_* in config.py). Treatment assignment depends on intent only; τ(X) can also use context, so who benefits most is not always aligned with who gets treated.
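The outcome step above can be sketched as follows, assuming μ(X) and τ(X) are already available as arrays. The function name `sample_conversions` is hypothetical and is not taken from `data/simulate_ads.py`.

```python
import numpy as np

def sample_conversions(mu, tau, t, rng, clip_min=0.0, clip_max=1.0):
    """Draw Y ~ Bernoulli(clip(mu + t * tau, clip_min, clip_max))."""
    p = np.clip(np.asarray(mu) + np.asarray(t) * np.asarray(tau), clip_min, clip_max)
    y = rng.binomial(1, p)  # one Bernoulli draw per user
    return y, p

rng = np.random.default_rng(0)
mu = np.array([0.01, 0.02, 0.99])   # baseline conversion probabilities
tau = np.array([0.02, 0.03, 0.05])  # per-user uplift
t = np.array([0, 1, 1])             # treatment indicators
y, p = sample_conversions(mu, tau, t, rng)
# The third user has mu + tau = 1.04, so the probability is clipped to 1.0.
```

Clipping only matters when μ(X) + τ(X) leaves [0, 1]; with the small default coefficients this should be rare.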
True treatment effect τ(X), implemented in config.cate, is what learners should recover.
Coefficients are set in config.py (Simulator knobs).
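The exact expression lives in config.cate and is not reproduced here. As a hedged sketch only, one form consistent with the `CATE_*` coefficient names and defaults is linear in intent and context, with an interaction term and an optional threshold bonus. The functional form, the name `cate_sketch`, and the strict `>` comparison are all assumptions.

```python
import numpy as np

# Defaults copied from the CATE_* table under Simulator knobs.
CATE_INTERCEPT = 0.02
CATE_INTENT_SLOPE = 0.02
CATE_CONTEXT_SLOPE = 0.02
CATE_INTENT_CONTEXT_COEF = 0.01
CATE_CONTEXT_THRESHOLD = 0.5
CATE_CONTEXT_THRESHOLD_BONUS = 0.0

def cate_sketch(intent, context):
    """Assumed form of tau(X); the real config.cate may differ."""
    intent = np.asarray(intent, dtype=float)
    context = np.asarray(context, dtype=float)
    # Bonus is zero by default, so the threshold term is disabled.
    bonus = CATE_CONTEXT_THRESHOLD_BONUS * (context > CATE_CONTEXT_THRESHOLD)
    return (CATE_INTERCEPT
            + CATE_INTENT_SLOPE * intent
            + CATE_CONTEXT_SLOPE * context
            + CATE_INTENT_CONTEXT_COEF * intent * context
            + bonus)

tau = cate_sketch(intent=[0.0, 0.5, 0.5], context=[0.0, 0.0, 1.0])
# Under this assumed form: 0.02, 0.03, and 0.055 respectively.
```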
- Treatment is binary: each user has `T = 0` (no treatment) or `T = 1` (treated).
- Users respond differently: uplift is heterogeneous, so treatment lift varies across users.
- `τ` is the per-user causal effect: `τ(x) = E[Y(1) - Y(0) | X=x]`, the expected conversion lift from treatment at profile `x`.
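Who receives the treatment (e.g. sees the ad) is drawn from a Bernoulli distribution whose probability P(T=1) is set by TREATMENT_PROB_INTERCEPT and TREATMENT_PROB_SLOPE applied to intent.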
The intent term is treatment selection bias: when TREATMENT_PROB_SLOPE > 0, users with higher intent are more likely to be treated, so the treated and control groups differ on intent before outcomes are compared. When TREATMENT_PROB_SLOPE = 0, that bias is turned off—everyone shares the same P(T=1) (an RCT). context does not enter assignment, but it can still enter uplift τ(X), so who benefits most from treatment need not match who was selected for treatment.
| Mode | Settings | Selection bias on intent |
|---|---|---|
| RCT (default) | TREATMENT_PROB_SLOPE = 0, TREATMENT_PROB_INTERCEPT = 0.5 | None: 50/50 for all users. Treated and control are comparable on intent. |
| Biased / observational | TREATMENT_PROB_SLOPE > 0 | Present: higher intent → higher P(T=1), as in targeted campaigns. IPW and propensity-based learners adjust for this. |
See Simulator knobs for parameter defaults.
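To illustrate the two modes, here is a minimal sketch that assumes a linear propensity `intercept + slope * intent`, clipped to [0, 1]. The linear form is inferred from the knob descriptions; the clipping, the function name `assign_treatment`, and the biased-mode values are assumptions chosen for illustration.

```python
import numpy as np

def assign_treatment(intent, rng, intercept=0.5, slope=0.0):
    """Draw T ~ Bernoulli(p), with p = intercept + slope * intent (assumed form)."""
    p = np.clip(intercept + slope * np.asarray(intent, dtype=float), 0.0, 1.0)
    return rng.binomial(1, p), p

rng = np.random.default_rng(0)
intent = rng.beta(2, 5, size=20_000)

# RCT default: every user has the same probability of treatment.
t_rct, p_rct = assign_treatment(intent, rng)
# Biased mode with illustrative (non-default) values.
t_biased, p_biased = assign_treatment(intent, rng, intercept=0.3, slope=0.6)

# Mean intent gap between treated and control: near zero in the RCT,
# positive when high-intent users are more likely to be treated.
gap_rct = intent[t_rct == 1].mean() - intent[t_rct == 0].mean()
gap_biased = intent[t_biased == 1].mean() - intent[t_biased == 0].mean()
```

The positive gap in the biased run is the selection bias that IPW and the propensity-based learners are meant to adjust for.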
- Outcome is sampled from baseline conversion probability plus `T · τ(X)` (see Simulation).
- In evaluation, models output uplift scores on holdout users.
- A policy is applied by treating the top-scored fraction (default 20%).
- Policy effect is measured in two ways: Observed/IPW (Hajek-IPW estimate from observed `Y` and `T`) and True (mean simulator `τ` in that top slice, as an oracle-style check).
- T-Learner
  - Trains one outcome model on treated users and one on control users.
  - Predicted uplift: `P(Y=1 | X, T=1) - P(Y=1 | X, T=0)`.
- X-Learner
  - First learns treated/control outcome models.
  - Builds imputed treatment effects for each group.
  - Learns effect models from those imputed targets, then combines them with propensity weights: `τ̂(x) = ê(x) τ̂₀(x) + (1 − ê(x)) τ̂₁(x)` (more weight on the surface trained where assignment is more likely).
- DR-Learner
  - Learns propensity `e(X)` and outcome models `μ1(X)`, `μ0(X)`.
  - Builds a doubly robust pseudo-outcome and regresses it to estimate `τ(X)`.
  - Typically more robust to nuisance-model error than simpler approaches.
- R-Learner
  - Residualizes outcome and treatment:
    - outcome residual: `Y - m(X)`
    - treatment residual: `T - e(X)`
  - Learns `τ(X)` from the residual-on-residual signal, with weighting by `(T-e(X))^2`.
- Random baseline
  - Does not learn treatment effect.
  - Uses random scores (Gaussian with `σ` matched to SD(τ) under the simulator) for ranking and Qini curves.
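The learner descriptions above reduce to a few array formulas once the nuisance predictions exist. The sketch below shows the standard textbook targets, with model fitting omitted and nuisance values passed in as arrays. All function names are hypothetical, and the repository's implementation may differ in detail.

```python
import numpy as np

def t_learner_uplift(mu1, mu0):
    """Difference of the treated and control outcome predictions."""
    return mu1 - mu0

def x_learner_targets(y, t, mu1, mu0):
    """Imputed effects: treated rows use Y - mu0(X), control rows use mu1(X) - Y."""
    return np.where(t == 1, y - mu0, mu1 - y)

def x_learner_blend(tau0, tau1, e):
    """Propensity-weighted combination of the two effect models."""
    return e * tau0 + (1.0 - e) * tau1

def dr_pseudo_outcome(y, t, mu1, mu0, e):
    """Doubly robust (AIPW) pseudo-outcome, later regressed on X."""
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1.0 - e)

def r_learner_targets(y, t, m, e):
    """Residual-on-residual target and its (T - e)^2 regression weight."""
    t_res = t - e
    return (y - m) / t_res, t_res ** 2

# Four toy rows with constant nuisance predictions.
y = np.array([1, 0, 0, 1])
t = np.array([1, 1, 0, 0])
mu1, mu0, e = np.full(4, 0.06), np.full(4, 0.03), np.full(4, 0.5)
m = e * mu1 + (1 - e) * mu0  # marginal outcome model E[Y | X]

uplift = t_learner_uplift(mu1, mu0)
imputed = x_learner_targets(y, t, mu1, mu0)
pseudo = dr_pseudo_outcome(y, t, mu1, mu0, e)
r_target, r_weight = r_learner_targets(y, t, m, e)
```

Individual pseudo-outcomes are very noisy for binary Y (values near ±1.9 here against a true effect near 0.03), which is why the final regression step matters.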
Ranking: models are ordered by Qini Δ (tie-break: Corr (true)). Policy metrics use the top 20% by predicted uplift (DEFAULT_POLICY_TOP_K). Propensity ê(X) for IPW is fit on train only.
| Metric | Meaning |
|---|---|
| Qini raw | Area under the curve: IPW treatment effect in the top f fraction of users, as f increases (higher = better ranking on observed data). |
| Qini Δ | Qini raw minus the median from random rankings (100 draws per split). Random baseline: Δ = 0. |
| Policy (IPW obs) | IPW treatment effect in the top 20% (from Y, T only). |
| Policy (true τ) | Mean simulator τ in that same top 20%. |
| Oracle policy value | Mean τ if the top 20% were chosen by true τ (best possible). |
| Regret (true τ) | Oracle policy value − policy (true τ). |
| Corr (true) | Correlation between predicted uplift and true τ on the full test set. |
| Avg uplift | Mean predicted uplift on the test set. |
- Observed columns use only `Y` and `T` (as in production data).
- True columns use simulator τ (only available because data are simulated).
More detail: Appendix: metrics.
Report below uses config.py defaults: N_SAMPLES_DEFAULT = 50_000, RCT (TREATMENT_PROB_SLOPE = 0), 100 holdout splits (MONTE_CARLO_SPLITS). Random-null Qini median: 0.0248. Oracle top-20% benchmark: mean τ = 0.0585. Random baseline score scale: SD(τ) ≈ 0.0232.
| Rank | Model | Qini Δ | Qini raw | Policy (IPW) | Policy (true τ) | Regret | Avg uplift | Corr (true) |
|---|---|---|---|---|---|---|---|---|
| 1 | X-Learner | 0.0126 | 0.0374 | 0.0468 | 0.0541 | 0.0044 | 0.0265 | 0.8141 |
| 2 | DR-Learner | 0.0120 | 0.0368 | 0.0455 | 0.0531 | 0.0054 | 0.0266 | 0.7224 |
| 3 | T-Learner | 0.0120 | 0.0369 | 0.0460 | 0.0532 | 0.0053 | 0.0264 | 0.4455 |
| 4 | R-Learner | 0.0118 | 0.0366 | 0.0455 | 0.0529 | 0.0056 | 0.0261 | 0.7194 |
| 5 | Random | 0.0000 | 0.0248 | 0.0261 | 0.0258 | 0.0327 | 0.0001 | 0.0011 |
- Meta-learners beat random clearly. Random Qini raw is 0.0248 (chance-level curve area); learners reach ~0.037 (Qini Δ ~0.012). Random has ~zero Corr (true) and the largest regret (~0.033).
- X-Learner leads on ranking metrics. Highest Qini Δ, Policy (IPW), policy (true τ), and Corr (true) (~0.81). Regret is lowest (~0.004).
- Why X has the highest Corr (true): τ depends on intent, context, and their interaction; X-Learner imputes uplift in each arm and blends two τ models (Models), which tracks that structure over the full holdout. DR/R regress one noisy pseudo-outcome per row; T-Learner differences two binary outcome surfaces—so global ranking is weaker (~0.72 / ~0.45) even when top-20% policy is close to X’s.
- DR, T, and R cluster below X. Similar Qini Δ (~0.012) and policy (true τ) (~0.053); T-Learner lags on Corr (true) (~0.45) while DR/R are ~0.72.
- Campaign-style metrics separate models more than before. Policy (IPW) ranges ~0.046–0.047 for top learners vs 0.026 for random, which is useful when the product fixes a top-20% budget (`DEFAULT_POLICY_TOP_K`).
- Oracle gap remains. Perfect top-20% targeting would mean τ ≈ 0.0585; the best model reaches ~0.054, leaving about 0.004 regret.
- Score scales overlap. Avg uplift ~0.026 across learners; gains come from ordering, not very different mean predictions.
What it shows: How much treatment effect appears in the top f fraction of users, as f grows. Colored lines use Hajek IPW on observed outcomes. The shaded band is what random rankings usually achieve; the thin dark line is the oracle (sort by true τ, plot mean τ—not IPW).
What to notice: Learner curves sit above the shaded band but below the oracle. That matches the table: better than random, not perfect. The dashed line (median of many random rankings) and the solid “Random” line (the RandomPolicy baseline) are two different references; both sit below the learners.
What it shows: How predicted uplift scores are spread for each model.
What to notice: Meta-learners overlap in a similar range (~0.02–0.03). Random is centered near zero. The models disagree mainly on ordering, not on very different average scores.
What it shows: Hajek IPW treatment effect in each decile of predicted uplift (lowest decile → highest).
What to notice: A useful model’s curve rises toward the right: users scored as high-uplift should show higher observed lift. Flat or jumpy deciles mean the score is a weak guide on real data, even when Corr (true) looks reasonable.
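A decile calibration check of this kind can be sketched as below. The toy data use deliberately strong heterogeneity so the trend is visible at small sample sizes; they are not the repository's simulator, and the function names are hypothetical.

```python
import numpy as np

def hajek_ipw_effect(y, t, e):
    """Self-normalized IPW difference between treated and control outcomes."""
    w1 = t / e
    w0 = (1 - t) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def decile_effects(scores, y, t, e, n_bins=10):
    """IPW effect within each score bin, ordered from lowest to highest score."""
    order = np.argsort(scores)            # lowest predicted uplift first
    bins = np.array_split(order, n_bins)  # near-equal-sized groups
    return np.array([hajek_ipw_effect(y[b], t[b], e[b]) for b in bins])

# Toy RCT: uplift grows linearly with a single feature x.
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)
tau = 0.6 * x
t = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 0.2 + t * tau)
e = np.full(n, 0.5)

effects = decile_effects(tau, y, t, e)  # scoring by the true effect
```

With a well-ordered score, `effects` should rise from the first bin to the last; a flat profile would indicate the score carries little ranking information.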
Run `python3 main.py`. This fits learners, aggregates metrics across MONTE_CARLO_SPLITS holdouts (see config.py), prints the leaderboard, and updates the plots under Evaluation report.
Short examples use round numbers from the evaluation table.
- Sort holdout users by predicted uplift (highest first).
- For each target fraction f (e.g. top 10%, 20%, …), estimate the treatment effect in that slice using Hajek IPW on `Y` and `T`.
- Plot effect vs f → Qini curve. Qini raw is the area under that curve.
Example: If the top 10% slice shows a +0.03 IPW effect and the top 50% shows +0.02, a good model’s curve tends to be high early (strong effect while targeting only the best users). Qini raw summarizes that shape in one number.
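The three steps can be sketched as follows. The fraction grid mirrors the `QINI_FRAC_MIN`, `QINI_FRAC_MAX`, and `QINI_N_BINS` defaults, while the trapezoid rule for the area and the toy data are assumptions, not the repository's exact procedure.

```python
import numpy as np

def hajek_ipw_effect(y, t, e):
    """Self-normalized IPW difference between treated and control outcomes."""
    w1 = t / e
    w0 = (1 - t) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def qini_curve(scores, y, t, e, frac_min=0.05, frac_max=1.0, n_bins=20):
    """Return (fractions, IPW effect in each top-f prefix, trapezoid area)."""
    order = np.argsort(-scores)  # highest score first
    fracs = np.linspace(frac_min, frac_max, n_bins)
    effects = np.empty(n_bins)
    for i, f in enumerate(fracs):
        top = order[: int(round(f * len(scores)))]
        effects[i] = hajek_ipw_effect(y[top], t[top], e[top])
    area = float(np.sum((effects[1:] + effects[:-1]) / 2 * np.diff(fracs)))
    return fracs, effects, area

# Toy RCT with strong heterogeneity (not the repository's simulator).
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)
tau = 0.6 * x
t = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 0.2 + t * tau)
e = np.full(n, 0.5)

_, effects_model, area_model = qini_curve(tau, y, t, e)
_, _, area_random = qini_curve(rng.normal(size=n), y, t, e)
```

A good ranking yields a curve that starts high and decays toward the overall average effect at f = 1, so its area exceeds that of a random ordering.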
Random rankings still get a nonzero area by chance. Qini Δ subtracts the median Qini raw of many random rankings (100 per split).
Example: If Qini raw = 0.0374 for a learner but a meaningless ranking typically yields Qini raw ≈ 0.0248, then Qini Δ = 0.0126 — a clear edge beyond chance.
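A sketch of the null adjustment, assuming random rankings are produced by scoring users with independent Gaussian noise. How the repository draws its null rankings is not stated here, so treat that choice, the function names, and the toy data as assumptions.

```python
import numpy as np

def hajek_ipw_effect(y, t, e):
    """Self-normalized IPW difference between treated and control outcomes."""
    w1 = t / e
    w0 = (1 - t) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def qini_raw(scores, y, t, e, fracs):
    """Trapezoid area under the IPW effect across top-f prefixes."""
    order = np.argsort(-scores)
    sizes = np.round(fracs * len(scores)).astype(int)
    effects = np.array([hajek_ipw_effect(y[order[:k]], t[order[:k]], e[order[:k]])
                        for k in sizes])
    return float(np.sum((effects[1:] + effects[:-1]) / 2 * np.diff(fracs)))

def qini_delta(scores, y, t, e, fracs, rng, n_draws=100):
    """Qini raw minus the median Qini raw over random rankings."""
    null_areas = [qini_raw(rng.normal(size=len(scores)), y, t, e, fracs)
                  for _ in range(n_draws)]
    return qini_raw(scores, y, t, e, fracs) - float(np.median(null_areas))

# Toy RCT with strong heterogeneity (not the repository's simulator).
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)
tau = 0.6 * x
t = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 0.2 + t * tau)
e = np.full(n, 0.5)
fracs = np.linspace(0.05, 1.0, 20)

delta_model = qini_delta(tau, y, t, e, fracs, rng)
delta_random = qini_delta(rng.normal(size=n), y, t, e, fracs, rng)
```

Using the median rather than the mean of the null areas keeps the reference robust to an occasional extreme random draw.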
IPW reweights treated and control rows using propensity ê(X) so comparisons are less driven by who was selected for treatment. That matters most when TREATMENT_PROB_SLOPE > 0 (Treatment assignment). With the default RCT, groups are already balanced on intent; IPW is still applied for a consistent definition across runs.
Hajek normalization rescales weights so treated and control weights each sum to 1, which stabilizes the estimate.
Example: If a high-intent user was treated but ê(X) = 0.8 (likely to be treated anyway), that row gets less weight than a treated user with ê(X) = 0.2, so the comparison is not dominated by “easy to predict” assignments.
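A minimal Hajek IPW estimator looks like the sketch below. The clipping bounds follow the `PROPENSITY_CLIP_LOW` and `PROPENSITY_CLIP_HIGH` defaults; the function name and toy arrays are illustrative only.

```python
import numpy as np

def hajek_ipw_effect(y, t, e, clip_low=0.01, clip_high=0.99):
    """Hajek (self-normalized) IPW estimate of E[Y(1)] - E[Y(0)]."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    e = np.clip(np.asarray(e, dtype=float), clip_low, clip_high)
    w1 = t / e                  # treated rows weighted by 1 / e(X)
    w0 = (1.0 - t) / (1.0 - e)  # control rows weighted by 1 / (1 - e(X))
    # Dividing by the weight sums makes each arm a weighted average.
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

y = np.array([1, 0, 1, 0, 0, 0])
t = np.array([1, 1, 1, 0, 0, 0])
e = np.array([0.8, 0.8, 0.2, 0.8, 0.2, 0.2])

effect_biased = hajek_ipw_effect(y, t, e)
effect_rct = hajek_ipw_effect(y, t, np.full(6, 0.5))
```

In the toy arrays, the treated user with ê(X) = 0.2 carries weight 5 while each treated user with ê(X) = 0.8 carries weight 1.25, matching the example above. With a constant propensity the estimator reduces to a plain difference in means.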
Policy metrics fix one decision rule: treat the top 20% by the model’s score.
Example: Oracle policy value = 0.0585 — if the top 20% were picked using true τ, mean lift in that group would be about 5.9 percentage points. X-Learner reaches policy (true τ) = 0.0541 with regret = 0.0044. Random reaches only 0.0258 with regret 0.0327.
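The true-τ policy metrics can be sketched in a few lines, assuming the true effect is available for every holdout user as in the simulator. The function names and the ten-user toy arrays are hypothetical.

```python
import numpy as np

def top_k_mask(scores, top_k=0.2):
    """Boolean mask selecting the top_k fraction of users by score."""
    n_top = max(1, int(round(top_k * len(scores))))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(-scores)[:n_top]] = True
    return mask

def policy_true_value(scores, tau, top_k=0.2):
    """Mean true effect among users the policy would treat."""
    return float(tau[top_k_mask(scores, top_k)].mean())

def regret(scores, tau, top_k=0.2):
    """Oracle policy value minus the policy's true value."""
    return policy_true_value(tau, tau, top_k) - policy_true_value(scores, tau, top_k)

tau = np.array([0.05, 0.01, 0.04, 0.02, 0.03, 0.00, 0.06, 0.02, 0.01, 0.03])
scores = np.array([0.1, 0.0, 0.9, 0.2, 0.3, 0.0, 1.0, 0.1, 0.2, 0.4])

oracle_value = policy_true_value(tau, tau)    # picks users with tau 0.06 and 0.05
model_value = policy_true_value(scores, tau)  # picks users with tau 0.06 and 0.04
model_regret = regret(scores, tau)
```

Regret is never negative, because no ranking can select a top slice with a higher mean τ than sorting by τ itself.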
Corr (true) checks alignment over all test users, not only the top 20%.
Example: Corr (true) = 0.814 for X-Learner vs 0.445 for T-Learner — both can have similar Qini Δ (~0.012) while global agreement with true τ differs a lot.
All values below are the defaults in config.py. Change them there and re-run python3 main.py to regenerate data, metrics, and figures. Implementation: data/simulate_ads.py (sampling and Y), config.cate() (true τ).
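Data generating process (per user)
intent ~ Beta(BETA_INTENT_A, BETA_INTENT_B), context ~ Normal(CONTEXT_MEAN, CONTEXT_STD), and P(Y=1 | X, T) = clip(μ(X) + T·τ(X), PROB_CLIP_MIN, PROB_CLIP_MAX), where conversion ~ Binomial(1, P(Y=1)).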
| Constant | Default | Role |
|---|---|---|
| `N_SAMPLES_DEFAULT` | 50_000 | Rows simulated per run. |
| `BETA_INTENT_A`, `BETA_INTENT_B` | 2, 5 | intent ~ Beta(a, b) (skewed toward 0). |
| `CONTEXT_MEAN`, `CONTEXT_STD` | 0.0, 1.0 | context ~ Normal(mean, std). |
| `PROB_CLIP_MIN`, `PROB_CLIP_MAX` | 0.0, 1.0 | Clip conversion probability into valid range. |
Treatment assignment (depends on intent only; these constants do not enter τ)
| Constant | Default | Role |
|---|---|---|
| `TREATMENT_PROB_INTERCEPT` | 0.5 | When SLOPE = 0, this is P(T=1) for everyone (default: 50% treated). When SLOPE > 0, treated probability at intent = 0. |
| `TREATMENT_PROB_SLOPE` | 0.0 | 0 = RCT (assignment does not depend on intent). > 0 = higher intent → more likely treated. See Treatment assignment. |
Baseline outcome μ(X) (conversion probability without treatment)
| Constant | Default | Role |
|---|---|---|
| `OUTCOME_BASE` | 0.01 | Baseline conversion probability at intent=context=0, T=0. |
| `OUTCOME_INTENT_COEF` | 0.01 | Higher intent → higher baseline conversion. |
| `OUTCOME_CONTEXT_COEF` | 0.01 | Higher context → higher baseline conversion. |
CATE_* — coefficients in the τ formula (config.cate)
| Constant | Default | Role |
|---|---|---|
| `CATE_INTERCEPT` | 0.02 | Base uplift when other terms are zero. |
| `CATE_INTENT_SLOPE` | 0.02 | Intent-driven heterogeneity in lift. |
| `CATE_CONTEXT_SLOPE` | 0.02 | Context-driven lift (context not in propensity). |
| `CATE_INTENT_CONTEXT_COEF` | 0.01 | Intent × context interaction in lift. |
| `CATE_CONTEXT_THRESHOLD` | 0.5 | Threshold on context for bonus term. |
| `CATE_CONTEXT_THRESHOLD_BONUS` | 0.0 | Extra uplift when context exceeds threshold (0 = disabled). |
Evaluation / models (not part of the DGP, but commonly tuned)
| Constant | Default | Role |
|---|---|---|
| `MONTE_CARLO_SPLITS` | 100 | Holdout replicates for reported metrics. |
| `HOLDOUT_TEST_SIZE` | 0.4 | Test fraction per split. |
| `DEFAULT_POLICY_TOP_K` | 0.2 | Top fraction for policy / oracle policy value. |
| `QINI_FRAC_MIN`, `QINI_FRAC_MAX` | 0.05, 1.0 | Smallest / largest targeted fraction on the Qini curve (top 5% → 100%). |
| `QINI_N_BINS` | 20 | Points along the Qini fraction grid. |
| `QINI_MIN_PREFIX_SAMPLES` | 500 | Minimum users in each Qini prefix (stabilizes IPW). |
| `METRIC_DECIMALS` | 4 | Decimal places in the printed report. |
| `GB_*` / `GRADIENT_BOOSTING_PARAMS` | see config.py | Gradient boosting hyperparameters for all learners. |
| `PROPENSITY_CLIP_LOW`, `PROPENSITY_CLIP_HIGH` | 0.01, 0.99 | Clip ê(X) for IPW stability. |
| `QINI_NULL_DRAWS`, `QINI_PLOT_NULL_BAND_DRAWS` | 100, 100 | Random-ranking null for Qini Δ and plot band. |
RANDOM_POLICY_SCORE_STD is computed automatically as SD(τ) under the simulator (Monte Carlo over 100k draws; ≈ 0.0232 with current CATE_*), used by the random baseline.
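The Monte Carlo step can be sketched as below. The τ expression is an assumed linear-plus-interaction form built from the `CATE_*` defaults (threshold bonus disabled), not a copy of config.cate, and `cate_sketch` is a hypothetical name.

```python
import numpy as np

def cate_sketch(intent, context):
    """Assumed tau(X) using the default CATE_* coefficients; config.cate may differ."""
    return 0.02 + 0.02 * intent + 0.02 * context + 0.01 * intent * context

rng = np.random.default_rng(0)
n_draws = 100_000

# Feature distributions from the Simulator knobs defaults.
intent = rng.beta(2, 5, size=n_draws)
context = rng.normal(0.0, 1.0, size=n_draws)

# Spread of the true effect across simulated users.
random_policy_score_std = float(np.std(cate_sketch(intent, context)))

# Random baseline scores: zero-mean Gaussian noise on the same scale as tau.
random_scores = rng.normal(0.0, random_policy_score_std, size=1_000)
```

Under this assumed form the standard deviation comes out close to 0.023, in line with the value quoted above, although that agreement alone does not confirm the form.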


