MLBC
is an R package for correcting bias and performing valid inference in regressions that include variables generated by AI/ML methods. The bias-correction methods are described in Battaglia, Christensen, Hansen & Sacher (2024).
MLBC
runs on R 3.5 or above and uses TMB
. It can be installed from CRAN by running
To install the package, run
pip install ValidMLInference
in your R console.
To get started, we recommend looking at the following examples and resources:
- Remote Work: This notebook estimates the association between working from home and salaries using real-world job postings data (Hansen et al., 2023). It illustrates how the functions
ols_bca
,ols_bcm
andone_step
can be used to correct bias from regressing on AI/ML-generated labels. The notebook reproduces results from Table 1 of Battaglia, Christensen, Hansen & Sacher (2024). - Topic Models: This notebook estimates the association between CEO time allocation and firm performance (Bandiera et al. 2020). It illustrates how the functions
ols_bca_topic
andols_bcm_topic
can be used to correct bias from estimated topic model shares. The notebook reproduces results from Table 2 of Battaglia, Christensen, Hansen & Sacher (2024). - Synthetic Example: A synthetic example comparing the performance of different bias-correction methods in the context of AI/ML-generated labels.
- Manual: A detailed reference describing all available functions, optional arguments, and usage tips.
Code below compares coefficients obtained by ordinary least squares methods and those obtained by the one_step
approach, when used on variables subject to classification error. We can see that the 95% confidence interval generated by one_step
contains the true parameter of 2, whereas the standard ols approach doesn't.
library(MLBC)
# Generate synthetic data with mislabeling
n <- 1000
true_effect <- 2.0
# True treatment assignment
X_true <- rbinom(n, 1, 0.5)
# Observed (mislabeled) treatment with 20% error rate
mislabel_prob <- 0.2
X_obs <- X_true
mislabel_mask <- rbinom(n, 1, mislabel_prob) == 1
X_obs[mislabel_mask] <- 1 - X_obs[mislabel_mask]
# Generate outcome with true treatment effect
Y <- 1.0 + true_effect * X_true + rnorm(n, 0, 1)
# Create DataFrame
data <- data.frame(Y = Y, X_obs = X_obs)
# Naive OLS using mislabeled data
ols_result <- ols(Y ~ X_obs, data = data)
print("OLS Results (using mislabeled data):")
#> [1] "OLS Results (using mislabeled data):"
print(summary(ols_result))
#>
#> MLBC Model Summary
#> ==================
#>
#> Formula: Y ~ Beta_0 + Beta_1 * X_obs
#>
#>
#> Coefficients:
#>
#> Estimate Std.Error z.value Pr(>|z|) Signif 95% CI
#> Beta_0 1.3346 0.0568 23.4937 < 2e-16 *** [1.2233, 1.4459]
#> Beta_1 1.2471 0.0809 15.4229 < 2e-16 *** [1.0886, 1.4056]
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# One-step estimator that corrects for mislabeling
one_step_result <- one_step(Y ~ X_obs, data = data)
print("\nOne-Step Results (correcting for mislabeling):")
#> [1] "\nOne-Step Results (correcting for mislabeling):"
print(summary(one_step_result))
#>
#> MLBC Model Summary
#> ==================
#>
#> Formula: Y ~ Beta_0 + Beta_1 * X_obs
#>
#> Number of observations: 1000
#> Log-likelihood: -2344.289
#>
#> Coefficients:
#>
#> Estimate Std.Error z.value Pr(>|z|) Signif 95% CI
#> Beta_0 0.9443 0.0852 11.0868 < 2e-16 *** [0.7774, 1.1113]
#> Beta_1 1.9803 0.1009 19.6202 < 2e-16 *** [1.7825, 2.1781]
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extract confidence intervals
ols_ci <- confint(ols_result)["X_obs", ]
one_step_ci <- confint(one_step_result)["X_obs", ]
cat("\nTrue treatment effect:", true_effect, "\n")
#>
#> True treatment effect: 2
cat("OLS 95% CI contains true value:",
ols_ci[1] <= true_effect && true_effect <= ols_ci[2], "\n")
#> OLS 95% CI contains true value: FALSE
cat("One-step 95% CI contains true value:",
one_step_ci[1] <= true_effect && true_effect <= one_step_ci[2], "\n")
#> One-step 95% CI contains true value: TRUE