This repository hosts the R package that implements the post-prediction adaptive inference (pspa) method described in the paper: Assumption-Lean and Data-Adaptive Post-Prediction Inference.
pspa provides valid and powerful inference based on ML predictions for parameters defined through estimating equations.
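As a rough sketch in generic M-estimation notation (not taken from the paper, which gives the exact estimating functions), the target parameter $\theta$ is defined as the solution to

$$\mathbb{E}\left[\psi(Z; \theta)\right] = 0,$$

where, writing $Z = (X, Y)$, $\psi(Z; \theta) = Y - \theta$ gives the mean and $\psi(Z; \theta) = X(Y - X^\top \theta)$ gives the OLS coefficients.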
# install.packages("devtools")
devtools::install_github("qlu-lab/pspa")
Here are examples of pspa with ML-predicted labels for M-estimation tasks, including mean estimation, linear regression, logistic regression, and Poisson regression. The main function is pspa_y(), whose method argument selects the task.
# Load the package
library(pspa)
# Load the simulated data
set.seed(999)
data <- sim_data_y()
X_lab = data$X_lab ## Covariates in the labeled data
X_unlab = data$X_unlab ## Covariates in the unlabeled data
Y_lab = data$Y_lab ## Observed outcome in the labeled data
Yhat_lab = data$Yhat_lab ## Predicted outcome in the labeled data
Yhat_unlab = data$Yhat_unlab ## Predicted outcome in the unlabeled data
# Run pspa mean estimation
fit_mean <- pspa_y(Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab,
alpha = 0.05, method = "mean")
print(fit_mean)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# 1 1.623601 0.05514429 1.51552 1.731682 1.557956e-190 0.9226747
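As a quick, informal comparison (not part of the package; it assumes Y_lab is the numeric vector loaded above), the classical estimate that uses only the labeled outcomes can be obtained with base R:
# Classical mean estimate and 95% CI from the labeled data alone
t.test(Y_lab, conf.level = 0.95)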
# Run pspa linear regression
fit_ols <- pspa_y(X_lab = X_lab, X_unlab = X_unlab,
Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab,
alpha = 0.05, method = "ols")
print(fit_ols)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# 1.6227593 0.05385971 1.517196 1.728322 1.998620e-199 0.8741176
# X1 0.8698039 0.07193445 0.728815 1.010793 1.169598e-33 1.0000000
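A labeled-data-only benchmark can be fit in the same way (again not part of the package; this assumes X_lab is a numeric covariate vector or matrix):
# OLS using only the labeled observations, ignoring the ML predictions
summary(lm(Y_lab ~ X_lab))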
# Load the simulated data
set.seed(999)
data <- sim_data_y(binary = TRUE)
X_lab = data$X_lab
X_unlab = data$X_unlab
Y_lab = data$Y_lab
Yhat_lab = data$Yhat_lab
Yhat_unlab = data$Yhat_unlab
# Run pspa logistic regression
fit_logistic <- pspa_y(X_lab = X_lab, X_unlab = X_unlab,
Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab,
alpha = 0.05, method = "logistic")
print(fit_logistic)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# -0.1289001 0.08347881 -0.2925156 0.03471532 1.225626e-01 0.4290559
# X1 0.5749601 0.08653142 0.4053617 0.74455861 3.041970e-11 0.5337078
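For reference, the corresponding labeled-data-only fit (not part of the package; assuming Y_lab is coded 0/1 and X_lab is numeric):
# Logistic regression using only the labeled observations
summary(glm(Y_lab ~ X_lab, family = binomial()))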
# Load the simulated data
set.seed(999)
data <- sim_data_y()
X_lab = data$X_lab
X_unlab = data$X_unlab
Y_lab = round(data$Y_lab - min(data$Y_lab))
Yhat_lab = round(data$Yhat_lab - min(data$Yhat_lab))
Yhat_unlab = round(data$Yhat_unlab - min(data$Yhat_unlab))
# Run pspa Poisson regression
fit_poisson <- pspa_y(X_lab = X_lab, X_unlab = X_unlab,
Y_lab = Y_lab, Yhat_lab = Yhat_lab, Yhat_unlab = Yhat_unlab,
alpha = 0.05, method = "poisson")
print(fit_poisson)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# 0.9732937 0.02261537 0.9289684 1.0176191 0.000000e+00 0.8392517
# X1 0.3188511 0.03125507 0.2575923 0.3801099 1.950752e-24 0.8303991
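The analogous labeled-data-only benchmark (not part of the package; using the rounded counts above and assuming X_lab is numeric):
# Poisson regression using only the labeled observations
summary(glm(Y_lab ~ X_lab, family = poisson()))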
Here are examples of pspa with ML-predicted covariates for M-estimation tasks, including mean estimation, linear regression, logistic regression, and Poisson regression. The main function is pspa_x(), whose method argument selects the task.
# Load the package
library(pspa)
# Load the simulated data
set.seed(999)
data <- sim_data_x(r = 0.5)
X_lab = data$X_lab ## Covariates in the labeled data
Xhat_lab = data$Xhat_lab ## Predicted covariates in the labeled data
Xhat_unlab = data$Xhat_unlab ## Predicted covariates in the unlabeled data
Y_lab = data$Y_lab ## Observed outcome in the labeled data
Y_unlab = data$Y_unlab ## Observed outcome in the unlabeled data
# Run pspa linear regression
fit_ols <- pspa_x(X_lab = X_lab, Xhat_lab = Xhat_lab,
Xhat_unlab = Xhat_unlab, Y_lab = Y_lab, Y_unlab = Y_unlab,
alpha = 0.05, method = "ols")
print(fit_ols)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# 0.503592340 0.009579156 0.484817539 0.52236714 0.0000000 0.7553199
# X1 -0.003066491 0.031646383 -0.065092261 0.05895928 0.9228069 0.3674327
# X2 0.011997846 0.009443294 -0.006510669 0.03050636 0.2039013 0.7553199
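For the predicted-covariate setting, a labeled-data-only benchmark can be obtained from the observed covariates (not part of the package; this assumes X_lab contains the fully observed covariates for the labeled samples as a numeric matrix):
# OLS of the observed outcome on the observed covariates, labeled data only
summary(lm(Y_lab ~ X_lab))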
# Load the simulated data
set.seed(999)
data <- sim_data_x(r = 0.5, binary = TRUE)
X_lab = data$X_lab ## Covariates in the labeled data
Xhat_lab = data$Xhat_lab ## Predicted covariates in the labeled data
Xhat_unlab = data$Xhat_unlab ## Predicted covariates in the unlabeled data
Y_lab = data$Y_lab ## Observed outcome in the labeled data
Y_unlab = data$Y_unlab ## Observed outcome in the unlabeled data
# Run pspa logistic regression
fit_logistic <- pspa_x(X_lab = X_lab, Xhat_lab = Xhat_lab,
Xhat_unlab = Xhat_unlab, Y_lab = Y_lab, Y_unlab = Y_unlab,
alpha = 0.05, method = "logistic")
print(fit_logistic)
# Estimate Std.Error Lower.CI Upper.CI P.value Weight
# 0.01293817 0.03826099 -0.06205199 0.08792834 0.7352459 0.7562264
# X1 -0.01242760 0.12668094 -0.26071767 0.23586247 0.9218516 0.3673090
# X2 0.04379359 0.03775431 -0.03020350 0.11779067 0.2460640 0.7562264
We provide the scripts for the analyses in the pspa paper here.
Please submit an issue or contact Jiacheng ([email protected]) or Xinran ([email protected]) for questions.
Assumption-lean and Data-adaptive Post-Prediction Inference
Valid inference for machine learning-assisted GWAS
- POP-TOOLS (POst-Prediction TOOLS) is a toolkit for conducting valid and powerful machine learning (ML)-assisted genetic association studies. It currently implements POP-GWAS, where statistical and computational methods are optimized for GWAS applications.