- Detailed protocols are in the `protocol` folder.
- If you are interested in how we defined our code lists, look in the `codelists` folder.
- Analysis scripts are in the `analysis` directory:
  - Dataset definition scripts are in the `dataset_definition` directory:
    - If you are interested in how we defined our variables, we use the variable script `variable_helper_fuctions` to define functions that generate variables. We then apply these functions in `variables_cohorts` to create a dictionary of variables for cohort definitions, and in `variables_dates` to create a dictionary of variables for calculating study start dates and end dates.
    - If you are interested in how we defined study dates (e.g., index and end dates), these vary by cohort and are described in the protocol. We use the script `dataset_definition_dates` to generate a dataset with all required dates for each cohort. This script imports all variables generated from `variables_dates`.
    - If you are interested in how we defined our cohorts, we use the dataset definition script `dataset_definition_cohorts` to define a function that generates cohorts. This script imports all variables generated from `variables_cohorts` using the patient's index date, the cohort start date and the cohort end date. This approach is used to generate three cohorts (pre-vaccination, vaccinated, and unvaccinated), found in `dataset_definition_prevax`, `dataset_definition_vax`, and `dataset_definition_unvax`, respectively. For each cohort, the extracted data is initially processed in the preprocess data script, which generates a flag variable for pre-existing respiratory conditions and restricts the data to relevant variables.
  - Dataset cleaning scripts are in the `dataset_clean` directory:
    - This directory also contains all the R scripts that process, describe, and analyse the extracted data.
    - `dataset_clean` is the core script, which executes all the other scripts in this folder.
    - `fn-preprocess` is the function that carries out initial preprocessing and formats columns correctly.
    - `fn-modify_dummy` is called from within `fn-preprocess.R` and alters the proportions of dummy variables to better suit analyses.
    - `fn-inex` is the inclusion/exclusion function.
    - `fn-qa` is the quality assurance function.
    - `fn-ref` is the function that sets the reference levels for factors.
  - Modelling scripts are in the `model` directory:
    - `make_model_input.R` works with the output of `dataset_clean` to prepare suitable data subsets for Cox analysis. It combines each outcome and subgroup in one formatted .rds file.
    - `fn-prepare_model_input.R` is a companion function to `make_model_input.R` which handles the interaction with `active_analyses.rds`.
    - `cox-ipw` is a reusable action which uses the output of `make_model_input.R` to fit a Cox model to the data.
    - `make_model_output.R` combines all the Cox results in one formatted .csv file.
  - The script for generating a random 10% sample of the study population is in the `generate_subsample` directory:
    - `generate_subsample.R` generates the subsample itself. The subsample is drawn at random, but for reproducibility the seed is set in the script.
  - The script for conducting variable selection using a LASSO (least absolute shrinkage and selection operator) model is in the `lasso_var_selection` directory:
    - `lasso_var_selection.R` fits a Cox regression model (`family = "cox"`) to the subsample data (the 10% subsample generated by `generate_subsample.R`), applying a LASSO penalty (`alpha = 1`). The regularisation parameter lambda is tuned using cross-validation (`cv.glmnet`) to minimise `cvm` (the mean cross-validated error). The result is the subset of variables whose coefficients do not shrink to zero. For further information, please see the documentation for the `glmnet` and `cv.glmnet` functions.
  - The script for conducting variable selection using a LASSO X (least absolute shrinkage and selection operator for the exposure) model, which takes the exposure (COVID-19) as the response variable, is in the `lasso_X_var_selection` directory:
    - `lasso_X_var_selection.R` fits a logistic regression (`family = "binomial"`) using binary exposure (X) to COVID-19 as the response variable and excluding the outcomes (Y, acute MI and subarachnoid haemorrhage / haemorrhagic stroke) from the dataset. The model is fit to the subsample data (the 10% subsample generated by `generate_subsample.R`), applying a LASSO penalty (`alpha = 1`). The regularisation parameter lambda is tuned using cross-validation (`cv.glmnet`) to minimise `cvm` (the mean cross-validated error). The result is the subset of variables whose coefficients do not shrink to zero. For further information, please see the documentation for the `glmnet` and `cv.glmnet` functions.
  - The script for conducting variable selection using a Union LASSO (least absolute shrinkage and selection operator) model is in the `lasso_union_var_selection` directory:
    - `lasso_union_var_selection.R` takes the union of the two variable sets selected by `lasso_var_selection.R` and `lasso_X_var_selection.R`.
  - The script which implements the Hartwig et al. (2024) empirical unconfoundedness test is in the `unconfoundedness_test` directory:
    - `unconfoundedness_test.R` performs the empirical unconfoundedness test in the following manner:
      - A Cox regression model taking the outcomes (Y) as the response is fit in the same manner as in `lasso_var_selection.R`.
      - A logistic regression model taking the exposure (X) as the response is fit in the same manner as in `lasso_X_var_selection.R`.
      - These two regression models are used to evaluate associations of each confounder (Z) with the exposure (X) and outcome (Y) in the following manner:
        - For every candidate confounder Z, condition (i) is checked: is Z associated with (i.e., not independent of) X given all other covariates?
        - For every candidate confounder Z, condition (ii) is checked: are Z and Y conditionally independent given X and all other covariates?
        - If any covariate Z satisfies both (i) and (ii), then the covariate set is sufficient for confounding adjustment. If not, then the test is inconclusive.
      - Test conditions, coefficient values, p-values and standard errors are saved for each confounder Z.
- `active_analyses` contains a list of active analyses.
- `project.yaml` defines the run order and dependencies for all the analysis scripts. This file should not be edited directly. To make changes to the yaml, edit and run the `create_project_actions.R` script, which generates all the actions.
- Descriptive and model outputs, including figures and tables, are in the `released_outputs` directory.
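The decision rule of the empirical unconfoundedness test listed above can be sketched as follows. This is only an illustration in Python, not the code in `unconfoundedness_test.R`. The class `CovariateResult`, the function `unconfoundedness_decision`, the example covariates and the use of a 0.05 p-value threshold to decide "associated" versus "conditionally independent" are all assumptions made for this sketch; the README does not state which threshold the script uses.

```python
from dataclasses import dataclass


@dataclass
class CovariateResult:
    """Hypothetical per-covariate summary taken from the two fitted models."""
    name: str
    p_exposure: float  # p-value for Z in the exposure (X) model
    p_outcome: float   # p-value for Z in the outcome (Y) model


def unconfoundedness_decision(results, alpha=0.05):
    """Return the test verdict and the covariates meeting both conditions.

    Assumption: condition (i) holds when p_exposure < alpha, and
    condition (ii) holds when p_outcome >= alpha. Treating a
    non-significant p-value as conditional independence is a simplification.
    """
    witnesses = [
        r.name
        for r in results
        if r.p_exposure < alpha and r.p_outcome >= alpha
    ]
    verdict = "sufficient" if witnesses else "inconclusive"
    return verdict, witnesses


# Made-up values, purely for illustration.
example = [
    CovariateResult("age", p_exposure=0.001, p_outcome=0.002),
    CovariateResult("region", p_exposure=0.010, p_outcome=0.600),
    CovariateResult("sex", p_exposure=0.400, p_outcome=0.030),
]
verdict, witnesses = unconfoundedness_decision(example)
print(verdict, witnesses)
```

In this made-up example only `region` meets both conditions, so the covariate set would be reported as sufficient; with no such covariate the verdict would be inconclusive.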
Outputs follow OpenSAFELY naming conventions related to suppression rules by adding the suffix "_midpoint6". The suffix "_midpoint6_derived" means that the value(s) are derived from the midpoint6 values. Detailed information regarding naming conventions can be found here.
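As a rough illustration of what the "_midpoint6" suffix implies, the sketch below applies midpoint-6 rounding as we understand it from the OpenSAFELY disclosure-control guidance: a count is rounded up to the next multiple of 6 and then shifted down by 3, with zero left as zero. The function name `round_midpoint6` and the example counts are assumptions for this sketch, and the project's own rounding is done in its R scripts, so check the linked guidance before relying on it.

```python
def round_midpoint6(count: int, to: int = 6) -> int:
    """Round a non-negative count to the midpoint of its band of width `to`.

    Assumed rule: ceiling(count / to) * to - floor(to / 2), with 0 kept as 0.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return 0
    rounded_up = ((count + to - 1) // to) * to  # integer ceiling to a multiple of `to`
    return rounded_up - to // 2


# Made-up counts, purely for illustration.
counts = {"exposed": 1, "unexposed": 14, "events": 0}
rounded = {f"{name}_midpoint6": round_midpoint6(n) for name, n in counts.items()}
print(rounded)
```

Under this rule every count from 1 to 6 is reported as 3, every count from 7 to 12 as 9, and so on, so small counts cannot be recovered from the released value.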
The OpenSAFELY framework is a Trusted Research Environment (TRE) for electronic health records research in the NHS, with a focus on public accountability and research quality.
Read more at OpenSAFELY.org.
As standard, research projects have an MIT license.