
Meta-analysis of K and O serotype distributions and coverage, from Klebsiella pneumoniae neonatal sepsis in African and South Asian countries

This repository includes data and code from the paper: "Distribution of capsule and O types in Klebsiella pneumoniae causing neonatal sepsis in Africa and South Asia: meta-analysis of genome-predicted serotype prevalence and potential vaccine coverage" (Stanton et al, 2025).

Inputs to the modelling are in data_*/, model outputs are in outputs_*/, and the fitted models (R objects) are available on Figshare.

This repository includes all R code to run the models, process the outputs, and run diagnostics (*.R files, see details below).

Directories figures/ and tables/ contain all figures and tables from the paper, as generated by R code in the *.Rmd files.


Figures and tables from the paper

Directories figures/ and tables/ contain all figures and tables from the paper, including supporting and appendix files.

The code to generate these is in the following R Markdown files, which draw on functions in seroepi_functions.R:

DataAnalysis.Rmd - R code to generate figures and tables based on line-list data (data in tables/TableS3_sampleInfo.tsv)

DataAnalysis_ModelledEstimates.Rmd - R code to generate figures and tables based on modelled estimates of K/O prevalence (see details of Bayesian modelling below)

DataAnalysis_Longitudinal.Rmd - R code to generate Figure S6, based on longitudinal line-list data from 3 sites (data in tables/longitudinal_data.tsv)

An interactive web application, implemented in R Shiny, is also available to (i) explore the modelled estimates of prevalence and coverage, and (ii) undertake additional analyses of the raw data in Table S3 (e.g. to explore crude pooled estimates of prevalence and coverage for different subsets of loci and/or different subsets of samples).

Model Run File Explanation

File: model_run_file.R

Overview

This R script automates the preparation and analysis of the CSV input data. It loads the essential libraries, lists and validates the files in the designated directories against a specific naming convention, extracts key metadata from the filenames, and merges the data with site information to incorporate region data. It then expands the dataset to include every combination of locus, subgroup, and site (to account for zero observations of a given locus at a given site) and fits a Bayesian generalised linear mixed-effects model using the brms package with a binomial family, generic priors, and specific control parameters, before saving the resulting model as an RDS file named from the extracted metadata.

Dependencies

Ensure you have the following R packages installed (an example install command follows this list):

brms

dplyr

tidyr

ggplot2

ggrepel

tibble

stringr

tidyverse
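
If any are missing, a single call installs them all (note that tidyverse already bundles dplyr, tidyr, ggplot2, stringr and tibble):

```r
# Install the packages loaded by model_run_file.R; tidyverse already bundles
# dplyr, tidyr, ggplot2, stringr and tibble
install.packages(c("brms", "tidyverse", "ggrepel"))
```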

File and Directory Structure

  • Data Files: CSV files are located in data_core/ or data_LOO/. A site_info.csv file must also reside in that directory.
  • Models: The resulting models will be saved in models_core/ or models_LOO/, depending on whether the script detects “core” or “LOO” in the file path (see the sketch below).
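
A minimal sketch of that core/LOO switch, with illustrative variable names rather than the exact ones used in the script:

```r
# Illustrative sketch: derive the model output directory from the data path
data_dir  <- "data_core/"                                   # or "data_LOO/"
purpose   <- if (grepl("LOO", data_dir)) "LOO" else "core"  # detect purpose from the path
model_dir <- paste0("models_", purpose, "/")                # "models_core/" or "models_LOO/"
```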

The file begins by loading the above R packages needed for Bayesian modelling and data manipulation, then:

  1. Lists and Validates Data Files:
    • Searches the data_core/ (or data_LOO/) folder for CSV files.
    • Checks whether these files match a particular naming pattern (stored in the variable "pattern").
    • Flags any files that don’t conform to this format.
  2. Metadata Extraction:
    • For each file, it extracts key information:
      • The type (e.g., "Full", "Carba", "ESBL", "Fatal").
      • Whether the file is "core" or "LOO" (Leave-One-Out).
      • The date stamp (YYYYMMDD).
      • The number of days (28 or 365).
      • The data type (e.g., "ALL" or "min10").
      • The antigen type (e.g., "K", "O", or "OlocusType").
    • The purpose of this section is to ensure models are saved appropriately corresponding to the data they were run on.
  3. Data Preparation:
    • Reads each CSV file into R.
    • Merges each file with site-level metadata from a site_info.csv file to include region data.
    • Expands the dataset to ensure every combination of locus, subgroup, and site is explicitly listed—even if there are zero events.
  4. Bayesian Modeling:
    • Uses the brms package to fit a logistic regression model:

      event | trials(n) ~ 0 + locus + subgroup + (1 | Site) + (1 | locus:subgroup)

    • Employs a binomial family (logit link), generic priors, and 6,000 iterations with specific control parameters (adapt_delta = 0.999999, max_treedepth = 55); a sketch of this fit appears after this list.

  5. Saving:
    • Checks model diagnostics such as maximum Rhat values and minimum effective sample size ratios.
    • Saves the fitted model as an RDS file, naming it based on the extracted metadata.
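
The sketch below illustrates steps 3 to 5 under stated assumptions: the input columns (locus, subgroup, Site, event, n) follow the model formula above, the filenames and the normal(0, 5) prior are placeholders (the actual priors and naming scheme are set in model_run_file.R), and the per-site denominator n is assumed constant within a site.

```r
library(dplyr)
library(tidyr)
library(brms)

# Read one prevalence file plus the site metadata, and attach region information
dat   <- read.csv("data_core/example_input.csv")   # illustrative filename
sites <- read.csv("data_core/site_info.csv")
dat   <- left_join(dat, sites, by = "Site")        # assumes both tables share a Site column

# Make sure every locus is listed at every site/subgroup, with zero events
# where a locus was not observed; n is carried over within each site
dat <- dat %>%
  complete(locus, nesting(subgroup, Site), fill = list(event = 0)) %>%
  group_by(Site) %>%
  fill(n, .direction = "downup") %>%
  ungroup()

# Fit the binomial mixed-effects model described in step 4
fit <- brm(
  event | trials(n) ~ 0 + locus + subgroup + (1 | Site) + (1 | locus:subgroup),
  data    = dat,
  family  = binomial(),
  prior   = set_prior("normal(0, 5)", class = "b"),  # placeholder "generic" prior
  iter    = 6000,
  control = list(adapt_delta = 0.999999, max_treedepth = 55)
)

# Quick convergence checks before saving (step 5)
max(rhat(fit), na.rm = TRUE)
min(neff_ratio(fit), na.rm = TRUE)

# Save the model under a name built from the file's metadata
saveRDS(fit, "models_core/example_model.rds")       # illustrative name
```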

Model Process File Explanation

File: model_process_file.R

Overview

This R script automates the reloading and processing of the Bayesian models previously fitted by model_run_file.R. It begins by loading the required libraries, listing the CSV data files from the designated directories, and validating their filenames. It then extracts key metadata from the filenames (type, purpose (core or LOO), date stamp, duration, data type, and antigen) and uses this information to construct the corresponding model filenames. The script then loads the pre-fitted models from the appropriate model directory. Using each loaded model, it generates posterior predictions via the posterior_epred function and computes global prevalence estimates for each locus, as well as regional estimates for each (locus, subgroup) pair, by aggregating the posterior draws into summary statistics (mean, median, and 95% credible intervals). Finally, the results are saved as CSV files (with the global posterior data compressed) in an output folder named according to the purpose (core or LOO).

Dependencies

Ensure you have the following R packages installed:

dplyr
stringr
tidyr
brms

File and Directory Structure

  • Data Files: CSV files are located in data_core/ or data_LOO/. A site_info.csv file must also reside in the same directory.
  • Models: Pre-fitted models are loaded from models_core/ or models_LOO/, depending on whether the file path indicates “core” or “LOO”.
  • Outputs: Processed posterior data and summary estimates are saved in outputs_core/ or outputs_LOO/, based on the detected purpose.

The file begins by loading the necessary libraries for data manipulation, string operations, and Bayesian modelling, then:

  1. Lists and Validates Data Files:
    Searches the data_core/ (or data_LOO/) folder for CSV files, checks them against a predefined naming pattern, and flags any files that do not conform.

  2. Metadata Extraction:
    For each file, it extracts key information:

    • The type (e.g., "Full", "Carba", "ESBL", "Fatal").
    • Whether the file is "core" or "LOO" (Leave-One-Out).
    • The date stamp (YYYYMMDD).
    • The number of days (28 or 365).
    • The data type (e.g., "ALL" or "min10").
    • The antigen type (e.g., "K", "O", or "OlocusType").
  3. Model Loading and Data Processing:
    For each file and for each data type ("adj" and "raw"), the script constructs the appropriate model filename from the metadata and loads the corresponding pre-fitted model. It then reads the prevalence data and site information, merges them to incorporate regional data, and expands the dataset to ensure every combination of locus, subgroup, and site is present.

  4. Global Processing:
    The script uses the loaded model to generate posterior predictions with the posterior_epred function, and then calculates global prevalence estimates for each locus by aggregating the predictions (computing summary statistics such as mean, median, and 95% credible intervals). These global results are saved as CSV files, with the posterior draws compressed as .gz files (see the sketch after this list).

  5. Regional Processing:
    Similarly, it computes posterior prevalence estimates for each (locus, subgroup) pair, aggregates the results into summary statistics, and saves both the detailed posterior draws and the regional summary estimates as CSV files, with the posterior draws compressed as .gz files.
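
A minimal sketch of steps 4 and 5, assuming a fitted brmsfit object and the column names used above. Setting n = 1 in the prediction grid makes posterior_epred() return a proportion (prevalence), and re_formula = NA is one simple way to exclude the group-level terms; the script's exact marginalisation, regional weighting, and output filenames may differ.

```r
library(dplyr)
library(tidyr)
library(brms)

fit <- readRDS("models_core/example_model.rds")    # illustrative name
dat <- read.csv("data_core/example_input.csv")     # illustrative name

# Prediction grid: one row per (locus, subgroup), with n = 1 so that
# posterior_epred() returns prevalence rather than an expected count
newdat <- expand_grid(locus    = unique(dat$locus),
                      subgroup = unique(dat$subgroup),
                      n        = 1)

# Posterior draws of prevalence (rows = draws, columns = rows of newdat)
draws <- posterior_epred(fit, newdata = newdat, re_formula = NA)

# Regional estimates: summarise the draws for each (locus, subgroup) pair
regional <- newdat %>%
  mutate(mean   = colMeans(draws),
         median = apply(draws, 2, median),
         lower  = apply(draws, 2, quantile, probs = 0.025),
         upper  = apply(draws, 2, quantile, probs = 0.975))

# Global estimates: pool the draws across subgroups for each locus
# (a simple unweighted average here), then summarise the pooled draws
global <- lapply(unique(newdat$locus), function(l) {
  pooled <- rowMeans(draws[, newdat$locus == l, drop = FALSE])
  data.frame(locus  = l,
             mean   = mean(pooled),
             median = median(pooled),
             lower  = unname(quantile(pooled, 0.025)),
             upper  = unname(quantile(pooled, 0.975)))
}) %>% bind_rows()

write.csv(regional, "outputs_core/regional_estimates.csv", row.names = FALSE)
write.csv(global, "outputs_core/global_estimates.csv", row.names = FALSE)
# The posterior draws themselves can be written compressed, e.g. via gzfile()
```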

Diagnostic Table File Explanation

File: diagnostics_table.R

Overview

This R script automates the assessment of diagnostic performance for the pre-fitted Bayesian models and summarises the key model diagnostics. It begins by loading the required libraries for data manipulation, string operations, and Bayesian modelling. For each data file, the script extracts essential metadata from the filename, such as subset type, date stamp, duration (28 or 365 days), data type, and antigen, and constructs the corresponding model filename for both adjusted ("adj") and raw ("raw") analyses. The script then loads the appropriate pre-fitted model from the models_paper/ directory, computes diagnostic metrics (including maximum Rhat values, minimum effective sample size ratios, and the number of divergent transitions), and appends these values to a results table. Finally, the script further processes the results table by extracting additional metadata from the model filenames and selecting the relevant columns, before saving the summarised diagnostics as a CSV file (results_table.csv).

Dependencies

Ensure you have the following R packages installed:

dplyr
stringr
tidyr
brms

File and Directory Structure

  • Data Files: CSV files are located in data_paper/. A site_info.csv file must also reside in that directory.
  • Models: Pre-fitted models are loaded from the models_paper/ directory.
  • Results: The summarised model diagnostics are saved in a CSV file (results_table.csv).

The file begins by loading the necessary libraries for data manipulation and Bayesian modelling, then:

  1. Lists and Validates Data Files:
    The script lists all CSV files in the data_paper/ directory (excluding site_info.csv), validates them against a predefined naming pattern, and warns if any files do not match the required format.

  2. Initialization of the Results Table:
    An empty results table is initialized to store key diagnostic metrics: model filename, maximum Rhat, minimum effective sample size ratio, and the number of divergent transitions.

  3. Model Loading and Diagnostic Extraction:
    For each file, the script extracts metadata (subset, date stamp, duration, data type, antigen, and study information if available) from the filename and constructs the corresponding model filename for both "adj" and "raw" data types. It then loads each pre-fitted model from the models_paper/ directory and computes diagnostic metrics, including the maximum Rhat, the minimum effective sample size ratio, and the total count of divergent transitions from the sampler diagnostics. These values are added as new rows to the results table.

  4. Post-Processing of the Results Table:
    After processing all models, the results table is further refined by extracting additional metadata (such as subset, type, days, data type, and antigen) from the model filenames, and selecting only the relevant columns to create a concise summary of the model diagnostics.

  5. Saving the Results:
    The final summarised results are printed to the console and saved as a CSV file (results_table.csv) for further review and analysis; a sketch of these diagnostic calls follows this list.
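
A minimal sketch of the diagnostics computed for a single model, assuming a brmsfit object; the model filename is illustrative:

```r
library(brms)

model_file <- "models_paper/example_model_adj.rds"   # illustrative name
fit <- readRDS(model_file)

max_rhat  <- max(rhat(fit), na.rm = TRUE)            # should be close to 1
min_neff  <- min(neff_ratio(fit), na.rm = TRUE)      # effective sample size ratio
np        <- nuts_params(fit)                        # per-draw sampler diagnostics
divergent <- sum(np$Value[np$Parameter == "divergent__"])

results_table <- data.frame(model     = basename(model_file),
                            max_rhat  = max_rhat,
                            min_neff  = min_neff,
                            divergent = divergent)
write.csv(results_table, "results_table.csv", row.names = FALSE)
```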

Citation

If you use the data, code, figures or tables presented here, please cite this repository and the paper:

"Distribution of capsule and O types in Klebsiella pneumoniae causing neonatal sepsis in Africa and South Asia: meta-analysis of genome-predicted serotype prevalence and potential vaccine coverage".

Thomas D Stanton/Shaun P Keegan, Jabir A Abdulahi, Anne V Amulele, Matthew Bates, Eva Heinz, Yogesh Hooda, Weiming Hu, Kajal Jain, Samiah Kanwar, Rindidzani Magobo, Courtney P Olwagen, John M Tembo, Tolbert Sonda, Jonathan Strysko, Caroline C Tigoi, Sameen Ahmad Amin, Kyle Bittinger, Jennifer Cornick, Ebenezer Foster-Nyarko, Wilson Gumbi, Aneeta Hotwani, Naveed Iqbal, Steven M Jones, Furqan Kabir, Waqasuddin Khan, Chileshe L Musyani, Carolyn M McGann, Varsha Mittal, Ahmed M Moustafa, Patrick Musicha, James CL Mwansa, Moreka L Ndumba, Erkison E Odih, Donwilliams O Omuoyo, Oliver Pearse, Laura T Phillips, Paul J Planet, Aniqa Abdul Rasool, Charlene MC Rodrigues, Kirsty Sands, Arif M Tanmoy, Erin Theiller, Allan M Zuza, Sulagna Basu, Grace J Chan, Kenneth C Iregbu, Jean-Baptiste Mazarati, Semaria Solomon Alemayehu, Timothy R Walsh, Rabaab Zahra, Angela Dramowski, Sombo Fwoloshi, Appiah-Korang Labi, Lola Madrid, Noah Obeng-Nkrumah, David Ojok, Boaz D Wadugu, Andrew C Whitelaw, Adhisivam Bethou, Anudita Bhargava, Atul Jindal, Ruchi N Nanavati, Priyanka S Prasad, Apurba Sastry, Joveria Q Farooqi, Najia Ghanchi, Fyezah Jehan, Erum Khan, Ramesh K Agarwal, Alexander M Aiken, James A Berkley, Susan E Coffin, Nicholas A Feasey, Nelesh P Govender, Davidson H Hamer, Shabir A Madhi, M Imran Nisar, Samir K Saha, Senjuti Saha, M Jeeva Sankar, Kelly L Wyres/Kathryn E Holt

medRxiv, 2025
