If you use this pipeline or its data, please cite as follows:
Carvalho Brom, P., & dos Santos, P. H. (2025). pcbrom/MMBN-CAS: v1.1 (v1.1). Zenodo. https://doi.org/10.5281/zenodo.17603492
```bibtex
@software{carvalho_brom_2025_17603492,
  author    = {Carvalho Brom, Pedro and dos Santos, Paulo Henrique},
  title     = {pcbrom/MMBN-CAS: v1.1},
  month     = nov,
  year      = 2025,
  publisher = {Zenodo},
  version   = {v1.1},
  doi       = {10.5281/zenodo.17603492},
  url       = {https://doi.org/10.5281/zenodo.17603492},
  swhid     = {swh:1:dir:339dd11e50fa3653469351dc4b7dad1706bba71c;origin=https://doi.org/10.5281/zenodo.17603491;visit=swh:1:snp:60f15d9eb83b37976d6635ac0202877893607b0f;anchor=swh:1:rel:d61db5e90059bfc3488f3b113b065041903f781a;path=pcbrom-MMBN-CAS-1d4b85d},
}
```
End-to-end pipeline for simulation, psychometric analysis, and equating of a 22-item ordinal instrument (Likert 1–5). This repository includes:
- Synthetic response generation with an LLM for maturity profiles (novice, intermediate, advanced).
- Reliability, dimensionality (PA/EFA/CFA), and IRT calibration (GRM) in R.
- Equating and classification by cut scores, plus complementary analyses in Python.
Flow overview
- `simulador.ipynb`: builds a response agenda and calls the OpenAI API to generate item/persona/profile responses, saving `outputs/mmbncas_llm_raw.jsonl` (a reading sketch follows this list).
- `analise_simulacao.R`: reads the JSONL, computes reliability (α, ω), runs PA/EFA/CFA, calibrates a GRM (`mirt`), and exports parameters and scores: `outputs/grm_item_parameters_mmab_ncas.csv`, `outputs/mmbncas_llm_with_theta.csv`, plus figures.
- `equalizacao.ipynb`: re-estimates `theta_hat` via MLE from the GRM parameters, applies the cut points `[-0.20, 0.40]` to assign profiles (novice/intermediate/advanced), and generates diagnostics/plots (Wright map, standardized residuals, categorical divergence, PCA/clustering). It may produce `outputs/mmbncas_llm_with_theta_scored.csv`.
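The JSONL file is the handoff between the simulator and the analysis steps. A minimal sketch of how it can be inspected is shown below; the exact field names, and whether the `q1..q22` answers arrive nested inside a JSON blob, are assumptions to be checked against the actual file:

```python
# Sketch only: inspect the simulator output consumed by the later steps.
# Field names (uuid, profile, persona, theta, q1..q22) are assumed from the
# artifact description; adjust if the real file structures them differently.
import json
import pandas as pd

rows = []
with open("outputs/mmbncas_llm_raw.jsonl", encoding="utf-8") as fh:
    for line in fh:
        rows.append(json.loads(line))

df = pd.json_normalize(rows)             # flattens a nested answer blob, if present
print(df.shape)
print(df.filter(regex=r"^q\d+").head())  # quick look at the ordinal item columns
```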
Items and dimensions (used in the analyses)
- Items: `q1`…`q22`, responses in 1–5 (ordinal).
- Hypothesized dimensions:
  - Governance & Strategy: `q1`–`q6`
  - Operational Integration: `q7`–`q12`
  - Sustainability & Scalability: `q13`–`q22`
- Profiles and cut points for classification: `CUTS = [-0.20, 0.40]`, `LABELS = ["novice", "intermediate", "advanced"]` (see the sketch below).
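As a concrete reading of the cut-point rule: a theta estimate below -0.20 maps to "novice", between -0.20 and 0.40 to "intermediate", and at or above 0.40 to "advanced". A minimal NumPy sketch (illustrative; the notebook's boundary handling may differ):

```python
# Sketch of the cut-score classification; boundary handling in the notebook may differ.
import numpy as np

CUTS = [-0.20, 0.40]
LABELS = ["novice", "intermediate", "advanced"]

def classify(theta_hat: np.ndarray) -> np.ndarray:
    """Map theta estimates to profile labels using the fixed cut points."""
    idx = np.digitize(theta_hat, CUTS)            # 0, 1 or 2 per respondent
    return np.asarray(LABELS, dtype=object)[idx]

print(classify(np.array([-0.5, 0.1, 0.9])))       # ['novice' 'intermediate' 'advanced']
```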
Requirements
- Python 3.10+ and R 4.2+ (recommended)
- Python (core, in `requirements.txt`): `python-dotenv`, `pandas`, `tqdm`, `openai`, `rpy2`, `numpy`, `tenacity`
- Extra Python packages used in the notebooks: `matplotlib`, `seaborn`, `scipy`, `scikit-learn`, `factor_analyzer`, `jupyter`
- R packages: `tidyverse`, `psych`, `GPArotation`, `lavaan`, `mirt`
Quick setup
- Python
  - Create a virtual environment and install the core dependencies:
    - `python -m venv .venv && source .venv/bin/activate` (Linux/Mac)
    - `python -m venv .venv && .venv\Scripts\activate` (Windows)
    - `pip install -r requirements.txt`
  - For the notebooks: `pip install jupyter matplotlib seaborn scipy scikit-learn factor_analyzer`
- R
  - From R/RStudio: `install.packages(c("tidyverse","psych","GPArotation","lavaan","mirt"))`
- Credentials (to run the LLM simulator)
  - Create a `.env` file with: `OPENAI_API_KEY=your_token_here` (see the sketch after this list for how the key is typically loaded).
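A minimal sketch of how the simulator can pick up the key via `python-dotenv` and create an OpenAI client; the model name below is only a placeholder, and the actual cells in `simulador.ipynb` may differ:

```python
# Sketch only: load OPENAI_API_KEY from .env and create a client.
# "gpt-4o-mini" is a placeholder; the real model is set inside simulador.ipynb.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()                                     # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

resp = client.chat.completions.create(
    model="gpt-4o-mini",                          # placeholder model name
    messages=[{"role": "user", "content": "Answer item q1 on a 1-5 Likert scale."}],
)
print(resp.choices[0].message.content)
```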
How to run
- Option A — Reproduce with existing artifacts (no API costs):
  - Skip `simulador.ipynb` and use the existing `outputs/mmbncas_llm_raw.jsonl`.
  - Run `analise_simulacao.R` (RStudio or `Rscript analise_simulacao.R`).
  - Open and run `equalizacao.ipynb` (ensure `PARAM_PATH` and `RESP_PATH` point to the files in `outputs/`).
- Option B — Full pipeline (incurs API costs):
  - `simulador.ipynb`: configure `.env`, then execute the cells to generate `outputs/mmbncas_llm_raw.jsonl`. Defaults: 20 replicas × 3 profiles × 60 respondents; personas and per-profile theta distributions are defined in the notebook.
  - `analise_simulacao.R`: runs reliability (α, ω), PA/EFA (Spearman, ML + Promax), CFA (DWLS with `lavaan`), and IRT GRM (`mirt`). Exports:
    - `outputs/grm_item_parameters_mmab_ncas.csv` (a, b1..b4)
    - `outputs/mmbncas_llm_with_theta.csv` (consolidated dataset with items and EAP-estimated `theta`)
    - Figures: item/test information, ICC grid, etc.
  - `equalizacao.ipynb`: reads the GRM parameters and responses, estimates `theta_hat` (MLE), computes SE via the information function, classifies by cut points (a scoring sketch follows this list), and generates:
    - `outputs/wright_map.png`, `outputs/standardized_residuals.png`, `outputs/categorical_divergence.png`, `outputs/pca_clustering_analysis.png`
    - A dataset with `theta_hat` and the predicted profile (e.g., `outputs/mmbncas_llm_with_theta_scored.csv`)
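To make the MLE scoring step concrete, here is a self-contained sketch under Samejima's graded response model, where the boundary probability is P*_k(theta) = 1 / (1 + exp(-a(theta - b_k))) and category probabilities are differences of adjacent boundaries. It illustrates the technique only; the column names in the parameters CSV are assumed to be `a`, `b1`..`b4` as exported above, and `equalizacao.ipynb` may differ in detail.

```python
# Sketch: MLE estimation of theta_hat from GRM parameters (a, b1..b4) for
# 22 items with ordinal responses 1..5. Illustrative, not the notebook's code.
import numpy as np
import pandas as pd
from scipy.optimize import minimize_scalar

def grm_category_probs(theta: float, a: float, b: np.ndarray) -> np.ndarray:
    """P(X = k), k = 1..5, for one item; b holds the 4 ordered thresholds."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= k+1), k = 1..4
    bounds = np.concatenate(([1.0], p_star, [0.0]))   # P(X >= 1) = 1, P(X >= 6) = 0
    return bounds[:-1] - bounds[1:]

def neg_log_lik(theta: float, responses, a_vec, b_mat) -> float:
    """Negative log-likelihood of one respondent's answer vector."""
    ll = 0.0
    for x, a, b in zip(responses, a_vec, b_mat):
        p = grm_category_probs(theta, a, b)[int(x) - 1]
        ll += np.log(max(p, 1e-12))                   # guard against log(0)
    return -ll

params = pd.read_csv("outputs/grm_item_parameters_mmab_ncas.csv")  # assumed columns: a, b1..b4
a_vec = params["a"].to_numpy()
b_mat = params[["b1", "b2", "b3", "b4"]].to_numpy()

responses = np.random.randint(1, 6, size=len(params))  # stand-in for one respondent's q1..q22
fit = minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded",
                      args=(responses, a_vec, b_mat))
print("theta_hat =", round(fit.x, 3))
```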
Important notes
- Cost/time: `simulador.ipynb` makes OpenAI API calls. Use the precomputed artifacts in `outputs/` to avoid costs.
- Column `theta`: in the original JSONL, `theta` is the imposed/drawn latent trait; in the R pipeline, the consolidated file stores the estimated `theta` (EAP). The equating notebook treats `theta`, when present, as the imposed/reference value for evaluation (RMSE, correlation). Confirm the semantics before comparing estimates.
- Parallelization: `equalizacao.ipynb` uses `ProcessPoolExecutor` to speed up batch `theta_hat` estimation (see the pattern sketched after this list).
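A minimal, runnable sketch of that parallelization pattern; `estimate_theta` here is a hypothetical placeholder for the notebook's actual scoring routine (e.g., a wrapper around the MLE sketch above):

```python
# Sketch: batch theta_hat estimation with a process pool; not the notebook's code.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def estimate_theta(responses: np.ndarray) -> float:
    """Hypothetical placeholder; in the notebook this would be the GRM MLE scorer."""
    return float(np.mean(responses) - 3.0)             # dummy value to keep the sketch runnable

def estimate_all(response_matrix: np.ndarray, max_workers: int = 4) -> list:
    """Score each respondent's row in parallel across worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(estimate_theta, response_matrix))

if __name__ == "__main__":
    demo = np.random.randint(1, 6, size=(10, 22))       # 10 fake respondents, 22 items
    print(estimate_all(demo))
```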
Repository structure (key files)
- `simulador.ipynb` — LLM-based simulation and response collection.
- `analise_simulacao.R` — R pipeline: parsing, reliability, PA/EFA/CFA, GRM.
- `analise_simulacao.ipynb` — complementary analyses in Python (α/ω, ordinal EFA, R bridge via `rpy2`).
- `equalizacao.ipynb` — equating/MLE estimation, diagnostics, and visualizations.
- `outputs/` — generated artifacts (JSONL, CSVs, figures). A copy of `mmbncas_llm_with_theta_scored.csv` also sits at the repository root.
- `material/` — notes and supporting materials (`material/nota.txt` lists Python packages and writing notes).
- `MMBN-CAS.Rproj` — RStudio project file.
- `LICENSE` — repository license.
Expected results/artifacts (examples)
- `outputs/mmbncas_llm_raw.jsonl` — responses per respondent (uuid), profile, persona, `theta`, and a JSON blob with `q1..q22`.
- `outputs/mmbncas_llm_with_theta.csv` — consolidated dataset with items and GRM EAP `theta`.
- `outputs/grm_item_parameters_mmab_ncas.csv` — IRT GRM parameters (a, b1..b4) by item.
- Figures: `grid_icc_grm.png`, `item_information.png`, `test_information.png`, `wright_map.png`, `standardized_residuals.png`, `categorical_divergence.png`, `pca_clustering_analysis.png`.
License
- See `LICENSE` for terms of use and redistribution.