A powerful, config-driven framework for building physics analyses with ROOT RDataFrame.
RDFAnalyzerCore provides a core, analysis-agnostic framework for constructing and running Analyzer pipelines using ROOT RDataFrame. The framework features:
- Config-Driven Architecture: Separate configuration from code for reproducibility
- Plugin System: Extensible design with BDT, ONNX, correction, and histogram managers
- Lazy Evaluation: Leverages RDataFrame's efficient event processing
- Systematic Support: Built-in handling of systematic variations
- Analysis Modularity: Analyses live in separate repositories, automatically discovered at build time
- Python Bindings: Use the framework from Python with numba and numpy integration
- Statistical Analysis: Optional CMS Combine integration for limit setting and fits
Books and fills N-dimensional histograms with support for systematics, regions, and categories. Supports both manual histogram booking and config-driven histogram definitions.
Config-Driven Histograms: Define histograms in a configuration file for dynamic runtime booking. See docs/CONFIG_HISTOGRAMS.md for detailed documentation.
Quick example:
// Enable histogram manager
auto histManager = std::make_unique<NDHistogramManager>(analyzer.getConfigurationProvider());
analyzer.addPlugin("histogramManager", std::move(histManager));
// Define variables and apply filters
analyzer.Define("jet_pt", computePt, {"jet_px", "jet_py"});
analyzer.Filter("quality", isGood, {"jet_quality"});
// Book histograms from config file (after all defines/filters)
analyzer.bookConfigHistograms();
// Save results
analyzer.save();
Config file format (histograms.txt):
name=pt_hist variable=jet_pt weight=event_weight bins=50 lowerBound=0.0 upperBound=500.0
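To make the line format above concrete, here is a minimal sketch of how one such line could be tokenized. It assumes whitespace-separated `key=value` tokens exactly as in the example; `parse_histogram_line` is a hypothetical helper for illustration, not part of the framework, and the real parser may accept more fields or a different syntax.

```python
def parse_histogram_line(line):
    """Split one whitespace-separated line of key=value tokens into a dict."""
    entry = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        entry[key] = value
    # Convert the numeric fields that appear in the example line.
    entry["bins"] = int(entry["bins"])
    entry["lowerBound"] = float(entry["lowerBound"])
    entry["upperBound"] = float(entry["upperBound"])
    return entry


hist = parse_histogram_line(
    "name=pt_hist variable=jet_pt weight=event_weight "
    "bins=50 lowerBound=0.0 upperBound=500.0"
)
# Width of each bin implied by the three numeric fields.
bin_width = (hist["upperBound"] - hist["lowerBound"]) / hist["bins"]
```

For the example line, the three numeric fields describe 50 equal-width bins between 0 and 500.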
Clone the repository:
git clone git@github.com:brkronheim/RDFAnalyzerCore.git
cd RDFAnalyzerCore
source env.sh # On a CVMFS-backed host
source build.sh
# Run example
cd build/analyses/ExampleAnalysis
./analysis cfg.yaml
New to the framework? Check out the Getting Started Guide.
- Architecture - Core manager wiring, plugin lifecycle, and execution flow
- API Reference - Canonical Analyzer and interface APIs
- Plugin Development - Implementing C++ plugins via IPluggableManager
- Doxygen Guide - C++ documentation standards for headers and interfaces
- Getting Started - Installation and first steps
- Configuration Reference - Complete config file documentation
- Analysis Guide - Building analyses step-by-step
- Python Bindings - Using the framework from Python
- API Reference - Detailed API documentation
- Datacard Generator - Creating CMS Combine datacards
- Systematics Example - Creating histograms with systematic variations
- Combine Integration - Complete workflow from analysis to statistical inference
- Architecture - Internal design and C++ wiring structure
- Plugin Development - Creating custom plugins
- ONNX Implementation - ONNX manager details
- ONNX Multi-Output - Multi-output model support
If you are reading docs for a specific role, start here:
- Developers (framework contributors): docs/ARCHITECTURE.md, docs/API_REFERENCE.md, docs/PLUGIN_DEVELOPMENT.md, docs/DOXYGEN_GUIDE.md
- Analyzers (analysis authors): docs/GETTING_STARTED.md, docs/ANALYSIS_GUIDE.md, docs/CONFIG_REFERENCE.md, docs/CONFIG_HISTOGRAMS.md, docs/NUISANCE_GROUPS.md
- Agents/automation tooling: docs/INDEX.md, docs/ERRORS_AND_TRACING.md, docs/CONFIGURATION_VALIDATION.md, docs/OUTPUT_SCHEMA.md, docs/VALIDATION_REPORTS.md
The docs are intentionally layered: GETTING_STARTED and ANALYSIS_GUIDE show workflow, while API_REFERENCE and headers in core/interface/ are the source of truth for signatures and behavior.
- ROOT 6.30/02 or later (progress bar support requires 6.30+)
- CMake 3.19.0 or later
- C++17 compatible compiler
- Git
For Python bindings (optional):
- Python 3.8+
- pybind11, numpy, numba (install with pip install pybind11 numpy numba)
For LAW / Luigi workflows:
- Create and activate the repository-local virtualenv, then install the production requirements: python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements-production.txt
- After that, source law/env.sh will reuse .venv automatically when it exists.
- xrdfs must be available on PATH for XRootD file discovery workflows.
Self-hosted CI runner Dockerfile
- A ready-to-build runner image including ROOT, Python, numpy and numba is provided at docker/gh-runner.Dockerfile.
- See docs/CI_DOCKERFILE.md for build/run instructions and details.
RDFAnalyzerCore/
├── core/ # Framework code
│ ├── interface/ # Public headers and interfaces
│ ├── src/ # Core implementations
│ ├── plugins/ # Plugin managers (BDT, ONNX, etc.)
│ ├── python/ # HTCondor submission scripts
│ └── tests/ # Core test targets
├── analyses/ # Analysis repositories (git submodules/clones)
├── examples/ # Python binding examples
├── docs/ # Documentation
├── cmake/ # CMake modules
└── build/ # Build artifacts (generated)
The framework includes several built-in plugins for common analysis tasks:
Manages Boosted Decision Trees using the FastForest library.
- Load BDT models from text files
- Apply models with sigmoid activation
- Conditional execution for efficiency
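The sigmoid step and the conditional-execution idea in the list above can be sketched in a few lines. This is an illustration only: `sigmoid` is the standard logistic function, while `evaluate_if` and its `default` value are hypothetical names chosen here and do not describe the BDT manager's actual interface.

```python
import math


def sigmoid(raw_score):
    """Map a raw BDT score (any real number) to the interval [0, 1]."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if raw_score >= 0:
        return 1.0 / (1.0 + math.exp(-raw_score))
    z = math.exp(raw_score)
    return z / (1.0 + z)


def evaluate_if(condition, score_fn, default=-1.0):
    """Call the (possibly expensive) model only when condition is true."""
    # `default` is a hypothetical placeholder for events that skip the model.
    return sigmoid(score_fn()) if condition else default


calls = []
skipped = evaluate_if(False, lambda: calls.append(1) or 0.0)
evaluated = evaluate_if(True, lambda: calls.append(1) or 0.0)
```

In this sketch the model function runs once: the first event is skipped without evaluating it, which is the efficiency gain that conditional execution is meant to provide.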
Manages ONNX machine learning models from any ML framework.
- Automatic ONNX Runtime setup (no manual installation)
- Support for multi-output models (e.g., ParticleTransformer)
- Thread-safe inference with ROOT ImplicitMT
- See: ONNX Implementation Guide
Manages SOFIE (System for Optimized Fast Inference code Emit) models from ROOT TMVA.
- Build-time compilation from ONNX for maximum performance
- Zero runtime overhead (compiled C++ code)
- Eliminates runtime model loading overhead compared to ONNX Runtime
- Manual registration required (rebuild for model updates)
- See: SOFIE Implementation Guide
Applies scale factors and corrections using correctionlib.
- JSON-based correction definitions
- Automatic application of configured corrections
- Support for multi-dimensional lookups
Handles trigger logic and trigger menu configuration.
- Configurable trigger groups with OR logic
- Trigger veto support
- Sample-specific trigger configurations
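One plausible reading of "trigger groups with OR logic" plus a veto is shown below. The function name, argument layout, and trigger names are all hypothetical, and the actual manager may combine groups and vetoes differently.

```python
def passes_triggers(fired, accept, veto=()):
    """Return True if any accept trigger fired and no veto trigger fired.

    fired  : dict mapping trigger name -> bool for one event
    accept : trigger names combined with OR
    veto   : trigger names that reject the event if any of them fired
    """
    if any(fired.get(name, False) for name in veto):
        return False
    return any(fired.get(name, False) for name in accept)


# Hypothetical trigger names, used only for illustration.
event = {"trig_mu": True, "trig_el": False, "trig_jet": True}
accepted = passes_triggers(event, accept=["trig_mu", "trig_el"])
vetoed = passes_triggers(event, accept=["trig_mu", "trig_el"], veto=["trig_jet"])
```

A veto of this kind is commonly used to avoid double counting events that are already selected in another dataset; whether the framework uses it that way is an assumption.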
Books and fills N-dimensional histograms.
- Support for systematics, regions, and categories
- Automatic systematic axis generation
- Vector-based filling with scalar expansion
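"Vector-based filling with scalar expansion" can be read as: when the filled variable is a per-object vector and the weight is a single per-event number, the weight is repeated for every element. The sketch below illustrates that interpretation; `expand_fill` is a hypothetical helper, not the manager's API.

```python
def expand_fill(values, weight):
    """Pair each value with a weight, broadcasting a scalar weight if needed."""
    if isinstance(weight, (int, float)):
        weights = [float(weight)] * len(values)
    else:
        weights = [float(w) for w in weight]
        if len(weights) != len(values):
            raise ValueError("per-element weights must match the number of values")
    return list(zip(values, weights))


# Three jets in one event, one event weight: the weight is applied to each jet.
fills = expand_fill([31.0, 47.5, 120.0], 0.5)
```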
All framework behavior is controlled through text configuration files:
- Main configuration: I/O, performance, plugin configs
- Plugin configs: Model definitions, corrections, triggers
- Output configs: Branch selection, histogram definitions
- Analysis-local registries can also live in YAML when they are primarily data, such as the VHqq run-era payload and trigger map in analyses/VHbbcc/VHqqRDF/cfg/year_settings.yaml
Example:
# Main config
fileList=data.root
saveFile=output.root
threads=-1
bdtConfig=cfg/bdts.txt
onnxConfig=cfg/onnx_models.txt
See Configuration Reference for complete documentation.
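For readers who want to inspect such a file from a script, the following is a minimal sketch of reading the `key=value` format shown in the example. It assumes one pair per line and that `#` starts a comment, as the example suggests; `parse_config` is a hypothetical helper and is not the framework's ConfigurationManager.

```python
def parse_config(text):
    """Parse key=value lines into a dict, ignoring blank lines and # comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {raw!r}")
        config[key.strip()] = value.strip()
    return config


example = """\
# Main config
fileList=data.root
saveFile=output.root
threads=-1
bdtConfig=cfg/bdts.txt
onnxConfig=cfg/onnx_models.txt
"""
config = parse_config(example)
# Splitting on commas handles a single file as well as a comma-separated list.
input_files = config["fileList"].split(",")
```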
# Clone repository
git clone git@github.com:brkronheim/RDFAnalyzerCore.git
cd RDFAnalyzerCore
# Setup environment
source env.sh
# Build
source build.sh
Ensure ROOT and CMake are available:
# Setup ROOT
source <root-install>/bin/thisroot.sh
# Build
cmake -S . -B build
cmake --build build -j$(nproc)
The framework supports optional features that can be enabled at build time:
# Default build (tests enabled, Combine disabled)
cmake -S . -B build
# Disable tests (faster build for production)
cmake -S . -B build -DBUILD_TESTS=OFF
# Enable CMS Combine for statistical analysis
cmake -S . -B build -DBUILD_COMBINE=ON
# Enable both Combine and CombineHarvester
cmake -S . -B build \
-DBUILD_COMBINE=ON \
-DBUILD_COMBINE_HARVESTER=ON
# Complete build with all options
cmake -S . -B build \
-DBUILD_TESTS=ON \
-DBUILD_COMBINE=ON \
-DBUILD_COMBINE_HARVESTER=ON
cmake --build build -j$(nproc)
Available Options:
- BUILD_TESTS (default: ON) - Build analysis tests
- BUILD_COMBINE (default: OFF) - Build CMS Combine package
- BUILD_COMBINE_HARVESTER (default: OFF) - Build CombineHarvester (requires BUILD_COMBINE=ON)
Note: Building Combine and CombineHarvester takes several minutes and requires an internet connection.
See Combine Integration Guide for complete statistical analysis workflows.
Create and activate a local Python virtual environment before running Python-focused tests. Use the same interpreter for both CTest and direct pytest runs when possible.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-ci.txt pytest
Then run the full C++/Python suite through CTest:
cd build
ctest --output-on-failure
Or run targeted suites directly from the canonical Python test directories:
source .venv/bin/activate
PYTHONPATH=core/python:core/python/law python -m pytest core/tests/python/ core/tests/law/ -q
The C++ suite and Python binding smoke test now live under core/tests/cpp/.
If you rebuild or reconfigure after creating the virtual environment, prefer:
cmake -S . -B build -DPython3_EXECUTABLE=.venv/bin/python
For a fast sanity check after edits, run (cd build && ctest --output-on-failure), then from the repository root run source .venv/bin/activate && PYTHONPATH=core/python:core/python/law python -m pytest core/tests/python/ core/tests/law/ -q.
Analyses are developed in separate repositories and automatically discovered during build:
# Clone your analysis into analyses/
cd analyses
git clone <your-analysis-repo> MyAnalysis
cd ..
# Rebuild - your analysis is automatically found
source build.sh
# Run
cd build/analyses/MyAnalysis
./myanalysis config.txt
Requirements for analysis repositories:
- Must contain a CMakeLists.txt at the root
- Should link against RDFCore library
- Configuration files typically in cfg/ subdirectory
See Analysis Guide for step-by-step instructions.
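The discovery rule stated above (a directory under analyses/ counts as an analysis if it has a CMakeLists.txt at its root) can be mimicked with a short script, for example to check a checkout before building. This is a sketch of the rule only; the real discovery happens in the CMake build, and `discover_analyses` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def discover_analyses(analyses_dir):
    """Return names of subdirectories that contain a top-level CMakeLists.txt."""
    root = Path(analyses_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / "CMakeLists.txt").is_file()
    )


# Demonstration on a throwaway directory tree (directory names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "MyAnalysis").mkdir()
    (Path(tmp) / "MyAnalysis" / "CMakeLists.txt").write_text("# placeholder\n")
    (Path(tmp) / "notes").mkdir()  # no CMakeLists.txt, so it is skipped
    found = discover_analyses(tmp)
```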
The framework is built around several key components that work together:
- Analyzer: Central orchestrator providing a simplified API
- ConfigurationManager: Loads and provides access to configuration
- DataManager: Wraps ROOT::RDataFrame with systematic support
- SystematicManager: Tracks and propagates systematic variations
- Plugins: Extensible managers for specific tasks (ML, corrections, histograms)
- Analysis Services: Internal service hooks (for example provenance and counters)
- OutputSinks: Abstract destinations for skims and metadata
The core is wired in C++ through interfaces and dependency-aware plugin ordering. Analyzer owns core managers and services, injects a shared ManagerContext, then executes plugin lifecycle hooks in a deterministic order.
Configuration Files
↓
ConfigurationManager → Plugins Load Configs
↓
DataManager builds TChain & RDataFrame
↓
User Code: Define Variables, Apply Filters
↓
Plugins: Apply Models, Corrections, Book Histograms
↓
RDataFrame Event Loop (Lazy Evaluation)
↓
Output Sinks: Write Skims & Metadata
Key Design Principles:
- Interface-based: Components depend on interfaces, not implementations
- Plugin architecture: Extensible without modifying core
- Config-driven: Behavior controlled by text files
- Lazy evaluation: Efficient processing via RDataFrame
See Architecture Documentation for detailed internals.
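The "dependency-aware plugin ordering" and "deterministic order" mentioned above can be illustrated with a small topological sort. The sketch assumes each plugin declares the plugins it depends on; the dependency table below is invented for the example and does not reflect the framework's real plugin graph or its C++ implementation.

```python
def plugin_order(dependencies):
    """Order plugins so each one comes after everything it depends on.

    dependencies maps plugin name -> set of plugin names it depends on.
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())
    order = []
    while remaining:
        # Sorting makes the result deterministic when several plugins are ready.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        name = ready[0]
        order.append(name)
        del remaining[name]
        for deps in remaining.values():
            deps.discard(name)
    return order


# Hypothetical dependencies: histograms are booked after corrections and ML outputs exist.
order = plugin_order({
    "histogramManager": {"correction", "onnx"},
    "onnx": set(),
    "correction": set(),
    "trigger": set(),
})
```

Breaking ties alphabetically is one simple way to get a reproducible order; any fixed tie-breaking rule would serve the same purpose.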
Here's a minimal analysis using the framework:
#include <analyzer.h>
#include <ROOT/RVec.hxx>
using ROOT::VecOps::RVec;
int main(int argc, char **argv) {
// Create analyzer from config file
Analyzer analyzer(argv[1]);
// Define variables
analyzer.Define("good_jets",
[](const RVec<float>& pt, const RVec<float>& eta) {
return pt > 25.0 && abs(eta) < 2.5;
},
{"jet_pt", "jet_eta"}
);
analyzer.Define("n_good_jets",
// RVec comparisons yield RVec<int>, so the mask is read back with that type
[](const RVec<int>& good) { return Sum(good); },
{"good_jets"}
);
// Apply selection
analyzer.Filter("jet_selection",
[](int n_jets) { return n_jets >= 4; },
{"n_good_jets"}
);
// Apply ML model (from config)
auto onnxMgr = analyzer.getPlugin<OnnxManager>("onnx");
if (onnxMgr) {
onnxMgr->applyAllModels();
}
// Save outputs
analyzer.run();
return 0;
}
Configuration (config.txt):
fileList=data1.root,data2.root
saveFile=output.root
threads=-1
onnxConfig=cfg/onnx_models.txt
saveConfig=cfg/output_branches.txt
See Analysis Guide for complete examples.
The framework can also be used from Python with high performance:
import rdfanalyzer
# Create analyzer from config file
analyzer = rdfanalyzer.Analyzer("config.txt")
# Define variables using C++ expressions (ROOT JIT)
analyzer.Define("pt_gev", "pt / 1000.0", ["pt"])
analyzer.Define("delta_r",
"sqrt(delta_eta*delta_eta + delta_phi*delta_phi)",
["delta_eta", "delta_phi"])
# Or use numba-compiled functions (an alternative to the string-based pt_gev definition above)
import numba, ctypes
@numba.cfunc("float64(float64)")
def convert_to_gev(pt):
return pt / 1000.0
func_ptr = ctypes.cast(convert_to_gev.address, ctypes.c_void_p).value
analyzer.DefineFromPointer("pt_gev", func_ptr, "double(double)", ["pt"])
# Apply filters and save
analyzer.Filter("high_pt", "pt_gev > 25.0", ["pt_gev"])
analyzer.save()
Key Features:
- String-based expressions (ROOT JIT compilation)
- Numba function pointers for custom logic
- Numpy array integration
- Full systematic variation support
See Python Bindings Guide for complete documentation and examples.
The framework supports multiple ML backends:
- ONNX: Runtime evaluation of models from any framework (PyTorch, TensorFlow, scikit-learn)
- BDT: FastForest-based boosted decision trees
- SOFIE: Build-time compiled models for maximum performance
PlottingUtility can build compiled-ROOT stack plots directly from the meta output file. It supports:
- per-process normalization through optional counter histograms (for example counter_weightSum_<sample>)
- linear and log-y stack plots
- optional data/MC ratio panels
- ratio/error/pull summary computation
- PCA-based mean/up/down envelope construction from variation histograms
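As an illustration of the ratio and pull quantities listed above, the sketch below uses common definitions: ratio = data / MC, and pull = (data - MC) / sigma with sigma combining the Poisson variance of the data and the MC uncertainty. These definitions are assumptions made for the example; PlottingUtility may define its summaries differently.

```python
import math


def ratio_and_pull(data, mc, mc_err):
    """Per-bin data/MC ratio and pull, with None where a quantity is undefined."""
    ratios, pulls = [], []
    for d, m, e in zip(data, mc, mc_err):
        ratios.append(d / m if m != 0 else None)
        # Assumed uncertainty model: Poisson variance of data plus MC variance.
        sigma = math.sqrt(d + e * e)
        pulls.append((d - m) / sigma if sigma > 0 else None)
    return ratios, pulls


ratios, pulls = ratio_and_pull(data=[100.0, 0.0], mc=[90.0, 0.0], mc_err=[3.0, 0.0])
```

Empty bins are reported as None rather than as zero so that a plotting step can skip them explicitly.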
All managers support:
- Conditional execution (skip expensive inference when not needed)
- Multi-output models
- Thread-safe inference with ROOT ImplicitMT
Execution entry points:
- save() always writes the configured skim output and finalizes plugins/services.
- run() conditionally writes skim output when enableSkim=1|true|True, saves ND histograms (if histogramManager is registered), and finalizes plugins/services.
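The difference between the two entry points can be summarized as a small decision function. The sketch follows the description above literally; treating a missing enableSkim key as disabled is an assumption, and `outputs_written` is a hypothetical helper that models behavior rather than calling the framework.

```python
def skim_enabled(config):
    """True when enableSkim is one of the accepted values 1, true, or True."""
    # Assumption: a missing key means the skim is disabled.
    return config.get("enableSkim", "") in {"1", "true", "True"}


def outputs_written(entry_point, config, has_histogram_manager):
    """List which outputs each entry point writes, per the description above."""
    if entry_point == "save":
        return ["skim"]
    if entry_point == "run":
        outputs = []
        if skim_enabled(config):
            outputs.append("skim")
        if has_histogram_manager:
            outputs.append("histograms")
        return outputs
    raise ValueError(f"unknown entry point: {entry_point!r}")


histogram_only = outputs_written("run", {"enableSkim": "0"}, has_histogram_manager=True)
```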
Built-in support for systematic variations:
sysMgr->registerSystematic("jes_up");
sysMgr->registerSystematic("jes_down");
analyzer.Define("corrected_pt",
[](float pt, const std::string& sys) {
// float literals keep every return statement of type float
if (sys == "jes_up") return pt * 1.02f;
if (sys == "jes_down") return pt * 0.98f;
return pt;
},
{"jet_pt"},
sysMgr
);
Histograms automatically include systematic axes.
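The same variation logic, written as a plain lookup table, shows what each registered systematic produces for a single value. The 2% scale factors are taken from the example above; the "nominal" label and the helper name are assumptions made for illustration.

```python
# Scale factor per variation; "nominal" is a label chosen for this sketch.
JES_SCALE = {"nominal": 1.0, "jes_up": 1.02, "jes_down": 0.98}


def corrected_pt(pt, systematic="nominal"):
    """Scale pt according to the named jet-energy-scale variation."""
    return pt * JES_SCALE[systematic]


# One corrected value per variation, i.e. one entry along a systematic axis.
variations = {name: corrected_pt(100.0, name) for name in JES_SCALE}
```

Each key of `variations` corresponds to one position on the systematic axis that the histograms carry.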
The framework includes a Python script for generating CMS Combine datacards from analysis outputs:
# Install dependencies (uproot-based, no PyROOT required)
pip install uproot awkward numpy pyyaml
# Generate datacards
python core/python/create_datacards.py config.yaml
Features:
- Pure Python: Uses uproot (no PyROOT dependency)
- YAML-based configuration for datacards
- Multiple control region support
- Sample combination (binned/stitched samples)
- Observable rebinning (uniform and variable)
- Systematic uncertainties (rate and shape)
- Automatic systematic variation reading from input files
- Full Combine and CombineHarvester integration
See:
- Datacard Generator Guide for complete documentation
- Systematics Example for creating histograms with systematic variations
- Combine Integration for complete statistical analysis workflow
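The two rebinning modes named in the feature list can be sketched on plain lists of bin contents. Uniform rebinning merges a fixed number of adjacent bins; variable rebinning merges bins between a chosen subset of the original edges. Both helpers are hypothetical illustrations and are not taken from create_datacards.py.

```python
def rebin_uniform(contents, factor):
    """Merge every `factor` adjacent bins into one."""
    if factor <= 0 or len(contents) % factor != 0:
        raise ValueError("factor must be positive and divide the number of bins")
    return [sum(contents[i:i + factor]) for i in range(0, len(contents), factor)]


def rebin_variable(contents, edges, new_edges):
    """Merge bins so that new_edges (a subset of edges) become the boundaries."""
    # list.index raises ValueError if a requested edge is not an existing edge.
    idx = [edges.index(edge) for edge in new_edges]
    return [sum(contents[lo:hi]) for lo, hi in zip(idx, idx[1:])]


uniform = rebin_uniform([1, 2, 3, 4, 5, 6], 2)
variable = rebin_variable([1, 2, 3, 4], [0, 10, 20, 30, 40], [0, 10, 40])
```

Restricting the new edges to existing ones keeps the operation exact, since no bin content has to be split between two output bins.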
Unified production management system for batch analyses.
Legacy compatibility submission scripts have been removed. New production workflows should use LAW discovery tasks to generate job configs and core/python/production_monitor.py / core/python/production_manager.py for submission, monitoring, validation, and resubmission.
Recommended workflow:
# Discover files via Rucio and build job configs
law run GetRucioFileList --submit-config analyses/myAnalysis/cfg/submit_config.txt --name myRun
law run SkimTask --submit-config analyses/myAnalysis/cfg/submit_config.txt --dataset-manifest analyses/myAnalysis/cfg/datasets.yaml --name mySkimRun --file-source rucio --file-source-name myRun --exe build/analyses/MyAnalysis/myanalysis
# Monitor and validate
python core/python/production_monitor.py monitor --name mySkimRun
python core/python/production_monitor.py validate --name mySkimRun
python core/python/production_monitor.py resubmit --name mySkimRun
Features:
- Unified job lifecycle management (generation, submission, monitoring, validation)
- State persistence (resilient to connection failures)
- Real-time progress monitoring
- Automatic output validation
- Failure recovery and resubmission
- HTCondor and Dask backend support
- Works in AFS/EOS storage areas
See: Production Manager Guide for complete documentation.
Legacy batch submission scripts have been removed. Batch submission now uses LAW discovery tasks and the production manager toolchain.
Features:
- Rucio-based dataset discovery via LAW
- Open Data discovery via LAW tasks
- Automatic input/output staging
- XRootD support
- Shared executable staging
- Configuration validation
See: Batch Submission Guide for complete documentation.
Support for custom C++ objects:
cmake -S . -B build \
-DRDF_CUSTOM_DICT_HEADERS="MyEvent.h;MyObject.h" \
-DRDF_CUSTOM_DICT_LINKDEF="MyLinkDef.h" \
-DRDF_CUSTOM_DICT_INCLUDE_DIRS="/path/to/headers"
Dictionaries are automatically built and linked.
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
For new plugins, see Plugin Development Guide.
- Documentation: Check the docs/ directory
- Issues: Open an issue on GitHub
- Examples: See analyses/ExampleAnalysis/
This project is licensed under the terms specified in the repository.
- Built on ROOT RDataFrame
- Uses ONNX Runtime for ML inference
- Corrections via correctionlib
- BDT support via FastForest
Full Documentation: https://brkronheim.github.io/RDFAnalyzerCore/