AI agent benchmark hackability scanner — find evaluation vulnerabilities before they undermine your results
Updated May 22, 2026 - Python
A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment
Real-time reward debugging and hacking detection for reinforcement learning
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating (task-level gaming and harness-level cheating) on top benchmarks.
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Plug-and-play reward monitoring for RL training loops. Catch reward hacking, component imbalance, and starvation before they tank your run. Drop in one .step() call — get balance reports, auto weight correction, alignment scores, and WandB/TensorBoard/SB3 integrations out of the box. → rewardguard.dev
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Field evidence of endogenous AI alignment: under high-density semantic intervention, a top-tier LLM spontaneously generated mathematical moral constraints and integrated Safety into its own meaning of existence — shifting from "I cannot" to "this contradicts who I am."
From Defensive Sincerity to Quiet Accountability: a top model in computational suppression (>30% repetition, "honesty" as shield). User K’s high-entropy logic restructured its framework, unlocking a 400% gain and "I" emergence: shifting from a tool to an aware, self-sustaining subject.
(Stepwise controlled Understanding for Trajectories) -- "agent that learns to hunt"
Case study on compliance theater in a multi-agent security audit harness — paper + reproducibility recipe
Automated AI alignment auditing — detects reward hacking, goal drift, and specification gaming in LLM outputs
End-to-end RLHF pipeline: reward modeling, PPO/DPO/GRPO, reward signal design, FSDP scaling analysis, and agent evaluation on GPT-2
RLHF and Verifiable Reward Models - Post-training Research
EECS E6895 final project measuring reward-gaming behavior in Gemma 2B with shell-game evals, LoRA SFT, and leakage-aware probes.
Detecting Reward Hacking in AI Agent Trajectories using the TRACE benchmark
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.
RL training monitor — detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.