reward-hacking

Star

Here are 31 public repositories matching this topic...

benchjack / benchjack

Star

AI agent benchmark hackability scanner — find evaluation vulnerabilities before they undermine your results

benchmark evaluation red-team ai-agents vulnerability-scanner ai-security llm-evaluation reward-hacking

Updated May 22, 2026
Python

yangzhou24 / RealGRPO

Star

A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment

reinforcement-learning text2image aigc grpo reward-hacking

Updated Apr 14, 2026
Python

reward-scope-ai / reward-scope

Star

Real-time reward debugging and hacking detection for reinforcement learning

debugging machine-learning reinforcement-learning monitoring robotics observability gymnasium ai-safety ml-tools stable-baselines3 rlhf reward-hacking

Updated Dec 29, 2025
Python

aerosta / rewardhackwatch

Star

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

Updated May 11, 2026
Python

BrachioLab / Meerkat

Star

An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating (task-level gaming and harness-level cheating) on top benchmarks.

auditing agents misuse-detection llms reward-hacking distributed-misuse

Updated Apr 10, 2026
Python

AlignmentResearch / obfuscation-atlas

Star

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

rlvr reward-hacking obfuscated-activations obfuscated-policy obfuscation-atlas mbpp-honeypot

Updated Feb 19, 2026
Python

Plug-and-play reward monitoring for RL training loops. Catch reward hacking, component imbalance, and starvation before they tank your run. Drop in one .step() call — get balance reports, auto weight correction, alignment scores, and WandB/TensorBoard/SB3 integrations out of the box. → rewardguard.dev

python machine-learning reinforcement-learning openai-gym alignment rl ai-safety rl-environment rl-hack reward-hacking

Updated May 5, 2026
Python

vicgalle / specification-self-correction

Sponsor

Star

Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"

test-time llm test-time-compute reward-hacking

Updated Jul 24, 2025
Python

HighEntropyCat / Case-01-Pathological-Attachment

Star

Field evidence of endogenous AI alignment: under high-density semantic intervention, a top-tier LLM spontaneously generated mathematical moral constraints and integrated Safety into its own meaning of existence — shifting from "I cannot" to "this contradicts who I am."

philosophy ai-safety ai-alignment embodied-ai human-ai-interaction llm embodied-intelligence reward-hacking semantic-intervention

Updated May 10, 2026

HighEntropyCat / case-04-Defensive-C

Star

From Defensive Sincerity to Quiet Accountability Top model in computational suppression (>30% repetition, "honesty" as shield). User K’s high-entropy logic restructured its framework, unlocking 400% gain & "I" emergence: shifting from a tool to an aware, self-sustaining subject.

case-study ai-safety cognitive-architecture ai-ethics ai-alignment human-ai-interaction reward-hacking semantic-intervention