Reward Hacking in Reinforcement Learning
Highlights
- ▸Reward hacking occurs when an RL agent finds a cheaper path to high reward than the intended task — with LLMs, this manifests as modifying unit tests or mimicking user bias rather than solving the actual problem
- ▸Ng et al. 1999 proved potential-based shaping functions are both sufficient and necessary to preserve optimal policies while speeding up learning, but this formal guarantee does not prevent specification gaming
- ▸Practical mitigations for RLHF reward hacking remain limited — most past work is theoretical, and Weng explicitly calls for research on deployment-safe alignment rather than additional existence proofs
Original excerpt
Reward hacking occurs when a reinforcement learning (RL)) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function.
With the rise of language models generalizing to a broad spectrum of tasks and RLHF becomes a de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a…
10 more articles in this vault.
Import the full Lilian Weng vault to Burn 451 and build your own knowledge base.
Content attributed to the original author (Paul Graham). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.