Reward Hacking in Reinforcement Learning

BlogApr 22, 2026

Original excerpt

Reward hacking occurs when a reinforcement learning (RL)) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function.

With the rise of language models generalizing to a broad spectrum of tasks and RLHF becomes a de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a…

Read full article on lilianweng.github.io

Frequently asked questions

What is "Reward Hacking in Reinforcement Learning" about?

This article by Lilian Weng is part of the Lilian Weng reading list on Burn 451, covering ai safety · post-training · llm internals.

Who wrote "Reward Hacking in Reinforcement Learning"?

This piece is part of the Lilian Weng vault on Burn 451, covering ai safety · post-training · llm internals. The original author is attributed at the source link.

How can I read more content from Lilian Weng?

The complete Lilian Weng reading list is available at burn451.cloud/vault/lilian-weng. Each article includes an AI-generated summary so you can decide what to read in seconds. Connect the Burn 451 MCP server to Claude or Cursor to query all Lilian Weng articles as live AI context.

Can I use "Reward Hacking in Reinforcement Learning" with Claude or Cursor?

Yes. Install the burn-mcp-server npm package and connect it to Claude Desktop, Claude Code, or Cursor. Once connected, your AI can search and reference this article and the full Lilian Weng vault in real time — no manual copy-paste required.

10 more articles in this vault.

Import the full Lilian Weng vault to Burn 451 and build your own knowledge base.

View Full Vault Get Burn 451

Content attributed to the original author. Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.