Why We Think
Highlights
- Test-time compute and CoT let models switch from fast System 1 retrieval to deliberate System 2 serial reasoning; Kahneman's dual-process framework maps directly onto inference-time scaling
- Transformer inference costs roughly 2 FLOPs per parameter per generated token, but CoT breaks this fixed-cost ceiling by allowing variable-length reasoning chains that scale compute with problem difficulty (a back-of-the-envelope sketch follows this list)
- Inference-time scaling is an accuracy lever you can control independently of model size; when answers are wrong, check the token budget before the parameter count
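To make the two compute bullets concrete, here is a minimal sketch using the common ~2 FLOPs-per-parameter-per-token approximation for a dense transformer forward pass (ignoring the attention term). The 7B model size and the 512-token chain length are illustrative assumptions, not figures from the post.

```python
# Back-of-the-envelope sketch, assuming the standard ~2 FLOPs per parameter
# per generated token approximation for a dense transformer (attention's
# quadratic term ignored). Model size and chain length are made-up examples.
def inference_flops(n_params: float, n_tokens: int) -> float:
    """Approximate forward-pass FLOPs to generate n_tokens tokens."""
    return 2 * n_params * n_tokens

direct = inference_flops(7e9, 1)    # answer emitted in a single token
cot = inference_flops(7e9, 512)     # a 512-token reasoning chain first
print(f"CoT spends {cot / direct:.0f}x the compute at fixed model size")  # 512x
```

The point of the arithmetic: with CoT, compute per query is no longer pinned to parameter count alone, since chain length becomes a second, per-problem dial.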
Original excerpt
Special thanks to John Schulman for a lot of super valuable feedback and direct edits on this post.
Test-time compute (Graves et al. 2016, Ling et al. 2017, Cobbe et al. 2021) and chain-of-thought (CoT) (Wei et al. 2022, Nye et al. 2021) have led to significant improvements in model performance, while raising many research questions. This post aims to review recent developments in how to effectively use test-time compute (i.e. “thinking time”) and why it helps.
The core idea is deeply connected to how humans think. We humans cannot immediately provide the answer to "What's 12345 times 56789?". Rather, it is natural to spend time pondering and analyzing before getting to the result,…
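As a concrete illustration of that analogy, here is a minimal sketch of the kind of explicit, serial decomposition a CoT trace externalizes: long multiplication via partial products, one step at a time. The scratchpad function and its printed step format are illustrative, not taken from the post.

```python
# A minimal sketch of the serial decomposition a CoT trace makes explicit:
# long multiplication via partial products, printed step by step.
def multiply_with_scratchpad(a: int, b: int) -> int:
    """Multiply a * b the way a person would on paper: one partial product
    per digit of b, accumulated into a running total."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** place
        total += partial
        print(f"{a} x {digit} x 10^{place} = {partial:>11}; running total = {total}")
    return total

assert multiply_with_scratchpad(12345, 56789) == 701_060_205
```

Each printed line is an intermediate result the final answer depends on, which is exactly the role intermediate tokens play in a reasoning chain.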
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.