Large Transformer Model Inference Optimization
Original excerpt
Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a big bottleneck for adopting a powerful transformer for solving real-world tasks at scale.
Why is it hard to run inference for large transformer models? Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (Pope et al. 2022):
1. _Large memory footprint_. Both model parameters and intermediate states are needed in memory at inference time. For example,
   * The KV cache should be stored in memory during decoding time; e.g. for a…
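
To make the memory-footprint point concrete, here is a minimal sketch of how KV cache size can be estimated for a standard multi-head attention decoder. The function name `kv_cache_bytes` and every configuration value below are illustrative assumptions, not figures from the article. The sketch assumes one key tensor and one value tensor are cached per layer, each of shape (batch, heads, sequence length, head dimension).

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a standard multi-head attention decoder.

    Assumes one K and one V tensor per layer, each holding
    batch_size * n_heads * seq_len * head_dim elements.
    bytes_per_elem=2 corresponds to a 16-bit dtype such as fp16.
    """
    elems_per_tensor = batch_size * n_heads * seq_len * head_dim
    return 2 * n_layers * elems_per_tensor * bytes_per_elem  # factor 2: keys and values


# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=2048, batch_size=8)
print(f"{size / 1024**3:.1f} GiB")  # prints "8.0 GiB"
```

In this estimate the cache grows linearly with both batch size and context length, which is why it can dominate memory use when serving many long sequences at once.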
This article by Lilian Weng is part of the Lilian Weng reading list on Burn 451, covering ai safety · post-training · llm internals.