Large Transformer Model Inference Optimization
Highlights
- KV cache for batch 512 + context 2048 = 3TB, 3× model size — memory footprint is the first wall before parameter count
- Five method families target different scarce resources: parallelism (throughput), offloading (memory), smart batching (throughput), compression (memory + speed), attention architecture (latency)
- Knowledge distillation transfers a teacher model's skill distribution to a smaller student via softened probability targets — not hard labels — per Hinton et al. 2015
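The softened-targets idea behind distillation can be sketched in a few lines of NumPy. The temperature value and logits below are purely illustrative (not taken from Hinton et al. 2015); the point is that dividing logits by a temperature T > 1 before the softmax flattens the teacher's distribution, exposing similarity structure between classes that one-hot labels discard:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits over 3 classes
teacher_logits = np.array([5.0, 2.0, 1.0])

hard_targets = softmax(teacher_logits)         # peaked, close to one-hot
soft_targets = softmax(teacher_logits, T=4.0)  # softened: relative class
                                               # similarities become visible
```

The student is then trained to match `soft_targets` (typically with a KL-divergence loss at the same temperature), rather than the hard argmax label.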
Original excerpt
Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a big bottleneck for adopting a powerful transformer for solving real-world tasks at scale.
Why is it hard to run inference for large transformer models? Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (Pope et al. 2022):
1. _Large memory footprint_. Both model parameters and intermediate states are needed in memory at inference time. For example,
    - The KV cache should be stored in memory during decoding time; e.g. for a…
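The KV-cache arithmetic can be sketched with a small helper. The hyperparameter values below (layer count, hidden width) are hypothetical and not tied to any specific published model — the article's own 3TB figure corresponds to a different, unspecified 500B-class configuration — and techniques such as multi-query attention shrink this footprint substantially:

```python
def kv_cache_bytes(batch_size, context_len, n_layers, hidden_dim, bytes_per_elem=2):
    """Decoder KV cache size: one K and one V tensor per layer, each of
    shape [batch_size, context_len, hidden_dim], stored in fp16 (2 bytes)."""
    return 2 * n_layers * batch_size * context_len * hidden_dim * bytes_per_elem

# Illustrative: a hypothetical 96-layer, 12288-wide decoder
# at batch size 512 and context length 2048.
gib = kv_cache_bytes(512, 2048, 96, 12288) / 2**30
print(f"KV cache ≈ {gib:,.0f} GiB")  # → KV cache ≈ 4,608 GiB
```

Note the cache grows linearly in both batch size and context length, which is why it overtakes the (fixed) parameter memory at serving scale.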
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.