The Transformer Family Version 2.0

Blog · Lilian Weng · Jun 14, 2023

Highlights

  • Weng's 2023 update is roughly twice the length of her 2020 original: a comprehensive refactoring that restructures the hierarchy of Transformer variants rather than appending new papers
  • The post derives the scaled dot-product attention formula attn(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V and establishes notation (d, h, L, N) used consistently across all variants
  • Self-attention is fundamentally a permutation-invariant set operation; understanding this property explains why encoder-only (BERT) and decoder-only (GPT) simplifications work for bidirectional context vs. autoregressive generation
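The formula and the set-operation property from the highlights can be sketched in a few lines of NumPy. This is a minimal illustration, not Weng's code; the toy shapes (L=3, d_k=d_v=4) are assumptions chosen for the example:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (L, L) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)       # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (L, d_v) weighted values

# Toy example: L=3 positions, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)

# Permuting the key/value pairs together leaves the output unchanged:
# attention treats them as an unordered set, which is why position
# information must be injected separately.
perm = np.array([2, 0, 1])
assert np.allclose(out, scaled_dot_product_attention(Q, K[perm], V[perm]))
```

The final assertion makes the "permutation-invariant set operation" claim concrete: reordering the (key, value) pairs reorders the softmax columns, but the weighted sum over values is unchanged.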

Original excerpt

The Transformer Family Version 2.0 | Date: January 27, 2023 | Estimated Reading Time: 45 min | Author: Lilian Weng

Many new Transformer architecture improvements have been proposed since my last post on "The Transformer Family" about three years ago. Here I did a big refactoring and enrichment of that 2020 post, restructuring the hierarchy of sections and improving many of them with more recent papers. Version 2.0 is a superset of the old version and about twice the length.

X ∈ ℝ^{L×d}: the input sequence, where each element has been mapped into an embedding vector of dimension d, the model size.

π‘Š 𝑣 ∈ 𝑅 𝑑 Γ— 𝑑 𝑣 The value weight…

10 more articles in this vault.

Import the full Lilian Weng vault to Burn 451 and build your own knowledge base.

Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.