Prompt Engineering
Highlights
- Few-shot performance variance is structural, not random — Zhao et al. 2021 isolate three biases in GPT-3: majority label, recency, and common token bias, each producing systematic output skews
- Weng's critique: many prompt engineering papers stretch explainable-in-sentences tricks into 8-page benchmarking exercises; the community needs shared benchmark infrastructure more than incremental demonstrations
- Treat prompt construction as a controlled experiment, not a craft — vary example selection, ordering, and label balance one at a time with held-out measurement rather than optimizing a single working prompt
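The controlled-experiment idea above can be sketched in a few lines: hold the few-shot examples and query fixed and vary only the ordering, producing one prompt per permutation. The example texts, labels, and formatting below are hypothetical placeholders, not drawn from Zhao et al.; scoring each variant against a held-out set is left as the step that would surface recency or majority-label skew.

```python
from itertools import permutations

# Hypothetical labeled examples (balanced 2 positive / 2 negative).
EXAMPLES = [
    ("The movie was great.", "positive"),
    ("I hated every minute.", "negative"),
    ("A stunning achievement.", "positive"),
    ("Dull and forgettable.", "negative"),
]

def build_prompt(examples, query):
    """Format few-shot examples followed by the query to classify."""
    shots = "\n".join(f"Review: {t}\nSentiment: {y}" for t, y in examples)
    return f"{shots}\nReview: {query}\nSentiment:"

def ordering_variants(examples, query):
    """One prompt per permutation: content fixed, only ordering varies."""
    return [build_prompt(list(p), query) for p in permutations(examples)]

prompts = ordering_variants(EXAMPLES, "Not bad at all.")
# 4 examples -> 24 orderings; scoring each ordering on held-out data
# (not shown) is what exposes systematic ordering sensitivity.
print(len(prompts))
```

Varying label balance or example selection would follow the same pattern: change exactly one factor per batch of prompts and compare held-out accuracy across batches.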
Original excerpt
Prompt Engineering, also known as In-Context Prompting, refers to methods for communicating with LLMs to steer their behavior toward desired outcomes _without_ updating the model weights. It is an empirical science, and the effect of prompt engineering methods can vary a lot across models, thus requiring heavy experimentation and heuristics.
This post focuses only on prompt engineering for autoregressive language models, so nothing on Cloze tests, image generation, or multimodal models. At its core, the goal of prompt engineering is alignment and model steerability. Check my previous post on controllable text generation.
[My personal spicy take] In my opinion, some prompt engineering…
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.