Thinking about High-Quality Human Data
Highlights
- Data quality is a pipeline with distinct failure modes at each step — task design, rater selection, aggregation, and QA — and majority voting alone cannot fix errors introduced upstream
- The 'wisdom of the crowd' effect was documented in a 1907 Nature paper on ox-weight guessing and replicated in 2009 AMT machine-translation evaluations; aggregated estimates beat individual guesses only when annotator skill matches task complexity
- Different inter-rater agreement metrics fit different settings: Cohen's κ for 2 raters, Fleiss' κ for N raters, and Krippendorff's α for ordinal/interval data — using the wrong metric hides systematic bias that raw accuracy misses (a minimal agreement-metric sketch follows this list)
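Below is a minimal sketch of how these agreement metrics and a majority-vote baseline might be computed, assuming scikit-learn and statsmodels are installed; the three-rater label matrix is hypothetical and exists only to illustrate the metrics named above.

```python
# Minimal sketch: inter-rater agreement on a toy labeling task.
# The ratings below are hypothetical; library choices (scikit-learn,
# statsmodels) are an assumed setup, not prescribed by the post.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical labels from 3 raters on 8 items (categories 0/1/2).
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [2, 1, 2],
    [0, 0, 1],
    [1, 1, 1],
])

# Cohen's kappa: chance-corrected agreement between exactly two raters.
kappa_01 = cohen_kappa_score(ratings[:, 0], ratings[:, 1])

# Fleiss' kappa: generalization to N raters; takes an item-by-category count table.
count_table, _ = aggregate_raters(ratings)
kappa_fleiss = fleiss_kappa(count_table, method="fleiss")

# Majority vote as the simplest aggregation baseline (ties broken by lowest label).
majority = np.array([np.bincount(row).argmax() for row in ratings])

print(f"Cohen's kappa (raters 0 vs 1): {kappa_01:.3f}")
print(f"Fleiss' kappa (all 3 raters):  {kappa_fleiss:.3f}")
print("Majority-vote labels:", majority.tolist())
```

For ordinal or interval labels, or data with missing ratings, Krippendorff's α is the usual choice; a third-party `krippendorff` package implements it if you do not want to compute it by hand.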
Original excerpt
Special thank you to Ian Kivlichan for many useful pointers (e.g. the 100+ year old Nature paper “Vox Populi”) and nice feedback. 🙏
High-quality data is the fuel for modern deep learning model training. Most task-specific labeled data comes from human annotation, such as classification tasks or RLHF labeling (which can be framed as classification) for LLM alignment training. Many ML techniques in the post can help with data quality, but fundamentally human data collection involves attention to detail and careful execution. The community knows the value of high-quality data, but somehow we have this subtle impression that “Everyone wants to do the model work,…
Content attributed to the original author (Lilian Weng). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.