SWE-bench Technical Report (Devin)
Blog | Cognition AI | Jun 14, 2024
AI Summary
Cognition's technical write-up of how Devin reaches 13.86% on SWE-bench, far above the previous 1.96% unassisted baseline. It documents the standardized prompt protocol, the deterministic unit-test evaluation, and the surprising result that 72% of passing solutions take more than 10 minutes, suggesting that iteration depth (not raw model capability) is the dominant factor. This is the benchmark report that pushed the field toward treating iteration budget as a first-class harness metric.
Content attributed to the original author (Cognition AI). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.