SWE-bench Technical Report (Devin)

Blog | Cognition AI | Jun 14, 2024

AI Summary

Cognition's technical write-up of how Devin reaches 13.86% on SWE-bench, far above the previous unassisted baseline of 1.96%. It documents the standardized prompt protocol and the deterministic unit-test evaluation, and reports the surprising result that 72% of passing solutions take more than 10 minutes, suggesting that iteration depth, not raw model capability, is the dominant factor. This is the benchmark report that pushed the field toward treating iteration budget as a first-class harness metric.
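
As a rough illustration of the deterministic unit-test evaluation mentioned above, the sketch below assumes the usual SWE-bench convention: a task counts as resolved only if the tests that failed before the fix now pass and the tests that already passed still pass. The names `InstanceResult`, `is_resolved`, and `resolve_rate`, along with the toy data, are hypothetical and chosen for illustration; this is not Cognition's harness code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of running one task's unit tests after the agent's patch is applied."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: InstanceResult) -> bool:
    """A task is resolved only if every target test passes and nothing regressed."""
    # Require at least one target test so an empty result cannot count as a pass.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[InstanceResult]) -> float:
    """Percentage of tasks resolved, rounded to two decimals."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return round(100.0 * resolved / len(results), 2)


# Toy data invented for illustration only (not taken from the report).
toy_results = [
    InstanceResult("task-a", {"test_fix": True}, {"test_old": True}),   # resolved
    InstanceResult("task-b", {"test_fix": False}, {"test_old": True}),  # fix failed
    InstanceResult("task-c", {"test_fix": True}, {"test_old": False}),  # regression
    InstanceResult("task-d", {"test_fix": True}, {}),                   # resolved
]
print(resolve_rate(toy_results))
```

Because the verdict depends only on test outcomes, the same patch always receives the same score, which is what makes a headline figure such as 13.86% comparable across systems.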

Read full article on cognition.ai


Content attributed to the original author (Cognition AI). Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.