Simon Willison
LLM Blog
Simon Willison's LLM blog, from the writer who coined 'prompt injection' and 'agentic engineering patterns'. Hands-on frontier model reviews, Claude Code deep dives, and the Datasette ecosystem, from simonwillison.net.
“Simon Willison's LLM blog and AI tools reading list — curated posts on coding agents, datasette, and the evolving AI tooling landscape.”
About this vault
Simon Willison coined 'prompt injection' and 'agentic engineering' — his LLM blog is essential reading for AI engineers. Curated from simonwillison.net: frontier model benchmarks via his pelican-on-a-bicycle test, agentic engineering patterns, Claude Code case studies, and the Datasette ecosystem he maintains. Auto-synced daily. Each piece has an AI summary so you can triage in seconds.
20 articles
Agentic Engineering & Vibe Coding
How AI coding agents are reshaping software development. Simon Willison's 'Agentic Engineering Patterns' newsletter and case studies on vibe coding, prompt injection, and Claude Code workflows — from one of the web's most-cited LLM bloggers.
Eight years of wanting, three months of building with AI
Simon Willison's favorite long-form piece on agentic engineering: Lalit Maganti spent 8 years thinking about building SQLite devtools, then 3 months shipping it with Claude Code. Case study on how AI coding agents flatten the 'I need to fully understand this' procrastination curve.
GLM-5.1: Towards Long-Horizon Tasks
Z.ai's new 754B parameter MIT-licensed model targets long-horizon agentic tasks. Simon's pelican benchmark produces a surprisingly good SVG plus unprompted CSS animations — a first from any open weights model.
Highlights from my conversation about agentic engineering on Lenny's Podcast
Simon Willison on Lenny Rachitsky's podcast: we've passed the November 2025 inflection point on coding agents. Software engineers are the bellwether for how other information workers will adopt AI — responsible vibe coding, writing code on your phone, and what comes next.
Vibe coding SwiftUI apps is a lot of fun
Simon vibe codes macOS apps with Claude Opus 4.6 and GPT-5.4 — a full SwiftUI app fits in a single text file, no Xcode required. He builds Bandwidther (network monitor) and Gpuer (GPU monitor) as menu bar tools in one sitting.
My fireside chat about agentic engineering at the Pragmatic Summit
Simon's fireside chat on the stages of agentic engineering: from 'ChatGPT as occasional helper' to 'the agent writes more code than I do' to 'supervising multiple parallel agents.' The inflection points that separate the phases — and the skills each one demands.
Perhaps not Boring Technology after all
Two years ago LLMs pushed developers toward popular languages because training data was richer. Simon argues this is no longer true — modern coding agents consume docs via 'uvx tool --help' and work fine with brand new, obscure libraries. The 'boring technology' argument may be dead.
Can coding agents relicense open source through a "clean room" implementation of code?
The chardet library debate: can coding agents pull the Compaq BIOS clean-room trick in hours instead of months? Simon on the legal and ethical gray zone of AI-generated 'clean room' reimplementations that sidestep copyleft licenses.
Steve Yegge on AI adoption at Google
A Google insider claims 60% of engineers are still on basic chat tools and 20% are agentic power users, the same adoption curve as John Deere. Demis Hassabis publicly rebutted the claim. The exchange captures the messy middle of AI adoption, even at frontier labs.
Frontier LLM Releases & Benchmarks
Simon's hands-on notes on the latest frontier LLMs — Muse Spark, GLM-5.1, GPT-5.4 mini/nano, Claude Mythos, Google Gemma 4. Real-world testing via the legendary pelican-on-a-bicycle benchmark.
Google AI Edge Gallery
Google's official iOS app for running Gemma 4 models locally on iPhone. The 2.54GB E2B model is fast and actually useful — image Q&A, audio transcription, and tool calling demos across 8 interactive HTML widgets.
Meta's new model is Muse Spark, and meta.ai chat has some interesting tools
Meta's first major release since Llama 4 — Muse Spark launches as hosted-only with Instant/Thinking modes and benchmarks competitive with Opus 4.6 and Gemini 3.1 Pro. Simon runs it through his signature pelican-on-a-bicycle benchmark.
Anthropic's Project Glasswing—restricting Claude Mythos to security researchers—sounds necessary to me
Anthropic held back Claude Mythos from general release due to its offensive security capabilities — the model found thousands of zero-days across every major OS and browser. Project Glasswing restricts access to security researchers until defenders catch up.
Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer
A 340M-parameter LLM trained entirely on 28,000 out-of-copyright Victorian British Library texts — zero post-1899 input. A fascinating answer to the training data copyright debate: what if you could train an LLM that couldn't possibly be stealing from anyone?
GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52
GPT-5.4 nano lands at $0.20/$1.25 per million input/output tokens, cheaper than Gemini 3.1 Flash-Lite. Simon calculates it can describe 76,000 photos for $52, which suggests image description at scale is now close to free.
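As a rough check on what that figure means per image, the arithmetic can be sketched as below. This uses only the two numbers quoted in the summary (76,000 photos, $52). The function name is illustrative, and the per-photo token counts behind Simon's estimate are not reproduced here.

```python
# Figures quoted in the summary above.
TOTAL_COST_USD = 52.0
PHOTO_COUNT = 76_000


def cost_per_photo(total_usd: float, photos: int) -> float:
    """Average spend per described photo, in US dollars."""
    if photos <= 0:
        raise ValueError("photos must be positive")
    return total_usd / photos


per_photo = cost_per_photo(TOTAL_COST_USD, PHOTO_COUNT)
per_thousand = per_photo * 1000
print(f"${per_photo:.5f} per photo, ${per_thousand:.2f} per 1,000 photos")
# prints: $0.00068 per photo, $0.68 per 1,000 photos
```

At well under a tenth of a cent per photo, the cost of describing a personal photo library is small compared with the cost of storing it.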
Gemma 4 audio with MLX
One-line uv recipe to run Gemma 4 audio transcription locally on macOS with mlx-vlm. Demo has minor misheard words but runs fully offline. The 2026 local-first AI pattern in 8 commands.
Python, Datasette & Open Source
The Python and open source ecosystem: Starlette 1.0, OpenAI's Astral acquisition (uv/ruff/ty), clean-room reimplementation debates, and the evolving role of maintainers in the age of coding agents.
The Axios supply chain attack used individually targeted social engineering
The Axios supply chain attack postmortem reveals a new threat model: attackers didn't exploit code — they cloned a company's founder and ran a personalized social engineering campaign via Slack. Open source maintainers are now the attack surface.
Experimenting with Starlette 1.0 with Claude skills
Starlette 1.0 finally ships — the Python ASGI framework that powers FastAPI and half the modern Python web stack. Simon uses Claude skills to explore the new release and port his own asgi-gzip middleware.
Thoughts on OpenAI acquiring Astral and uv/ruff/ty
OpenAI acquires Astral — the company behind uv, ruff, and ty, three load-bearing pieces of modern Python tooling. Simon on what it means for the Python ecosystem, open source sustainability, and why this acquisition is different from past big-co OSS takeovers.
Security & Investigative Analysis
Security research and investigative experiments: supply chain attacks, social engineering threat models, LLM-powered user profiling, and the new attack surfaces that emerge when AI meets open source.
More from Simon Willison
Research notes, developer tools, and AI analysis from one of the web's most-read LLM bloggers.
ChatGPT voice mode is a weaker model
A non-obvious fact: ChatGPT voice mode runs a GPT-4o-era model from April 2024, while Codex can work for hours on complex tasks. Simon pairs this with a Karpathy thread explaining why verifiable domains such as code get disproportionate training attention compared with writing and voice.
Gemini 3.1 Flash TTS
Google ships a prompt-directed TTS model. Simon documents the surprisingly baroque prompt format — character profiles, scene setup, director notes — as the interface for controlling voice. Representative of the 2026 shift where every modality gets its own prompt engineering dialect.
Content attributed to original authors. Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.