
François Chollet

AGI Evaluation & ARC-AGI

The François Chollet reading vault — ARC-AGI benchmark, the 'On the Measure of Intelligence' paper, Keras design philosophy, and why scaling alone won't reach AGI. Trending in 2026 following the launch of ARC Prize 2.

28 articles · Updated 5/11/2026 · Curated by @hawking520

About this vault

Curated reading vault of François Chollet — creator of Keras, author of 'On the Measure of Intelligence', co-founder of the ARC Prize, and one of the most prominent critics of scaling alone as a path to AGI. This vault collects his essays, papers, and talks on what intelligence actually is (skill-acquisition efficiency, not benchmark performance), why ARC-AGI remains a rare AI benchmark that resists pure scaling, the gap between current LLMs and AGI, and his Keras design philosophy. The vault is especially timely in 2026: ARC Prize 2 has launched, and Chollet's framework is center-stage in AGI discourse. Each piece has an AI summary so you can decide what to read in seconds.

28 articles

Chollet's Response to 'But ChatGPT Can Do This'

A thread addressing the common argument that because ChatGPT can perform impressively on various tasks, it must be approaching AGI. Chollet argues that task performance on training-distribution tasks is not evidence of general intelligence — it is evidence of the quality of training data and scale. He uses the analogy of a well-stocked library: a library can "answer" many questions correctly, but the library itself isn't intelligent. The intelligence (if any exists) was that of the writers who contributed to it. LLMs are very efficient libraries, but libraries nonetheless. The test for AGI is whether the "library" can write new books in domains it has never seen — which is exactly what ARC tests.

Machine Learning Street Talk: Chollet on the ARC Prize and Program Synthesis

Chollet discusses with MLST why program synthesis approaches outperformed pure LLM approaches in the 2024 ARC Prize. He explains the technical differences between the winning approaches — which used learned search guided by learned abstract reasoning — and pure language model approaches. The conversation covers why inductive logic programming, neurosymbolic systems, and learned program synthesis are the most promising paths forward for the type of reasoning ARC tests. Chollet also discusses why he expects ARC-AGI-2, launching in 2025, to take longer to crack than the original, given that many of the exploitable shortcuts have been closed.

Deep Learning with Python (3rd ed.) — Author Notes

Chollet's preface and author notes for the third edition of Deep Learning with Python, updated to cover the transformer era. The notes reflect on how much has changed since the first edition in 2017 — attention mechanisms have displaced RNNs entirely, transformers are the universal architecture, and the question has shifted from "can we build useful AI?" to "how do we understand and align it?" Chollet discusses why he included new material on the limitations of deep learning, the importance of interpretability, and the distinction between statistical learning and reasoning that motivates his ARC work. The book remains one of the most widely read practical deep learning textbooks.

The Difference Between Skill and Intelligence

Chollet distinguishes between skill (high performance on a specific task acquired through practice) and intelligence (the capacity to efficiently acquire new skills). The post argues that current AI discourse conflates these two properties — when GPT-4 writes good code or essays, it is demonstrating accumulated skill from training data, not intelligence per se. Intelligence would mean quickly mastering a completely novel domain with minimal examples. The post uses chess as an analogy: a chess grandmaster has extraordinary chess skill, but that skill doesn't transfer to novel games. Intelligence is what allows you to learn a new game quickly, not excel at one you've played 10,000 times.

NeurIPS 2019 Talk: The Measure of Intelligence

Chollet's NeurIPS 2019 presentation of his intelligence measurement framework. The talk covers the formal definition of intelligence, the design principles behind ARC, and why he believes the field needs a fundamentally different type of benchmark. The presentation includes live examples of ARC tasks to illustrate why they require genuine abstraction — not pattern matching — to solve. This is the original academic presentation of ideas that later reached a much wider audience through the ARC Prize. Notable: the Q&A session where Yoshua Bengio and others challenge Chollet's framework, and his responses defending the core distinction between skill and intelligence.

ARC Prize Blog: Why We Think Test-Time Compute Is Key

Post-2024-results analysis arguing that test-time compute (deliberate search at inference time) is the most promising direction for bridging the gap between LLM capabilities and ARC-level reasoning. The analysis covers why the top 2024 submissions all used forms of search rather than pure next-token prediction, and why this suggests the field should invest in better search algorithms guided by learned value functions rather than simply scaling pretraining. Chollet draws an analogy to AlphaGo: the breakthrough came not from memorizing more games but from combining learned value estimation with Monte Carlo Tree Search. ARC-solving systems may need an analogous combination.
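
The search-plus-value-function recipe described above can be sketched in a few lines. The grid DSL and scoring function below are invented for illustration — real ARC submissions use learned value functions over far richer program spaces — but the control flow (best-first search ordered by a value estimate, rather than pure next-token prediction) is the idea the post argues for:

```python
import heapq

# Toy "DSL": candidate programs are sequences of grid transforms.
# All names here are illustrative, not taken from any ARC Prize entry.
PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: list(reversed(g)),
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def run(program, grid):
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def value(program, pairs):
    # A learned value function would estimate how promising a partial
    # program is; here we simply use the fraction of training pairs
    # the program already solves.
    return sum(run(program, x) == y for x, y in pairs) / len(pairs)

def best_first_search(pairs, max_depth=3):
    # Negate the value so heapq's min-heap pops the most promising first.
    frontier = [(-value((), pairs), ())]
    while frontier:
        neg_v, prog = heapq.heappop(frontier)
        if -neg_v == 1.0:            # solves every training pair
            return list(prog)
        if len(prog) < max_depth:
            for name in PRIMITIVES:
                child = prog + (name,)
                heapq.heappush(frontier, (-value(child, pairs), child))
    return None
```

The AlphaGo analogy maps directly: the value function plays the role of learned position evaluation, and the frontier expansion plays the role of tree search.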

ARC Prize 2025: Announcing ARC-AGI-2 and the $2M Prize

Chollet and Mike Knoop announce the 2025 ARC Prize with a doubled $2 million total prize pool and ARC-AGI-2 as the new benchmark. The post explains what changed between ARC-AGI (original) and ARC-AGI-2 in terms of task design: more compositional tasks, explicit multi-step reasoning requirements, and the removal of shortcuts that allowed pattern-matching approaches to score above 50% in 2024. The 2025 prize explicitly targets the capability gap between the best 2024 systems (55.5%) and human performance (85%), which Chollet frames as the most important unsolved benchmark in AI.

Chollet on X: The Problem With Current AI Safety Framing

Chollet argues that much of the AI safety discourse is premised on a misunderstanding of what current systems can do. Fears of misaligned superintelligence assume a system that can generalize from limited goals to arbitrary capability — exactly the kind of generalization that ARC demonstrates current systems cannot perform. This doesn't mean AI is safe; it means the actual risks are different from the speculative superintelligence risks, and misidentifying the risk leads to misallocated safety research. Chollet advocates for safety research focused on real and present harms (bias, misuse, economic disruption) rather than speculative existential scenarios premised on capabilities systems don't yet have.

Chollet on X: Abstraction vs. Compression

Chollet draws a precise distinction between compression (finding compact representations of observed data) and abstraction (building executable programs that generalize to unobserved cases). LLMs are remarkable compressors — they encode vast amounts of human knowledge in compressed neural representations. But abstraction requires something more: the ability to construct new programs, not just recall compressed versions of old ones. ARC tasks test specifically for abstraction, not compression. This distinction has implications for interpretability research: interpreting an LLM's internal representations is studying compression; understanding AGI requires studying abstraction.

ARC Prize: Open Source ARC-AGI Solutions Breakdown

Analysis of the open-source ARC-solving approaches published after the 2024 competition. The most effective open-source approach combined LLM-based task description with a learned search over candidate programs — a practical implementation of the hybrid approach that topped the leaderboard. The post covers the specific techniques used, why certain approaches failed despite seeming promising, and what the open-source community should focus on for 2025. Chollet identifies the key bottleneck as not LLM capability but search efficiency: having a good value function for evaluating candidate solutions during search.

ARC Prize 2024: Launching the $1M Challenge

Chollet and Mike Knoop announce the $1 million ARC Prize to accelerate progress on the ARC-AGI benchmark. The post explains why ARC-AGI remains the hardest publicly available AI benchmark despite years of LLM scaling — GPT-4o achieved only 5% on the test, while humans score 85%. The challenge is structured to specifically reward systems that demonstrate genuine program synthesis and abstraction rather than pattern matching from training data. The post frames ARC not as a game or competition, but as a scientific instrument for measuring a specific type of cognitive capability that current AI systems demonstrably lack.

Chollet on X: AGI Is Not About Passing Turing Tests

Chollet argues against using conversational fluency or Turing test performance as an AGI marker. A system that sounds intelligent in conversation is demonstrating training distribution performance, not novel reasoning. The thread distinguishes between "social intelligence" (predicting what a human wants to hear, which LLMs excel at) and "abstract reasoning" (solving new problems by constructing internal representations, which ARC tests). AGI should require the latter. Chollet also addresses the common objection that "humans also just retrieve patterns" — he argues humans can demonstrably solve ARC-type novel tasks with minimal examples, which LLMs cannot.

Chollet on X: The Role of Prior Knowledge in Intelligence

Chollet discusses how ARC is designed to allow for certain universal priors (objectness, basic counting, symmetry recognition) while controlling for domain-specific prior knowledge. The distinction matters because intelligence should not be measured in a vacuum — all intelligent systems bring some prior knowledge. The question is whether the task can be solved with minimal, universal priors rather than extensive task-specific training. ARC tasks are specifically constructed so that a human with no mathematical training can solve them using only basic visual reasoning, but a sophisticated ML system cannot solve them despite training on billions of examples.

Chollet on X: Meta-Learning and the Path to General Intelligence

Chollet identifies meta-learning — learning to learn efficiently — as a promising direction for addressing the fundamental limitation of current ML systems. Meta-learning approaches train systems not to perform well on specific tasks but to quickly adapt to new tasks using few examples. This is directly aligned with the intelligence definition in his 2019 paper: efficiency of skill acquisition. The thread surveys existing meta-learning approaches (MAML, ProtoNets, few-shot learning), their current limitations, and why he believes combining meta-learning with program synthesis could produce systems capable of solving ARC-level tasks.
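
The inner/outer loop structure shared by MAML-style methods can be sketched on a toy problem. Everything here — the one-parameter task family y = a·x with a drawn from [1, 3], the learning rates, the first-order approximation that drops second derivatives — is an illustrative simplification, not code from any published implementation:

```python
import numpy as np

# First-order MAML sketch: learn an initialisation that adapts to a new
# task (a new slope a) from only five examples.
rng = np.random.default_rng(0)

def loss_grad(w, xs, ys):
    # d/dw of mean squared error: mean(2 * (w*x - y) * x)
    return np.mean(2 * (w * xs - ys) * xs)

def maml_step(w_meta, inner_lr=0.1, outer_lr=0.1, n_tasks=8):
    meta_grad = 0.0
    for _ in range(n_tasks):
        a = rng.uniform(1, 3)              # sample a task
        xs = rng.uniform(-1, 1, size=5)    # few-shot support set
        ys = a * xs
        # Inner loop: one gradient step of task-specific adaptation.
        w_task = w_meta - inner_lr * loss_grad(w_meta, xs, ys)
        # Outer loop: score the *adapted* parameter on fresh query data.
        xq = rng.uniform(-1, 1, size=5)
        meta_grad += loss_grad(w_task, xq, a * xq)
    return w_meta - outer_lr * meta_grad / n_tasks

w = 0.0
for _ in range(300):
    w = maml_step(w)
# w drifts toward the centre of the task family — an initialisation
# from which a single inner step adapts well to any sampled task.
```

Note that the meta-objective is post-adaptation loss, which is exactly the "skill-acquisition efficiency" framing: the system is rewarded for how well it performs after a small amount of new experience, not for how well it performs out of the box.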

Chollet on X: Intelligence Is Not About Information Storage

Chollet challenges the common framing that "more parameters = more intelligence." Parameters store learned patterns, but intelligence is not the quantity of stored patterns — it's the efficiency of constructing new ones. A hard drive stores more information than a human brain, but no one claims hard drives are intelligent. The thread draws the implication: making LLMs larger does improve certain forms of knowledge retrieval, but this improvement is not on the intelligence axis. The relevant axis is not "how much can you store" but "how quickly can you adapt to genuinely novel tasks with minimal examples."

Chollet on X: Program Synthesis as the Missing Ingredient

Chollet identifies program synthesis — the ability to construct executable programs from examples — as the key capability that separates current LLMs from systems that could solve ARC-style tasks. Program synthesis doesn't mean writing code in Python; it means internally constructing a step-by-step procedure that generalizes from input-output examples to unseen inputs. Humans do this effortlessly for simple grid puzzles. LLMs cannot do it reliably because they don't construct internal programs — they do weighted interpolation between memorized examples. The thread explains what "learned program synthesis" research programs like DreamCoder and Chollet's own work are attempting.
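
A minimal, purely enumerative synthesizer makes the idea concrete: search a space of composable operations for a program consistent with the input-output examples, then apply it to unseen inputs. The integer DSL below is invented for illustration and is far simpler than ARC's grid domain (systems like DreamCoder additionally learn which programs to try first):

```python
from itertools import product

# Toy DSL of integer transforms; names are illustrative only.
DSL = {
    "inc": lambda n: n + 1,
    "double": lambda n: n * 2,
    "neg": lambda n: -n,
}

def run(prog, n):
    for op in prog:
        n = DSL[op](n)
    return n

def synthesize(examples, max_len=3):
    # Enumerate programs shortest-first; return the first one that is
    # consistent with every input-output example.
    for length in range(1, max_len + 1):
        for prog in product(DSL, repeat=length):
            if all(run(prog, x) == y for x, y in examples):
                return list(prog)
    return None

# From two examples, recover a procedure — then apply it to new inputs.
prog = synthesize([(1, 4), (3, 8)])   # consistent with n -> (n + 1) * 2
```

The synthesized program generalizes by construction: `run(prog, 5)` applies the recovered procedure to an input never seen during synthesis, which is the property Chollet argues weighted interpolation lacks.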

Keras: Deep Learning for Humans

Chollet's design philosophy for Keras, which he built at Google as a human-centered deep learning API. The core principle: APIs should be designed for maximum clarity and usability, not maximum expressiveness. Keras prioritizes "readability" of code — a Keras model should be immediately understandable to someone who didn't write it. Chollet's insistence on "progressive disclosure of complexity" (simple defaults for common use cases, with full flexibility available when needed) became influential across software API design beyond ML. The philosophy reflects his broader belief that tools shape thinking, and overly complex tools create practitioners who think in tool-specific terms rather than problem-specific terms.
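
The "progressive disclosure of complexity" principle is easy to illustrate with a toy API. This is a generic sketch in the spirit of the philosophy, not Keras code — the class and its parameters are invented for illustration:

```python
# Progressive disclosure: the zero-argument path does the right thing
# for the common case, while every behaviour remains overridable.
class TextCleaner:
    def __init__(self, lowercase=True, strip_chars=".,!?", tokenizer=None):
        self.lowercase = lowercase
        self.strip_chars = strip_chars
        # Advanced users may inject their own tokenizer callable;
        # everyone else gets a sensible default.
        self.tokenizer = tokenizer or (lambda text: text.split())

    def __call__(self, text):
        if self.lowercase:
            text = text.lower()
        text = text.translate(str.maketrans("", "", self.strip_chars))
        return self.tokenizer(text)

simple = TextCleaner()                 # common case: zero configuration
custom = TextCleaner(lowercase=False,  # full control when needed
                     tokenizer=lambda t: list(t))
```

The design choice mirrors the Keras principle described above: the simple call site stays readable to someone who didn't write it, and complexity only appears when the user asks for it.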

François Chollet on Lex Fridman Podcast: ARC-AGI and the Nature of Intelligence

In this three-hour Lex Fridman podcast, Chollet argues that LLMs are sophisticated interpolation engines — they excel at retrieving and remixing patterns from training data but cannot perform the kind of systematic program synthesis that ARC tasks require. He distinguishes between "memorization plus interpolation" (what LLMs do) and "abstraction plus extrapolation" (what intelligence requires). Chollet explains why he believes the path to AGI runs through neurosymbolic approaches or learned program synthesis rather than scale alone, and why the ARC benchmark specifically targets the capability gap that no amount of more training data will close.

On the Measure of Intelligence

Chollet's landmark 2019 paper argues that current AI benchmarks measure crystallized task-specific skill rather than general intelligence. He proposes a formal definition of intelligence as "skill-acquisition efficiency" — how quickly a system can adapt to novel tasks given limited experience — and introduces the Abstraction and Reasoning Corpus (ARC) as a benchmark that measures this efficiency rather than performance on memorized tasks. The paper distinguishes between developer prior, experience, and skill, arguing that a fair test of intelligence must control for all three. This framework directly challenges scaling as the path to AGI: a system that achieves high ARC scores through brute-force training has not demonstrated intelligence in the meaningful sense.

Chollet on X: Why LLMs Cannot Solve ARC Through Scale Alone

Thread explaining why the ARC benchmark is specifically designed to resist the type of performance improvement that scaling data and parameters provides. Chollet argues that LLMs improve at tasks similar to their training distribution, but ARC tasks are explicitly designed to be out-of-distribution — each puzzle is novel by construction. Training on more ARC tasks would help on those specific tasks but would not improve performance on new novel tasks (which is the whole point). This is a clear illustration of why "more data + bigger model" cannot solve ARC: ARC is measuring something that is not improved by those interventions.

Chollet on X: Why Current Scaling Will Not Produce AGI

Chollet argues that current scaling trajectories will not produce AGI because they optimize for a different objective: performance on training distribution tasks. Scaling improves in-distribution performance efficiently, but AGI requires out-of-distribution generalization (what ARC measures) which current training objectives don't directly reward. The thread is not an argument that scaling is useless — it's an argument that scaling alone (same objective, more data, more parameters) cannot bridge the gap to genuine generalization. A different ingredient — possibly something like learned program synthesis, meta-learning, or a fundamentally different training objective — is needed.

ARC Prize Blog: The Gap Between Narrow and General AI

A broad audience explainer distinguishing narrow AI (systems that do one thing very well through extensive task-specific training) from general AI (systems that efficiently adapt to new tasks). The post uses concrete examples: GPT-4 is excellent at writing English prose because it trained on billions of English documents — that's narrow AI, even if the task is cognitively complex. ARC tasks are narrow by design in their structure but general in their novelty requirement: they test whether a system can adapt to task variants it has never seen with no additional training. The post is Chollet's most accessible statement of why "impressive AI performance" is not the same as "progress toward AGI."

ARC-AGI GitHub Repository and Dataset Documentation

The official ARC-AGI dataset repository with Chollet's documentation explaining the design principles. The README explains the four core priors ARC tasks rely on (objectness, goal-directedness, counting/arithmetic, and symmetry), why each task is novel by construction, and why human-level performance requires only these simple priors rather than domain knowledge. The documentation is essential reading for anyone attempting to build a system that solves ARC — it clarifies exactly what prior knowledge is "allowed" and how tasks were designed to be solvable by anyone with basic human reasoning without domain expertise.
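
Each task in the repository is a JSON file with "train" and "test" lists of input/output pairs, where a grid is a list of rows of integers 0-9 (color codes). The task below is a made-up example in that format, sketching how a solver might load one:

```python
import json

# A fabricated task in the ARC JSON format (the real dataset ships
# hundreds of such files, one task per file).
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(task_json)

def grid_shape(grid):
    return (len(grid), len(grid[0]))

# A solver sees only the train pairs and must produce outputs for the
# test inputs; the test outputs exist solely for scoring.
train_pairs = [(ex["input"], ex["output"]) for ex in task["train"]]
test_input = task["test"][0]["input"]
```

The few-shot structure is the benchmark's whole premise: two or three demonstration pairs are all the experience a solver gets for each novel task.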

Chollet on X: o3's ARC-AGI Score and What It Means

Chollet's landmark X thread reacting to OpenAI's o3 model achieving 87.5% on ARC-AGI with heavy test-time compute. He frames this as significant but context-dependent — o3 achieves the score using what he estimates as roughly $10,000-$100,000 of compute per task, which is not the efficiency that intelligence implies. He draws a distinction between the result being impressive and the method being efficient, arguing that the path from o3 to genuine AGI still requires combining the core capability with much greater efficiency. The thread is among the most shared pieces of Chollet's writing in 2024 and shaped the broader discourse around o3.

Chollet on X: What the ARC Semi-Private Eval Means

Explanation of why ARC uses a semi-private evaluation set — tasks not publicly available at test time — to prevent overfitting to the test distribution. Chollet explains that any publicly released benchmark becomes a training target: researchers fine-tune models on it, which inflates scores without reflecting genuine capability improvement. The semi-private eval is designed to measure actual generalization by ensuring models cannot be pre-trained on the test tasks. This design choice is philosophically important: it prioritizes measurement integrity over convenience and reflects Chollet's overall position that benchmark design is the crux of AI progress measurement.

Dwarkesh Patel Interviews Chollet: Why GPT-4 Fails at ARC

Chollet explains to Dwarkesh Patel why he believes LLMs have hit a fundamental ceiling for tasks requiring genuine novel reasoning. The interview covers why the ARC tasks — simple visual grid puzzles solvable by children — completely stump state-of-the-art LLMs, while programs like the 2024 ARC Prize winner made meaningful progress using hybrid approaches. Chollet discusses the o3 breakthrough (achieving 87.5% with heavy test-time compute) as evidence that the right direction involves deliberate search rather than more training data, while noting that o3's approach has significant practical limitations due to compute cost.

ARC Prize 2024 Technical Report: What We Learned

The official post-competition analysis of the 2024 ARC Prize results. The top-scoring submission achieved 55.5% by combining a fine-tuned LLM with a learned search procedure — demonstrating that hybrid approaches significantly outperform pure LLM prompting (which peaks around 30%). The analysis explains what technical approaches worked, what didn't, and why the 40-55% score range was particularly hard to penetrate. Chollet draws lessons for ARC-AGI-2 design and for the broader field: the combination of learned representations with explicit search appears more promising than either alone.

ARC-AGI-2: Why the New Benchmark Is Even Harder

Chollet announces ARC-AGI-2, a harder successor benchmark designed after the top 2024 ARC Prize entry achieved 55.5% using hybrid neurosymbolic approaches. The new benchmark introduces more compositional reasoning tasks, longer chains of inference, and fewer visual shortcuts. ARC-AGI-2 aims to distinguish genuine compositional reasoning from the pattern-matching strategies that allowed top 2024 submissions to score over 50%, closing off the 40-55% range in which the best 2024 systems plateaued. Human baseline on ARC-AGI-2 remains above 80%.

Start reading, not hoarding.

Import this vault to Burn 451 and actually read what matters.

Frequently asked questions

Who is François Chollet?

François Chollet is the creator of Keras, the author of 'On the Measure of Intelligence', and a co-founder of the ARC Prize. This Burn 451 vault focuses on his work on AGI evaluation and ARC-AGI: the ARC-AGI benchmark, the intelligence measurement framework, Keras design philosophy, and why scaling alone won't reach AGI — a trending topic in 2026 following the launch of ARC Prize 2.

How was the François Chollet vault curated?

The François Chollet vault was hand-curated by the Burn 451 editorial team from publicly available essays, blog posts, podcast transcripts, and social threads. Each piece includes an AI-generated summary so readers can triage in seconds. The vault auto-syncs as new content from François Chollet is published.

How many articles are in the François Chollet vault?

The François Chollet vault currently contains 28 curated pieces organized by topic, not chronology. Each article has an AI summary and a direct link to the original source. Items are refreshed hourly through Burn 451's ISR pipeline, so new publications appear within a day.

How do I use this vault with Claude or Cursor?

Install the burn-mcp-server package from npm and connect it to Claude, Cursor, or any MCP-compatible AI tool. The vault becomes queryable as live context — your AI can search, summarize, and cite articles from François Chollet directly in conversation without manual copy-paste or re-uploading files.

What is Burn 451?

Burn 451 is a read-later app built around a 24-hour burn timer that forces daily triage. Articles you save must be read, vaulted, or released within 24 hours. The Vault layer — including this François Chollet collection — holds permanent curated reading lists for AI thought leaders, founders, and researchers.

Content attributed to original authors. Burn 451 curates publicly available writing as a reading index. For removal requests, contact @hawking520.