{"componentChunkName":"component---src-templates-blog-post-js","path":"/why-hasnt-longer-horizon-training-slowed-ai-progress/","result":{"data":{"site":{"siteMetadata":{"title":"sean goedecke"}},"markdownRemark":{"id":"be2c4209-dd6d-57c8-bb52-3a26e158a359","excerpt":"Dwarkesh Patel recently posted an award for the best answers to four key questions about AI. It’s partly a challenge and partly a job interview, since some of…","html":"<p>Dwarkesh Patel<sup id=\"fnref-1\"><a href=\"#fn-1\" class=\"footnote-ref\">1</a></sup> recently <a href=\"https://www.dwarkesh.com/p/blog-prize\">posted</a> an award for the best answers to four key questions about AI. It’s partly a challenge and partly a job interview, since some of the winners will get offered a role as a “research collaborator”. I don’t want the job, but I do want to write down my answer to his first question: <strong>why hasn’t AI progress slowed down more?</strong></p>\n<p>There are a few reasons we might think AI progress would slow down. The particular reason Dwarkesh is interested in goes like this. Training a model (specifically reinforcement learning) requires the model to perform a task and then get “graded” on the output. As models get more powerful and tasks become harder, they take longer and require more FLOPs<sup id=\"fnref-2\"><a href=\"#fn-2\" class=\"footnote-ref\">2</a></sup> to complete, and thus more FLOPs to train: thus training harder models will take longer.</p>\n<p>But intuitively, AI progress hasn’t slowed down that much. The famous METR horizon-length <a href=\"https://metr.org/time-horizons/\">graph</a> shows that AI systems are capable of more and more complex tasks over time, and that this process is accelerating, not slowing down. Why would that be?</p>\n<h3>What’s in a FLOP?</h3>\n<p>Firstly, <strong>it might just be the case that newer models are benefiting from orders of magnitude more FLOPs</strong>. Of course, AI labs aren’t standing up orders of magnitude more GPUs (they’re trying, but there are hard physical limits on how fast you can scale up a physical datacenter). But it’s certainly possible that they’re learning to use their existing FLOPs orders of magnitude more efficiently.</p>\n<p>The efficiency of complex software systems - and the training code for a frontier AI model certainly qualifies - is not typically determined by the number of genius ideas in it. It is determined by the number of boneheaded mistakes. Take <a href=\"https://www.dwarkesh.com/p/what-i-learned-april-15\">this story</a><sup id=\"fnref-3\"><a href=\"#fn-3\" class=\"footnote-ref\">3</a></sup> of how the initial GPT-4 training run used FP16 when summing many small values, which will <em>completely</em> mess up your results if the sum of those values is large. How much training-efficiency-per-FLOP does solving bugs like that buy? Plausibly enough to outweigh any inherent lack of efficiency from training more powerful models.</p>\n<h3>People are bad at judging intelligence</h3>\n<p>Secondly, <strong>intuitions about the speed of AI progress <a href=\"/are-new-models-good\">are weird and unreliable</a></strong>. Humans measure AI progress - and intelligence in general - on a really uneven scale. It’s easy to tell when an AI (or a person) is less smart than you, because you can just see them making mistakes. It’s very hard to tell if they’re smarter, because in that case you’re the one making mistakes. 
### People are bad at judging intelligence

Secondly, **intuitions about the speed of AI progress [are weird and unreliable](/are-new-models-good)**. Humans measure AI progress - and intelligence in general - on a really uneven scale. It’s easy to tell when an AI (or a person) is less smart than you, because you can see them making mistakes. It’s very hard to tell if they’re smarter, because in that case you’re the one making the mistakes. You have to rely on more subtle context clues: do they get better long-term results than you? Do they often confuse you in situations where you later end up agreeing with them? And so on.

The jump from GPT-3 to GPT-4 seemed *huge* because GPT-3 was dumber than almost all humans, while GPT-4 was sometimes as smart as a human. However, frontier models are now smart enough to be in the realm of ambiguity on many topics. It’s thus much harder to tell the “real” rate at which they’re getting smarter. Maybe the rate of growth of “raw intelligence” really has slowed down! I don’t know how we’d be in a position to know for sure.

### Intelligence is not the sole determinant of capability

Thirdly, **many traits other than intelligence determine the capabilities of AI models**. Take the jump in October last year, when OpenAI and Anthropic models suddenly became “agentic” (i.e. they could reliably perform complex tasks end-to-end). That might be intelligence, but it might also just be greater working memory, or more rote familiarity with the basic tools of an LLM harness, or more ability to attend to the context window, or even simply a [personality](/ai-personality-space/) more suited to tools like Claude Code or Codex. Of course, all of these traits are plausibly “intelligence”. But they’re traits you might instil with various clever tricks (or even just by tweaking the system prompt), not by brute-forcing more FLOPs.

It’s illustrative here to consider the mistake made by Apple’s infamous [*The Illusion of Thinking*](/illusion-of-thinking/) paper, where the researchers asked various models to brute-force Tower of Hanoi puzzles with different numbers of disks, using the results to score how good the models were at reasoning. But when you read the output, all of the failures were cases of the model realizing that many hundreds of steps were required and refusing to even try. These same models could trivially write code to perform the steps, or correctly go through any smaller subset of the steps. The problem wasn’t intelligence, it was *persistence*: the models lacked the willingness to dig in and keep powering through steps until they got to an answer[^5].
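To make “trivially write code” concrete: the full Tower of Hanoi solution is a textbook recursion. The catch is the step count - moving *n* disks takes 2^*n* − 1 moves - which is exactly the part the models refused to grind through:

```python
def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023, i.e. 2**10 - 1
```

Ten lines of code generate all 1,023 moves for ten disks in microseconds. Writing those moves out one token at a time, without ever losing your place, is a test of persistence, not reasoning.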
### Final thoughts

Even inside an AI lab, I don’t think anyone has a good understanding of how many “real” FLOPs are being thrown at a training run (i.e. FLOPs that aren’t wasted on bugs). We also don’t have a clear sense of whether AI progress really is slowing down or not. Mythos seems impressive, and coding agents are really good now, but once the models get close to human intelligence, progress becomes really tricky to measure. Finally, almost everyone judges intelligence by capabilities, but capabilities are produced by a constellation of many traits; intelligence is just one of them.

I think this stuff is really complicated. A general theory like “RL takes more FLOPs per reward as tasks get longer, therefore training will gradually slow down” sounds good, but in practice AI development is dominated by lightning strikes: silly bugs that make training a hundred times worse, clever ideas that make models a hundred times more useful, and spiky capabilities that produce dazzling results in some areas and zero improvement in others. We are still [very early](/ai-and-informal-science/).

---

[^1]: If you’re reading this you probably know who Dwarkesh is, but if you don’t: he’s a well-known tech-adjacent podcaster whose gimmick is that he does extensive research on each guest and asks specific technical questions.

[^2]: A FLOP is a single floating-point operation (one multiply or one add). A matrix multiplication is made up of many FLOPs, so FLOP counts are a rough proxy for “time on a GPU”.

[^3]: I saw this in a tweet and only realized that the source was Dwarkesh while researching this post.

[^4]: What if AI progress stalls for technical reasons and everyone gives up on training new models? In that world, open-source models will *eventually* catch up, and AI labs won’t be in a privileged position.

[^5]: Incidentally, this is my pet theory about why models got much better at agentic tasks last year: training on longer and longer agentic traces meant that models started to “believe they could do it”, which made them much less likely to just give up, take shortcuts, or refuse to continue.