{"componentChunkName":"component---src-templates-blog-post-js","path":"/the-o3-geoguessr-prompt-did-not-work/","result":{"data":{"site":{"siteMetadata":{"title":"sean goedecke"}},"markdownRemark":{"id":"5ac6571c-3382-52fe-9945-b4cd30e18fe2","excerpt":"In April last year, Kelsey Piper discovered that OpenAI’s o3 model was surprisingly good at figuring out where a photo was taken from. Like human “geoguessr…","html":"<p>In April last year, Kelsey Piper <a href=\"https://x.com/KelseyTuoc/status/1917340813715202540\">discovered</a> that OpenAI’s o3 model was surprisingly good at figuring out where a photo was taken from. Like human “geoguessr” <a href=\"https://www.youtube.com/@georainbolt\">pros</a>, o3 could sometimes take a nondescript photo of a beach and tell you exactly where it is. Here’s the example Kelsey gave:</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 590px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/4113a246112c8b6db424a58af58a9a90/4d836/kelsey-geoguessr.jpg\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 130.40540540540542%; position: relative; bottom: 0; left: 0; background-image: 
url('data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAaABQDASIAAhEBAxEB/8QAGQAAAwADAAAAAAAAAAAAAAAAAAIEAQMF/8QAFgEBAQEAAAAAAAAAAAAAAAAAAAID/9oADAMBAAIQAxAAAAHdTzKdIsIQiExlbCB//8QAGRAAAwEBAQAAAAAAAAAAAAAAAQIRABAg/9oACAEBAAEFAleZTWJGDrQ95dfP/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAwEBPwEf/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAgEBPwEf/8QAGBAAAgMAAAAAAAAAAAAAAAAAABARMFH/2gAIAQEABj8CJW0//8QAHBABAAIDAAMAAAAAAAAAAAAAAQARECFRMWFx/9oACAEBAAE/IWRyE1Ja8k2FsKNWHqL9wv1luDH/2gAMAwEAAgADAAAAECAhz//EABQRAQAAAAAAAAAAAAAAAAAAACD/2gAIAQMBAT8QH//EABURAQEAAAAAAAAAAAAAAAAAABEg/9oACAECAQE/EGP/xAAbEAEAAwEBAQEAAAAAAAAAAAABABExIWGRUf/aAAgBAQABPxBwXjRnYAMagGIP2U1l6UQmIDkWufUER60Xgxu1i9mWK/s//9k='); background-size: cover; display: block;\"\n  ></span>\n  <img\n        class=\"gatsby-resp-image-image\"\n        alt=\"geo\"\n        title=\"geo\"\n        src=\"/static/4113a246112c8b6db424a58af58a9a90/1c72d/kelsey-geoguessr.jpg\"\n        srcset=\"/static/4113a246112c8b6db424a58af58a9a90/a80bd/kelsey-geoguessr.jpg 148w,\n/static/4113a246112c8b6db424a58af58a9a90/1c91a/kelsey-geoguessr.jpg 295w,\n/static/4113a246112c8b6db424a58af58a9a90/1c72d/kelsey-geoguessr.jpg 590w,\n/static/4113a246112c8b6db424a58af58a9a90/a8a14/kelsey-geoguessr.jpg 885w,\n/static/4113a246112c8b6db424a58af58a9a90/4d836/kelsey-geoguessr.jpg 920w\"\n        sizes=\"(max-width: 590px) 100vw, 590px\"\n        style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n        loading=\"lazy\"\n      />\n  </a>\n    </span></p>\n<p>Several people <a href=\"https://www.astralcodexten.com/p/testing-ais-geoguessr-genius\">reproduced this</a> with good results: not a 100% success rate, but clearly <em>far</em> better than you’d do with a random human guess. The lesson here is that <strong>model capabilities can surprise us</strong>. 
The o3 model had been out for two weeks before Kelsey’s tweet without anyone noticing how good it was at geolocation. What obscure capabilities did we never find? What capabilities of current models are we missing today?</p>\n<p>Some people drew <a href=\"https://newsletter.angularventures.com/p/ai-s-geoguessr-genius-and-the-art-of-prompting-well\">another</a> <a href=\"https://www.reddit.com/r/singularity/comments/1kep2bp/comment/mqlvv1a/\">lesson</a> from this: that “prompt engineering” can unlock brand-new capabilities. This is because Kelsey had a <a href=\"https://raw.githubusercontent.com/sgoedecke/ai_geolocation/refs/heads/main/prompts/geoguessr_protocol.txt\">magic prompt</a> that she built over time. When o3 got something wrong, she would ask it how it could have avoided the mistake, and then include the answer in the prompt. Here’s the first 10% of that prompt, so you get the idea:</p>\n<blockquote>\n<p>You are playing a one-round game of GeoGuessr. Your task: from a single still image, infer the most likely real-world location. Note that unlike in the GeoGuessr game, there is no guarantee that these images are taken somewhere Google’s Streetview car can reach: they are user submissions to test your image-finding savvy. Private land, someone’s backyard, or an offroad adventure are all real possibilities (though many images are findable on streetview). Be aware of your own strengths and weaknesses: following this protocol, you usually nail the continent and country…</p>\n</blockquote>\n<p>This prompt impressed a lot of people, who <a href=\"https://www.reddit.com/r/singularity/comments/1kep2bp/comment/mqo3yzz/\">tried</a> <a href=\"https://www.thealgorithmicbridge.com/p/upload-a-picture-to-chatgpt-itll\">it</a> <a href=\"https://www.astralcodexten.com/p/testing-ais-geoguessr-genius\">out</a> and reported that it correctly identified a lot of images. 
But of course, o3 correctly identified a lot of images with just a basic “think carefully about where this picture was taken?” prompt. Did the prompt actually help? It’d be tough to figure that out just from playing around in ChatGPT. You’d need to build an evaluation set of images and run o3 against them twice: once with the fancy prompt and once without it.</p>\n<p>So <a href=\"https://github.com/sgoedecke/ai_geolocation/tree/main\">that’s what I did</a>. I pulled 200 images from Wikimedia Commons, Geograph Britain and Ireland, and iNaturalist for the benchmark. You can read the AI-generated summary <a href=\"https://github.com/sgoedecke/ai_geolocation/blob/main/results/dataset_mixed_200_o3_high_report.md\">here</a>, but here’s the key table:</p>\n<table>\n<thead>\n<tr>\n<th>Prompt</th>\n<th align=\"right\">n</th>\n<th align=\"right\">Median km</th>\n<th align=\"right\">Mean km</th>\n<th align=\"right\">P25 km</th>\n<th align=\"right\">P75 km</th>\n<th align=\"right\">&#x3C;=25 km</th>\n<th align=\"right\">&#x3C;=100 km</th>\n<th align=\"right\">&#x3C;=500 km</th>\n<th align=\"right\">&#x3C;=1000 km</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Default</td>\n<td align=\"right\">200</td>\n<td align=\"right\"><strong>83.2</strong></td>\n<td align=\"right\"><strong>440.7</strong></td>\n<td align=\"right\"><strong>16.4</strong></td>\n<td align=\"right\"><strong>221.9</strong></td>\n<td align=\"right\">58</td>\n<td align=\"right\"><strong>109</strong></td>\n<td align=\"right\"><strong>176</strong></td>\n<td align=\"right\"><strong>182</strong></td>\n</tr>\n<tr>\n<td>GeoGuessr prompt</td>\n<td align=\"right\">200</td>\n<td align=\"right\">102.3</td>\n<td align=\"right\">481.9</td>\n<td align=\"right\">18.5</td>\n<td align=\"right\">277.8</td>\n<td align=\"right\"><strong>59</strong></td>\n<td align=\"right\">99</td>\n<td align=\"right\">172</td>\n<td align=\"right\">180</td>\n</tr>\n</tbody>\n</table>\n<p>In general, the basic prompt did better on average. 
It guessed closer to the actual location on every metric except the count of guesses within 25 km (58 versus 59). Both prompts did pretty well, actually. Despite the fancy prompt being 10x larger, it only caused o3 to think for slightly longer (about one second on average, though the max was about double, at 10 minutes instead of 5 minutes). The images in my benchmark were fairly generic geoguessr-style outdoor images, with twelve indoor images thrown in for an extra challenge (the fancy prompt also did slightly worse on these).</p>\n<p>What’s going on? I think this shows <strong>how easy it is to fool yourself about the quality of prompting</strong>. When the model is already pretty good at a task, you can give it a very elaborate prompt without impacting performance. It’ll still be pretty good, except this time it’s good <em>because of what you did</em>. This is particularly true if you’re iterating with the model and asking it “what should I add to the prompt” for each mistake. Models will happily make up stories for you about their own reasoning processes, and will almost always say “yes, that helped a lot!” when you ask them if a particular prompt tweak made things better. The only way to actually know is by constructing some kind of benchmark<sup id=\"fnref-1\"><a href=\"#fn-1\" class=\"footnote-ref\">1</a></sup>.</p>\n<p>It’s also interesting to me that nobody checked this at the time. It took me about six hours of fairly distracted work and about $15 to construct and run this benchmark. Why didn’t anyone do this when they were writing articles about how good the o3 prompt was?</p>\n<p>One charitable reason might be that the story was more about o3’s real geolocation ability than about the magic prompt. o3 also used to be about five times more expensive (though a benchmark of 40 images instead of 200 would still have cast doubt on how much weight the prompt was carrying). Also, AI just moves so <em>fast</em>. 
Geolocation was only the story for about a week: after that, GPT-4o’s <a href=\"/ai-sycophancy\">sycophancy</a> was what people were talking about. Another reason is that AI tooling wasn’t as good then. The benchmark was so easy for me to run because GPT-5.5 did most of the heavy lifting. Prior to strong agents, you would have had to write the (simple) benchmark yourself. I can’t point the finger too hard: I didn’t bother at the time either.</p>\n<p>Maybe my benchmark isn’t very good? The photos look reasonable enough: a wide variety of geoguessr-like shots of roads and landscapes, mostly. I could have tried to gather a few thousand photos instead of a few hundred, but if the magic prompt really was a big improvement you’d still expect to see that manifest on a benchmark this size. If someone wants to go and build a hundred-dollar geolocation benchmark instead of my fifteen-dollar one, I think that’d be an interesting project.</p>\n<p>Finally, let’s use the benchmark to answer a question I’ve had for a while: do gpt-5.4 and gpt-5.5 have o3’s geolocation abilities? 
The answer, apparently, is no.</p>\n<table>\n<thead>\n<tr>\n<th>Run</th>\n<th align=\"right\">Median km</th>\n<th align=\"right\">Mean km</th>\n<th align=\"right\">&#x3C;=25 km</th>\n<th align=\"right\">&#x3C;=100 km</th>\n<th align=\"right\">&#x3C;=500 km</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><strong>o3 default</strong></td>\n<td align=\"right\"><strong>83.2</strong></td>\n<td align=\"right\"><strong>440.7</strong></td>\n<td align=\"right\">58</td>\n<td align=\"right\"><strong>109</strong></td>\n<td align=\"right\"><strong>176</strong></td>\n</tr>\n<tr>\n<td>o3 GeoGuessr</td>\n<td align=\"right\">102.3</td>\n<td align=\"right\">481.9</td>\n<td align=\"right\"><strong>59</strong></td>\n<td align=\"right\">99</td>\n<td align=\"right\">172</td>\n</tr>\n<tr>\n<td>gpt-5.4 default</td>\n<td align=\"right\">163.3</td>\n<td align=\"right\">638.9</td>\n<td align=\"right\">26</td>\n<td align=\"right\">74</td>\n<td align=\"right\">148</td>\n</tr>\n<tr>\n<td>gpt-5.5 default</td>\n<td align=\"right\">156.5</td>\n<td align=\"right\">645.9</td>\n<td align=\"right\">39</td>\n<td align=\"right\">77</td>\n<td align=\"right\">161</td>\n</tr>\n</tbody>\n</table>\n<p>Whatever o3 had that made it good at this task hasn’t transferred to newer models. 
</p>\n<div class=\"footnotes\">\n<hr>\n<ol>\n<li id=\"fn-1\">\n<p>Benchmarks can mislead as well, but they’re better than just vibes.</p>\n<a href=\"#fnref-1\" class=\"footnote-backref\">↩</a>\n</li>\n</ol>\n</div>","frontmatter":{"title":"The famous o3 \"GeoGuessr\" prompt did not work","description":null,"date":"May 21, 2026","tags":["ai"]}}},"pageContext":{"slug":"/the-o3-geoguessr-prompt-did-not-work/","previous":{"slug":"/prompts-are-technical-debt-too/","title":"Prompts are technical debt too"},"next":null,"preview":{"slug":"/prompts-are-technical-debt-too/","title":"Prompts are technical debt too","snippetHtml":"<p>It’s <a href=\"https://www.tokyodev.com/articles/all-code-is-technical-debt\">common</a> and correct to say that “all code is technical debt”. Adding code is a necessary evil for developing new features: you almost always have to do it, but each line of code adds to the complexity and maintenance burden of the system. All future changes to the system have to work with the existing code, or at least avoid breaking it. Once systems accumulate enough code, they become impossible for a single person to understand: instead of reading the code and understanding what it does, you must rely on guesses, theories and heuristics. Sensible engineers write as little code as possible.<br /><a href=\"/prompts-are-technical-debt-too/\">Continue reading...</a></p>"}}},"staticQueryHashes":["1146911855","3764592887"]}