{"componentChunkName":"component---src-templates-blog-post-js","path":"/ai-inference-is-obviously-profitable/","result":{"data":{"site":{"siteMetadata":{"title":"sean goedecke"}},"markdownRemark":{"id":"186e4305-d534-55c5-a216-3257d619c818","excerpt":"Many people claim that AI inference is unprofitable to serve, and thus must be subsidized by an ocean of dumb money from investors who believe that some future…","html":"<p>Many people <a href=\"https://www.wheresyoured.at/why-everybody-is-losing-money-on-ai/\">claim</a> that AI inference is unprofitable to serve, and thus must be subsidized by an ocean of dumb money from investors who believe that some future AI model will come to dominate the world economy. When that dumb money goes away, so will AI products. According to this view, LLMs are just inherently too expensive (in terms of money, power, and water) to be used in consumer products. In fact, they can only be used today by externalizing the costs: money onto VC funds and now retail ETF <a href=\"https://www.investopedia.com/spacex-stock-joins-major-index-funds-what-regular-investors-need-to-know-spcx-ipo-vanguard-blackrock-vti-itot-12004764\">investors</a>, power onto electric utility <a href=\"https://salatainstitute.harvard.edu/how-you-subsidize-big-tech-with-your-electricity-bill/\">consumers</a>, and water onto the <a href=\"https://theconversation.com/5-ways-data-centers-endanger-their-local-communities-and-the-country-as-a-whole-282348\">communities</a> where datacenters are built.</p>\n<p>There are <a href=\"/is-ai-wrong/\">good reasons</a> to dislike AI, but this really isn’t one of them. 
In fact, <strong>AI inference is obviously profitable</strong>.</p>\n<h3>Doing the math demonstrates that inference is profitable</h3>\n<p>Frontier AI providers are reporting 70%-80% <a href=\"https://www.morningstar.com/stocks/anthropics-gross-margin-is-most-important-number-tech\">gross</a> <a href=\"https://www.saastr.com/have-ai-gross-margins-really-turned-the-corner-the-real-math-behind-openais-70-compute-margin-and-why-b2b-startups-are-still-running-on-a-treadmill/\">margins</a> on inference, but maybe we can’t trust them. Let’s make some very rough estimates of the actual cost.</p>\n<p>An Nvidia A100 consumes 400W of power under full load. In practice, even a carefully-tuned inference server will not be at full load all the time, so 400W is at least an upper bound. Suppose you’re running a dense 70B model<sup id=\"fnref-1\"><a href=\"#fn-1\" class=\"footnote-ref\">1</a></sup>, which will <a href=\"https://dlewis.io/evaluating-llama-33-70b-inference-h100-a100/\">fit</a> comfortably (unquantized) on four A100s and generate around 2M tokens per hour. Four A100s draw 1.6kW, which at industrial power prices costs about 13c/hr in the <a href=\"https://www.eia.gov/electricity/monthly/update/end-use.php\">USA</a>. Suppose (pessimistically) that cooling costs the same again. That’s about 26c/hr for 2M tokens, or about 13 cents per million output tokens<sup id=\"fnref-2\"><a href=\"#fn-2\" class=\"footnote-ref\">2</a></sup>.</p>\n<p>Let’s amortize the cost of the GPUs, since that’s going to be the most expensive part. An A100 costs about $20k, so four cost about $80k. If each A100 lasts around five years<sup id=\"fnref-3\"><a href=\"#fn-3\" class=\"footnote-ref\">3</a></sup>, you’ll have to make $16k/yr in profit to recoup your capital investment (or $1.80 per hour). At lower utilization, it’ll take longer to recoup, but your GPUs will also last longer. 
Either way, your overall inference costs come to about one dollar per million output tokens: roughly 90 cents of amortized hardware plus 13 cents of power and cooling.</p>\n<p>GPT-5.4-mini <a href=\"https://openai.com/business/pricing/#api\">charges</a> $4.50 per million tokens, and stronger OpenAI or <a href=\"https://platform.claude.com/docs/en/about-claude/pricing\">Anthropic</a> models are three to six times as expensive. It’s hard to make a direct comparison because we don’t know the size of OpenAI or Anthropic models, but the claimed 70% or 80% profit margin is extremely plausible.</p>\n<h3>Open LLMs demonstrate that inference is profitable</h3>\n<p>What if you don’t trust my estimates either? Let’s look at the pricing of open-weights Chinese LLMs. DeepSeek have <a href=\"https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md\">claimed</a> a bit over 80% profit margin on inference for DeepSeek-R1. Since their API pricing for R1 is less than half that of OpenAI or Anthropic<sup id=\"fnref-4\"><a href=\"#fn-4\" class=\"footnote-ref\">4</a></sup>, that suggests my inference cost estimates above might be too high. There are several plausible reasons: cooling at scale is probably <a href=\"https://massedcompute.com/faq-answers/?question=What%20are%20the%20estimated%20annual%20power%20consumption%20costs%20of%20NVIDIA%20A100%20and%20H100%20GPUs%20in%20a%20typical%20data%20center?\">cheaper</a> than power, R1 only has half the active parameters of a dense 70B model, modern GPUs are more efficient than the A100, and there are significant <a href=\"/inference-batching-and-deepseek/\">economies of scale</a> in inference.</p>\n<p>Since DeepSeek’s models are available for anyone to download, they can’t get away with extracting a large profit margin. One of the other inference providers would undercut them with the same model. 
Market prices for DeepSeek-V4-Pro inference are around 87 cents per million output tokens, which is probably pretty close to the actual cost of serving the model.</p>\n<h3>For AI labs, inference must subsidize training</h3>\n<p>None of this means that <em>OpenAI</em> or <em>Anthropic</em> are profitable. Those companies are making huge capital <a href=\"https://openai.com/index/building-the-compute-infrastructure-for-the-intelligence-age/\">investments</a> that may or may not pan out, and are spending enormous amounts of money on talent and compute to train brand-new models and retain users.</p>\n<p>They’re doing crazy things like offering per-month subscription plans for nearly unlimited inference, which is almost certainly not profitable. If you used an API key instead of your Anthropic subscription in Claude Code, you’d pay ten times as much. But that doesn’t mean API-based Claude Code couldn’t be a good deal. Some people are <a href=\"https://www.reddit.com/r/opencodeCLI/comments/1tril88/test_of_prices_of_deepseek_in_opencode_go_and_api/\">already using</a> DeepSeek’s inference API for agentic coding, because once you take away the huge profit margin it’s cheaper than the equivalent per-month subscription.</p>\n<p>Why won’t OpenAI or Anthropic lower their prices? Supposedly OpenAI has <a href=\"https://www.wsj.com/tech/ai/openai-considers-drastic-price-cuts-anticipating-war-for-users-with-anthropic-9b8c178e\">thought about it</a>, but for an AI lab, <strong>inference has to subsidize training costs</strong>. A company like OpenAI has to fund the production of new models from the inference margins on existing models (at least partially). That’s why the margins on inference are so high: the AI labs are trying to squeeze out every dollar so they can stay alive in the training arms race.</p>\n<p>However, inference only has to subsidize training costs <strong>for an AI lab</strong>. 
If you’re merely an inference provider, you don’t have to do any training at all. Therefore, even if OpenAI and Anthropic go out of business, whoever snaps up the rights to their frontier models will be able to continue selling Opus and GPT inference at a profit<sup id=\"fnref-5\"><a href=\"#fn-5\" class=\"footnote-ref\">5</a></sup>. The AI bubble popping will not mean the end of the inference business, because <strong>AI inference is obviously profitable</strong>.</p>\n<div class=\"footnotes\">\n<hr>\n<ol>\n<li id=\"fn-1\">\n<p>Expensive frontier models are probably mixture-of-experts, not dense, which is tougher to estimate. However, I think a 70B dense model and a MoE with 70B active params will come out to basically the same numbers at scale (though the MoE will require more GPU memory and thus a greater upfront cost). Are frontier models around 70B params? Nobody outside the AI labs really knows, but my guess is that 70B is probably larger than a Haiku/mini class model.</p>\n<a href=\"#fnref-1\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-2\">\n<p>I think it’s reasonable to estimate the cost of output tokens only, since they’re by far the most expensive part of serving inference. Input tokens are cheaper for two reasons: transformers let you prefill them in parallel, and for most real-world use cases they can be aggressively cached in the KV cache.</p>\n<a href=\"#fnref-2\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-3\">\n<p>It’s common (and wrong) to estimate GPU lifespan at three years. I wrote a lot about this in <a href=\"/ai-gpus-live-longer-than-three-years/\"><em>AI GPUs probably live longer than three years</em></a>. 
</p>\n<a href=\"#fnref-3\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-4\">\n<p>Again, this is just a guess, since we don’t know which OpenAI or Anthropic model is equivalent in size to R1.</p>\n<a href=\"#fnref-4\" class=\"footnote-backref\">↩</a>\n</li>\n<li id=\"fn-5\">\n<p>I do wonder if Anthropic would be able to prevent other people from being able to access the model if the company goes out of business. Anthropic is currently in <a href=\"https://www.bloomberg.com/news/articles/2026-06-02/broadcom-backing-lowers-debt-costs-on-36-billion-anthropic-deal\">debt</a> to Broadcom, Google, and a bunch of private equity firms. Would they get the Mythos and Opus weights, over Dario’s protestations?</p>\n<a href=\"#fnref-5\" class=\"footnote-backref\">↩</a>\n</li>\n</ol>\n</div>","frontmatter":{"title":"AI inference is obviously profitable","description":null,"date":"June 26, 2026","tags":["ai","bubble"]}}},"pageContext":{"slug":"/ai-inference-is-obviously-profitable/","previous":{"slug":"/ai-gpus-live-longer-than-three-years/","title":"AI GPUs probably live longer than three years"},"next":null,"preview":{"slug":"/ai-gpus-live-longer-than-three-years/","title":"AI GPUs probably live longer than three years","snippetHtml":"<p><a href=\"https://www.wheresyoured.at/ai-is-slowing-down\">People</a> who think current AI use is unsustainable often rely on the <a href=\"https://www.tomshardware.com/pc-components/gpus/datacenter-gpu-service-life-can-be-surprisingly-short-only-one-to-three-years-is-expected-according-to-unnamed-google-architect\">claim</a> that inference GPUs only last “three years at the most” under load. The idea here is that once the AI bubble money drains away, current infrastructure will rapidly become obsolete, and there won’t be enough money floating around to buy a whole slate of brand-new GPUs. 
Inference costs would thus rapidly become way too expensive for current AI products to make any financial sense.<br /><a href=\"/ai-gpus-live-longer-than-three-years/\">Continue reading...</a></p>"}}},"staticQueryHashes":["1146911855","3764592887"]}