OpenAI’s compute margin hit ~70%—but B2B AI apps still face rising per-task costs. Learn the margin tactics that actually work.

AI compute margins: why startups still feel squeezed
OpenAI reportedly pushed its compute margin to ~70% in October 2025, up from ~35% in January 2024. That’s not a rounding error—it’s the difference between “this looks like a services business” and “this is starting to look like software economics.”
But here’s the part most founders don’t want to hear: better foundation-model margins don’t automatically mean better margins for AI apps. If you’re building AI-powered digital services in the U.S.—especially B2B SaaS—the treadmill is still running. Your customers expect reasoning, agents, and deep workflow automation, and those features often increase total tokens per task by 10× to 100× even as per-token prices fall.
This post is part of our “AI in Cloud Computing & Data Centers” series, so we’ll stay grounded in the real driver behind all of this: infrastructure economics—GPU utilization, model routing, workload management, and how cloud spend becomes (or destroys) margin.
OpenAI’s 70% compute margin is real—and it changes the baseline
The simplest way to interpret the reported 70% compute margin is: the cost to serve a dollar of model revenue has dropped sharply at the foundation layer. That matters for the U.S. tech ecosystem because it signals that large-scale AI services are becoming operationally sustainable—at least for certain model tiers and traffic patterns.
What improved margins usually mean in the data center
At a practical level, margin improvements like this typically come from a mix of:
- Higher GPU utilization (less idle time, better batching, smarter scheduling)
- Model and systems optimization (kernels, quantization, caching, speculative decoding)
- Traffic shaping and tiering (priority tiers, throttling, routing by latency/quality)
- Hardware and procurement leverage (better deals, longer amortization, custom stacks)
If you run a cloud platform team, this is familiar: the “AI tax” is mostly a resource allocation problem. The labs that win are the ones that treat inference like a first-class data center product, not a research demo.
Why this doesn’t “trickle down” to B2B SaaS automatically
The foundation layer can improve margins while application companies stay stuck because:
- Price drops hit older models first (the ones customers stop accepting)
- Frontier features consume more compute per request (reasoning, tool use, multi-step planning)
- Competitive pressure forces upgrades even when the current product “works fine”
So yes, the baseline is improving. But the application layer has to actively convert those improvements into unit economics.
The treadmill problem: per-token gets cheaper, per-task gets pricier
The core dynamic crushing many AI software gross margins is straightforward:
Per-token costs are falling. Total tokens per task are rising faster.
Agentic workflows and reasoning modes tend to:
- generate intermediate “thinking” output,
- run multiple sub-queries,
- call tools and re-rank results,
- and retry when confidence is low.
The source reporting highlights a concrete pattern: a reasoning-heavy model can produce ~603 tokens where a simpler model produces ~60 on a similar task, a roughly 10× jump driven by token bloat, not added value.
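To see why cheaper tokens don't guarantee cheaper tasks, here's a minimal back-of-the-envelope sketch in Python. The prices and token counts are illustrative assumptions, not quotes from any provider:

```python
# Illustrative only: per-token price falls ~4x, but an agentic workflow
# uses ~10x more tokens per task, so per-task cost still goes up.

old_price_per_1k_tokens = 0.03    # assumed legacy model price (USD per 1K tokens)
new_price_per_1k_tokens = 0.0075  # assumed newer model price (USD per 1K tokens)

old_tokens_per_task = 600    # single-shot completion
new_tokens_per_task = 6_000  # reasoning + tool calls + retries

old_cost = old_tokens_per_task / 1_000 * old_price_per_1k_tokens
new_cost = new_tokens_per_task / 1_000 * new_price_per_1k_tokens

print(f"old per-task cost: ${old_cost:.4f}")                 # $0.0180
print(f"new per-task cost: ${new_cost:.4f}")                 # $0.0450
print(f"per-task cost change: {new_cost / old_cost:.1f}x")   # 2.5x more expensive
```

Even with a 4× price cut per token, the task got 2.5× more expensive. That's the treadmill.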
What this looks like inside AI-powered SaaS
I’ve seen this show up in product analytics as “quality improvements” that quietly explode cost:
- A support copilot that goes from single response → multi-step troubleshooting
- A sales assistant that goes from summary → account research + email + follow-up plan
- A dev tool that goes from autocomplete → plan + code + tests + refactor
Each step feels like a better user experience. But each step is also more inference.
If you don’t have cost controls tied to workflows, you end up shipping margin erosion as a feature.
AI gross margins in B2B: what “healthy” looks like now
Traditional SaaS investors loved 75–80% gross margins because hosting costs were modest and scaled nicely. AI software breaks that assumption.
Bessemer's 2025 benchmarks, cited in the source reporting, show how far the goalposts have moved:
- Fast-growing AI “Supernovas” averaging ~25% gross margin early
- More stable AI companies closer to ~60% gross margin
- Some AI companies with negative gross margins (rare in classic software)
That’s a huge reframing for U.S. startup operators.
A practical gross margin benchmark for AI apps
For many AI-enabled B2B products in 2026 planning cycles, a more realistic set of internal targets looks like:
- Early scale ($0–$5M ARR): 30–55% GM while you're still learning usage patterns
- Growth ($5M–$25M ARR): 55–70% GM if routing/pricing is under control
- At scale: 70%+ GM only if you have defensible non-inference margin or proprietary inference
The key isn’t hitting a perfect number today. It’s whether your margin improves with volume or gets worse.
Cloud economics: why AI changes workload management priorities
AI inference doesn’t behave like typical web SaaS traffic. In data centers, it introduces constraints that alter how you plan capacity and reliability.
Inference is “costly latency”
With standard SaaS, you can often buy performance with caching, CDNs, and stateless scaling. With LLM inference:
- Low latency can require more parallelism (more GPUs per unit output)
- Spiky demand creates expensive headroom
- Tail latency can trigger retries, multiplying cost
This is why AI infrastructure optimization is now a board-level topic. Gross margin is increasingly determined by whether you can keep GPUs busy while meeting latency SLOs.
The hidden data center bill: failures and retries
One under-discussed margin killer is the cost of “non-productive tokens”:
- tool-call failures
- rate-limit fallbacks
- prompt bugs that cause loops
- guardrail false positives that trigger reruns
If you don’t instrument “tokens per successful task,” you’ll misread your spend.
Four tactics that actually improve AI gross margin (without killing quality)
The application layer can win, but only with an explicit margin strategy. These are the approaches that consistently work.
1) Intelligent model routing (your margin engine)
The best AI SaaS companies treat model choice like a real-time decision, not a product philosophy.
A routing layer typically:
- sends low-risk tasks to cheaper models
- escalates only when confidence is low
- uses evaluation scores (not vibes) to decide
A simple policy I like:
- Default: “good enough” model
- Escalate: only when the user is about to act (send, publish, deploy, close)
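A minimal sketch of that policy (Python; the model names, the confidence score, and the high-stakes action list are illustrative assumptions, not a specific provider's API):

```python
# Hypothetical routing policy: default to a cheap model, escalate only when
# the user is about to act or an eval score says the cheap model struggles.

CHEAP_MODEL = "small-model"        # assumed: handles the common case
FRONTIER_MODEL = "frontier-model"  # assumed: reserved for high-stakes requests

HIGH_STAKES_ACTIONS = {"send", "publish", "deploy", "close"}

def pick_model(task: dict) -> str:
    """Route a task to a model tier based on risk and confidence."""
    # Escalate when the user is about to take an irreversible action.
    if task.get("action") in HIGH_STAKES_ACTIONS:
        return FRONTIER_MODEL
    # Escalate when an evaluation score (not vibes) says confidence is low.
    if task.get("eval_confidence", 1.0) < 0.7:
        return FRONTIER_MODEL
    return CHEAP_MODEL

# Usage: a draft stays cheap; the final "send" escalates.
print(pick_model({"action": "draft", "eval_confidence": 0.9}))  # small-model
print(pick_model({"action": "send",  "eval_confidence": 0.9}))  # frontier-model
```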
Routing is the difference between “AI feature” and “AI business.”
2) Pricing that matches the cost curve (unlimited is a trap)
If your COGS varies per customer, your pricing has to vary too. That’s why many AI companies now use mixed pricing (subscription + usage) and why “unlimited” tiers are disappearing.
A workable structure:
- Base subscription includes a healthy allowance
- Overage pricing is transparent and aligned to work units (tasks, actions), not raw tokens
- Heavy users move into a higher tier with better controls
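As a concrete illustration of that structure, here's a minimal billing sketch (Python; the base fee, allowance, and per-task overage price are placeholder assumptions, not recommended numbers):

```python
def monthly_bill(tasks_used: int,
                 base_fee: float = 500.0,       # assumed subscription price (USD)
                 included_tasks: int = 2_000,   # assumed allowance in the base tier
                 overage_per_task: float = 0.15) -> float:
    """Subscription plus transparent overage, priced per task rather than per raw token."""
    overage_tasks = max(tasks_used - included_tasks, 0)
    return base_fee + overage_tasks * overage_per_task

print(monthly_bill(1_500))  # 500.0  -> inside the allowance
print(monthly_bill(6_000))  # 1100.0 -> 4,000 overage tasks at $0.15 each
```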
If you can’t explain your cost drivers to a customer, you can’t sustainably charge for them.
3) Reduce tokens per task (workflow design beats prompt tweaks)
Most teams focus on prompt engineering. The bigger wins come from workflow redesign:
- Ask for less: constrain output formats, limit verbosity
- Reuse more: caching, shared embeddings, templated plans
- Stop early: don’t do “deep reasoning” if the user just needs a draft
Measure:
- tokens per successful task
- tokens per retained customer
- tokens per dollar of ARR
Those metrics create the right product behavior.
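A quick way to put those three numbers on a dashboard (Python; the monthly inputs are placeholders you'd pull from your own analytics):

```python
# Hypothetical monthly aggregates from your own analytics.
monthly_tokens = 180_000_000
successful_tasks = 120_000
retained_customers = 400
arr_dollars = 2_400_000

print(f"tokens per successful task:   {monthly_tokens / successful_tasks:,.0f}")    # 1,500
print(f"tokens per retained customer: {monthly_tokens / retained_customers:,.0f}")  # 450,000
print(f"tokens per dollar of ARR:     {monthly_tokens / arr_dollars:,.1f}")         # 75.0
```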
4) Proprietary or fine-tuned models (only when the math forces it)
The Cursor example from the source reporting is instructive: when API costs become existential, some companies build in-house models to regain margin control.
Most B2B startups won’t replicate a nine-figure training program. The pragmatic version is:
- fine-tune an open model for your narrow workflow
- serve it for the “common case”
- keep a frontier model for edge cases
That hybrid approach often delivers the biggest margin lift per engineering hour.
A margin sanity-check you can run in one afternoon
If you’re running AI-powered digital services, here’s a quick framework to assess whether you’re on the treadmill.
Step 1: Compute “AI COGS per active account”
Track monthly:
- model/API spend
- inference-related cloud spend (GPUs, vector DB, tool infra)
Then divide the total by active accounts.
If that number rises faster than ARPA (average revenue per account), your margins will compress.
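A minimal version of that calculation (Python; all dollar figures are placeholder assumptions you'd replace with your own billing and cloud data):

```python
# Hypothetical monthly figures.
model_api_spend = 42_000.0        # USD paid to model providers
inference_cloud_spend = 18_000.0  # GPUs, vector DB, tool infra (USD)
active_accounts = 300
arpa = 450.0                      # average revenue per account (USD / month)

ai_cogs_per_account = (model_api_spend + inference_cloud_spend) / active_accounts
print(f"AI COGS per active account: ${ai_cogs_per_account:,.0f}")        # $200
print(f"share of ARPA consumed:     {ai_cogs_per_account / arpa:.0%}")   # 44%
```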
Step 2: Break spend into “baseline” vs “premium quality”
Label traffic:
- baseline model requests
- reasoning/agent requests
- retries/fallbacks
Then ask: Which bucket is growing and why?
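A minimal sketch of that breakdown (Python; the bucket labels and spend figures are placeholder assumptions):

```python
# Hypothetical month-over-month inference spend by traffic bucket (USD).
spend = {
    "baseline":          {"last_month": 20_000, "this_month": 21_000},
    "reasoning_agent":   {"last_month":  8_000, "this_month": 14_000},
    "retries_fallbacks": {"last_month":  2_000, "this_month":  4_500},
}

for bucket, months in spend.items():
    growth = months["this_month"] / months["last_month"] - 1
    print(f"{bucket:<18} {growth:+.0%}")
# baseline           +5%
# reasoning_agent    +75%
# retries_fallbacks  +125%
```

In this made-up example, baseline traffic is fine; the reasoning and retry buckets are where margin is leaking.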
Step 3: Decide what you’re willing to subsidize
Pick one:
- subsidize acquisition (short-term) with a clear payback plan
- subsidize power users (rarely wise)
- subsidize premium tiers (fine, if you charge for it)
Indecision is the expensive option.
What this means for U.S. tech growth in 2026
OpenAI’s reported compute margin jump is a positive signal for the U.S. AI economy: it suggests the infrastructure layer is learning how to run AI services more efficiently inside modern data centers.
But for startups, the lesson is blunt: AI gross margin is a product decision, not a finance outcome. If you don’t build cost-aware workflows, model routing, and pricing that reflects usage, you’ll keep upgrading into higher costs—even while the underlying technology gets cheaper.
If you want help diagnosing your current AI COGS and designing a routing + pricing plan that fits your customers (without turning your product into a meter-running nightmare), that’s the moment to bring in an outside set of eyes. Most teams wait until gross margin becomes a board emergency.
The forward-looking question heading into 2026: Will your company be one of the AI apps that converts data center efficiency into profit—or one that burns it chasing the next model upgrade?