The closed-source LLM premium has collapsed
Open-source models now match closed-model benchmarks at 87% lower cost. Here's what that means for developers choosing between open source inference and OpenAI.
What was your first call to an LLM? Almost certainly something like this:
```typescript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-5.5",
  input: "Write a short bedtime story about a unicorn.",
});

console.log(response.output_text);
```

The entry point to LLMs is through the frontier. This is where everyone starts, but it’s also where too many stay. They build entire apps, workflows, and harnesses around variations on this call, then wince at their API bill come end-of-month.
That made sense when proprietary models were so dominant. If you wanted your product to excel, the underlying model had to be high quality, and high quality meant proprietary. So developers ate the cost for excellence.
Does that still hold? Not really. Yes, proprietary is still the frontier, but what was the frontier 18 months ago is now well-mapped territory, and open-source models are closing the gap at a fraction of the cost.
Why? And what should developers make of, and with, these newfound lands of open source?
What kept open source on the bench
We should be clear: open models aren’t close to taking over from proprietary ones. Uptake still lags. Open-source models are 87% cheaper at equal intelligence but still hold only 25-30% of the token share.
But that gap is no longer about quality. For years, the buying decision was binary: pay proprietary prices, or accept a worse model. Serious teams paid up. That habit is what keeps closed models dominant today, long after the quality gap that justified it closed.
Let’s take GPT-4 as our example. When it launched in 2023, GPT-4 cost $30 per million input tokens (and $60 per million output tokens), but it was also the only model that could actually do the work.
| Benchmark | GPT-4 | Llama 2 70B |
|---|---|---|
| MMLU (knowledge) | 86.4% | 68.9% |
| HumanEval (code) | 67.0% | 29.9% |
| GSM8K (math) | 92.0% | 56.8% |
(Sources: GPT-4 Technical Report and the Llama 2 paper)
Llama 2 wasn’t wildly behind the curve, but it was far enough behind to be a risk for anything production-grade. HumanEval was the rough one: a 37-point gap meant Llama 2 wasn't a real option for anything code-adjacent.
When you extrapolated this to all models, the trend was clear: closed models were “better.”
They were also more expensive. That was the opening for open-source models, but it came with a catch. Whatever you saved on tokens, you paid back in GPU plumbing. Llama 2 70B at fp16 needed roughly 140GB of VRAM, so two H100s at a minimum, or a more painful quantized setup with its own quality tradeoffs. Add to that the engineering resources to wire together your serving stack, and the costs start to equalize.
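The 140GB figure is plain arithmetic: parameter count times bytes per parameter. Here is a minimal sketch of that estimate, assuming weights only (no KV cache, activations, or framework overhead) and the 80 GB H100 variant. The function names are illustrative and not taken from any library.

```typescript
// Rough weights-only memory estimate.
// Assumptions: no KV cache, no activation memory, 1 GB = 1e9 bytes.
function weightMemoryGB(paramsBillions: number, bytesPerParam: number): number {
  // (paramsBillions * 1e9 params) * bytesPerParam / 1e9 bytes per GB
  return paramsBillions * bytesPerParam;
}

// Smallest number of GPUs whose combined memory holds the weights.
function minGpus(memoryGB: number, gpuMemoryGB: number): number {
  return Math.ceil(memoryGB / gpuMemoryGB);
}

const H100_MEMORY_GB = 80; // assumed: 80 GB H100

const fp16 = weightMemoryGB(70, 2);   // fp16 = 2 bytes per parameter
const int4 = weightMemoryGB(70, 0.5); // 4-bit quantization = 0.5 bytes per parameter

console.log(fp16, minGpus(fp16, H100_MEMORY_GB)); // 140 2
console.log(int4, minGpus(int4, H100_MEMORY_GB)); // 35 1
```

Real deployments need headroom beyond the weights, which is why two H100s is a floor, not a comfortable setup.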
But perhaps the biggest reason closed models won was that they weren’t just models. Closed labs shipped a steady drumbeat of product alongside the weights. Function calling, structured output, vision, file uploads, batch processing, fine-tuning, all wired together and versioned.
Here's a timeline of OpenAI releases over 18 months:
- Function calling, June 2023
- JSON mode, November 2023
- Vision (GPT-4V), November 2023
- Batch API, April 2024
- Structured outputs, August 2024
- Prompt caching, October 2024
- Realtime API, October 2024
Open source gives you weights. Everything that turns a model into a product, you build yourself.
But if there is one thing AI has taught us, it is that today isn’t tomorrow. A year ago, almost no one was doing serious coding with agents. Now, almost no one codes without them. AI is in constant flux, and underneath all this frontier progress, the economics keep drifting. Inference got roughly 10x cheaper per year. The capability lag compressed from 18 months to a few. Then three things hit at once:
- Open source started landing on frontier benchmarks.
- Frontier labs started raising prices.
- Agents broke the simple math.
The performance margin has disappeared
Let's take a look at Kimi.
Moonshot AI shipped Kimi K2 in July 2025, then Kimi K2.5 in February 2026, then Kimi K2.6 in April. The current version is a 1-trillion-parameter MoE with 32B active. The benchmarks land in territory that was Opus-only six months ago.
Kimi K2.6 is Moonshot AI's latest open-source model, with state-of-the-art coding, long-horizon execution, and agent swarm capabilities.
[Figure: benchmark chart comparing Kimi K2.6 with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Panels: General Agents (Humanity's Last Exam (Full) w/ tools, BrowseComp, DeepSearchQA (f1-score), Toolathlon, OSWorld-Verified); Coding (Terminal-Bench 2.0 (Terminus-2), SWE-Bench Pro, SWE-Multilingual); Visual Agents (MathVision w/ python, V* w/ python).]
Kimi is leading or keeping pace with top models. It's a model you can download, run on your own hardware, and deploy to production for the same tasks the frontier handles.
And the catch-up isn't unique to Kimi. DeepSeek V4 Pro, GLM-5.1, Qwen 3.6, and Mistral Medium 3.5 all posted frontier-tier benchmark results in Q1 2026. Why?
- Open labs are riding the slipstream. Frontier closed models do the expensive exploration. Open labs distill trajectories, learn from synthetic data, and post-train against patterns the frontier has already proven out. The first model to solve a problem pays the full cost. The second model pays a fraction.
- The architecture playbook is now public. MoE routing, long-context tricks, test-time reasoning, agent harnesses. Three years ago, these were lab secrets. Now they're papers, blog posts, and reference implementations on Hugging Face. Once a technique is in the open, the gap to implement it is weeks, not quarters.
- Compute is no longer the bottleneck it was. Training a frontier-class model in 2023 took an OpenAI-sized cluster. In 2026, a well-funded lab with a few thousand H100s can ship a competitive model in a single quarter. DeepSeek did it. Moonshot did it. Zhipu did it. The barrier dropped enough that "frontier-class" is no longer a one-company achievement.
The labs that ship these models also depend on a long tail of inference providers ready to host them on day one. Distillation pipelines, benchmark validation, and developer adoption all run through that layer.
There are still gaps. Closed models lead on the hardest reasoning and on the polish that comes from years of RLHF and red-teaming. But the lag is small, and the open labs say so themselves: DeepSeek puts its own V4 models 3 to 6 months behind the state of the art, beating last generation's flagships while trailing the current ones. For most production work, a few months of lag on the hardest problems doesn't change the decision.
Tokens are going to zero
Proprietary pricing is messy right now, and getting messier. The frontier labs don't really know where to set prices, and the last six months have been them figuring it out in public.
The trigger is agentic usage. A chat call burns a few hundred tokens. An agent can easily burn through millions. Claude Code Max users were extracting around $5,000 in usage from $200 monthly plans. Even subsidized, flat-rate subscriptions don't survive a delta like that.
How is pricing shaking out? Users are seeing:
- Explicit hikes. GPT-5.5 launched at roughly 2x the per-token cost of its predecessor.
- Stealth hikes. Opus 4.7 kept the same sticker price, but its new tokenizer generates up to 35% more tokens for the same prompts. Same rate, higher bill.
- Access changes. Codex shifted to per-token billing. Anthropic cut off OpenClaw from Claude subscriptions. Google added spend caps on the Gemini API.
The goodwill built through the “product” model above is undone bit by bit every time a developer suddenly finds their access locked or their limits hit. Meanwhile, the floor under open-source pricing continues to drop. MiniMax M2.7 on Runware runs at $0.30/$1.20 per million input/output tokens. For comparison, Opus 4.7 is $5/$25, and GPT-5.5 is $5/$30. Agents are output-heavy, and output is where the gap is widest. A non-trivial coding task can easily run through hundreds of thousands of output tokens. At GPT-5.5 ($30/M out), each task costs several dollars. At M2.7 ($1.20/M out), it's pennies.
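To make the gap concrete, here is a small sketch of the per-task arithmetic using the output prices quoted above. The 200,000-token task size is an assumed stand-in for "hundreds of thousands", and input-token cost is ignored.

```typescript
// Output-token prices in USD per million tokens, as quoted in the text.
const OUTPUT_PRICES = [
  { model: "GPT-5.5", pricePerM: 30 },
  { model: "Opus 4.7", pricePerM: 25 },
  { model: "MiniMax M2.7", pricePerM: 1.2 },
];

// Cost in USD of the output tokens for a single task.
function outputCostUSD(outputTokens: number, pricePerM: number): number {
  return (outputTokens * pricePerM) / 1000000;
}

// Assumption: one non-trivial coding task emits 200k output tokens.
const TOKENS_PER_TASK = 200000;

OUTPUT_PRICES.forEach(({ model, pricePerM }) => {
  const cost = outputCostUSD(TOKENS_PER_TASK, pricePerM);
  console.log(model + ": $" + cost.toFixed(2));
});
// GPT-5.5: $6.00
// Opus 4.7: $5.00
// MiniMax M2.7: $0.24
```

At that task size the difference is 25x per task, and it scales linearly with every additional agent run.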
Per-token billing is the right model for agentic workloads, and the labs know it. Done right, it means paying for the seconds of inference you actually run: no minimums, no commitments, no rounding up. The frontier labs aren't there yet, because their customer base anchored on "flat rate, unlimited," and they can't walk that back cleanly. So they're raising prices without raising prices, restricting access without restricting access, and hoping no one tallies the cumulative effect.
Harnesses are becoming more important than models
Lock-in to a single model is dissipating. Agents must route across models and modalities for any task at hand.
This is a fundamental premise of Runware. We are a single destination for 400k+ models across modalities, making model choice a simple config decision. This needs to happen for two reasons.
First, frontier agents can use multiple models to perform tasks.
- Claude Code uses a main model to do the heavy lifting, while a small model handles cheap background jobs like the one-line summaries of your sessions. A separate model answers the side-questions feature, so a quick question mid-task doesn't interrupt the main model.
- A message to ChatGPT doesn't hit one model. A small router reads the request first and decides where it goes: a quick factual question to the fast model, a hard reasoning or coding task to the deeper one. The router doesn't reason or generate anything itself. It dispatches. Easy work goes to the cheap model, hard work to the expensive one, so you stop paying frontier rates for trivial requests.
The dispatcher can be cheap and route to expensive models to do the work. Once model choice is just a configuration, you can build the same thing on open source, with one addition the closed products don't offer: you can pick models by what they're good at, not just by size.
- Reasoning and planning. Breaking a request into steps and deciding when it's done wants the strongest reasoning available. That can now be an open-source model like DeepSeek V4 or Kimi K2.6, both within a few months of the closed frontier.
- Code generation. Turning a clear spec into a working file suits a coding-tuned model. Something in the Qwen family handles it at a fraction of the reasoning model's cost.
- The conversational layer. Whatever the end user talks to benefits from a more conversational model, such as Gemma.
This works much better for cost. A chat turn that lands on a conversational model never pays frontier rates, and neither does a code generation step that lands on a coding model. As long as each model is capable at its own job, your effective per-request cost is that of one small-to-medium model, well below what a single large model would cost doing every job itself.
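A dispatcher of this kind can be sketched in a few lines. Everything below is hypothetical: the task categories, the keyword rules, and the placeholder model IDs are illustrations, not a real catalog or any product's actual router. In practice the classifier would more likely be a small model call than a set of regular expressions.

```typescript
// Hypothetical task categories, mirroring the three roles described above.
type TaskKind = "plan" | "code" | "chat";

// Model choice as plain configuration. IDs are placeholders, not real names.
const MODEL_FOR_TASK: Record<TaskKind, string> = {
  plan: "reasoning-model",
  code: "coding-model",
  chat: "conversational-model",
};

// A deliberately cheap dispatcher: keyword rules, no generation.
function classify(request: string): TaskKind {
  const text = request.toLowerCase();
  if (/\b(plan|break down|steps|decide)\b/.test(text)) return "plan";
  if (/\b(implement|function|refactor|bug|code)\b/.test(text)) return "code";
  return "chat";
}

// Route a request to the model configured for its task kind.
function route(request: string): string {
  return MODEL_FOR_TASK[classify(request)];
}

console.log(route("Plan the migration in steps"));   // reasoning-model
console.log(route("Implement the parser function")); // coding-model
console.log(route("Thanks, that helps!"));           // conversational-model
```

Swapping a model then means editing one entry in `MODEL_FOR_TASK`, which is the sense in which model choice becomes a config decision.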
Which brings us to the second reason: workflows necessitate multimodality. Most useful agents touch more than text. Say you're building an agent that turns a written product brief into a 60-second launch video. The pipeline hits:
- A text model for the script.
- A TTS model for the voiceover.
- An image model for background visuals.
- A video model for b-roll.
- A text model again for captions.
Without a unified API, the agent is wired into five separate services, each with its own SDK, auth flow, rate limit, billing pipeline, and error semantics. That's a layer of integration your team writes once and maintains forever. A unified API across modalities removes that layer.
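The launch-video pipeline above might look like this against a single call shape. This is a sketch under stated assumptions: the `InferenceClient` interface, the `run` method, and all model IDs are hypothetical and are not Runware's or any vendor's actual SDK. A stub client stands in for the network so the example runs as written.

```typescript
// Hypothetical unified client: one request shape for every modality.
type Modality = "text" | "speech" | "image" | "video";

interface InferenceRequest {
  modality: Modality;
  model: string; // placeholder model ID
  input: string;
}

interface InferenceClient {
  // Resolves to generated text or, for media, an asset reference.
  run(req: InferenceRequest): Promise<string>;
}

// Stub implementation so the sketch runs without network access.
const stubClient: InferenceClient = {
  async run(req) {
    return `[${req.modality}:${req.model}] ${req.input.slice(0, 24)}`;
  },
};

async function briefToLaunchVideo(client: InferenceClient, brief: string) {
  const script = await client.run({ modality: "text", model: "script-llm", input: brief });
  // Voiceover, visuals, and b-roll depend only on the script, so they run in parallel.
  const [voiceover, background, broll] = await Promise.all([
    client.run({ modality: "speech", model: "tts-model", input: script }),
    client.run({ modality: "image", model: "image-model", input: script }),
    client.run({ modality: "video", model: "video-model", input: script }),
  ]);
  const captions = await client.run({ modality: "text", model: "caption-llm", input: script });
  return { script, voiceover, background, broll, captions };
}

briefToLaunchVideo(stubClient, "Launch brief: a faster checkout flow").then((assets) => {
  console.log(Object.keys(assets).join(", "));
});
```

The point of the design is that auth, retries, and error handling live in one client instead of five.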
The long-term shape of this is agents deciding which models to use for their task lists in an LLM market economy. The developer writes the initial spec. After that, the agent shops the spread across providers in real time. Model capability becomes a commodity.
Open source will build better infrastructure
Once the model question is settled, the next question is latency and placement.
Agents amplify latency in a way that chat never did. A user opening a support chat notices 500ms once. With an agent making 50 sequential calls to plan, route, and execute, that 500ms per call compounds into a 25-second delay the user actually waits through. Add the cold-start, retrieval, and tool-call hops in a real agent loop, and round-trip latency starts to dominate the user experience.
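The compounding is simple multiplication, which is exactly why small per-call savings matter. A minimal sketch follows; the 200 ms saving in the last line is an assumed figure for illustration, not a measurement.

```typescript
// Total wait for an agent that makes its calls strictly one after another.
function sequentialLatencyMs(calls: number, perCallMs: number): number {
  return calls * perCallMs;
}

console.log(sequentialLatencyMs(1, 500) / 1000);  // 0.5 (one chat turn)
console.log(sequentialLatencyMs(50, 500) / 1000); // 25 (50-step agent loop)

// Assumption: closer placement trims 200 ms of round trip per call.
// The saving is multiplied by the number of steps.
console.log(sequentialLatencyMs(50, 300) / 1000); // 15
```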
Vertical integration will be the answer to bringing costs down while increasing speed. Owning the boards, servers, and orchestration end-to-end beats commodity GPU clouds on utilization, which is where most of the cost actually sits. A custom inference stack on hardware you control gives you headroom on both axes:
- Lower marginal cost per token. On-demand hyperscaler H100 rental costs about $7/hour, while buying chips and running them directly puts Runware's equivalent rate at about $1.60/hour. Idle capacity from one workload also becomes usable seconds for another, rather than sitting paid for and empty.
- Lower latency on the calls that matter. Pods placed close to traffic beat hyperscaler regions on round-trip time. Tuning the stack purely for inference (high-frequency CPUs, disabled hyperthreading, custom PCIe topology) also yields more performance per chip. That compound effect carries through every step in an agent loop.
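The rental-versus-owned comparison in the first point, and the role utilization plays in it, can be worked through directly. The hourly rates are the ones quoted above. The utilization levels are assumptions chosen purely for illustration.

```typescript
// Hourly H100 rates quoted above, in USD.
const HYPERSCALER_RATE = 7.0;
const OWNED_RATE = 1.6;

// Fractional saving of the owned rate relative to the rental rate.
function savingsFraction(rental: number, owned: number): number {
  return (rental - owned) / rental;
}

// Cost of one hour of useful work when the GPU is busy only part of the time.
function costPerBusyHour(hourlyRate: number, utilization: number): number {
  return hourlyRate / utilization;
}

console.log((savingsFraction(HYPERSCALER_RATE, OWNED_RATE) * 100).toFixed(0) + "%"); // 77%

// Assumed utilization levels (illustrative only).
console.log(costPerBusyHour(HYPERSCALER_RATE, 0.5).toFixed(2)); // 14.00
console.log(costPerBusyHour(OWNED_RATE, 0.8).toFixed(2));       // 2.00
```

Every idle hour raises the effective price of the busy ones, so filling idle capacity with other workloads lowers cost even at a fixed hourly rate.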
Frontier labs are heading the other way. To make the compute economics work at their scale, they sign multi-year deals with the hyperscalers: OpenAI on Azure, Anthropic on AWS and GCP. That locks them into the datacenter model for years. The upstream is volatile enough that Architect have launched compute futures on H100 and H200 prices. The input to every hyperscaler API endpoint is now a hedged commodity.
Modular compute is what unlocks the alternative. A pod in a shipping container, dropped near the traffic, routes around the 2-to-4-year wait for new AI data center capacity. A pod with power, cooling, and an uplink is enough. The placement decision drops from quarters to weeks. For scale: xAI's 300 MW Colossus build took four months and depended on rented power generators and a large share of the mobile cooling capacity available in the US. A factory shipping containerized inference pods can deploy equivalent compute in days, with everything owned and water-cooled in place.
The regulatory case is the second tailwind. Multiple frameworks push in the same direction:
- GDPR and the EU AI Act: inference inside European borders, with auditable controls.
- FedRAMP and CMMC: US federal and defense workloads, with explicit hardware security postures.
The answer that satisfies all of these is a regional pod running open-source models you can audit. A US hyperscaler endpoint serving Frankfurt traffic is not the answer, no matter how good the model is.
Today is closed; tomorrow is open
Open-source models are now good enough for most production work, and frontier closed models are repricing to cover their actual costs.
The frontier will still be there. Closed labs will keep pushing the leading edge, and there will keep being workloads where that edge is worth paying for. But the share of work that needs the frontier is shrinking, and the share that runs well on an open-source model with the right infrastructure is growing.
The meaningful decision has moved up the stack. The platforms that win the next phase will be the ones that put open source first, expose every modality behind a single API, charge for the inference you actually run, and operate on hardware designed entirely for this purpose and placed close to where you serve traffic from.
The model is becoming the easy part. The stack underneath it is where the next few years of competitive advantage live.
Where to start
Runware is building exactly this. One API spans every leading open and closed LLM alongside image, video, audio, and 3D, billed per token with no subscriptions. Open-source models run on our own hardware at up to 80% lower cost, and you pick each model by what it does best.
See it in full on the LLM API page, with live pricing and the open-versus-closed comparison drawn from current benchmarks. When you are ready, get an API key and run your first open-source model in minutes. For committed-use rates and dedicated capacity, talk to us.
