LLM API Pricing 2026 — Compare GPT-5, Claude 4, Gemini 2.5, DeepSeek Costs
April 2026: GPT-5.4 $2.50/M, Claude Sonnet $3/$15, Gemini Flash $0.30, DeepSeek $0.28. Compare 30+ LLM prices. Find the cheapest API for your app.
LLM API Pricing Comparison — April 2026
Last updated: April 9, 2026 — Claude Opus 4.6 now available with 1M context
TL;DR — LLM API Pricing as of April 2026
- Cheapest: Gemini 2.0 Flash-Lite — $0.075/$0.30 per 1M tokens
- Best Value: DeepSeek V3.2 — $0.28/$0.42 per 1M tokens
- Best Overall: GPT-5.4 — $2.50/$10 per 1M tokens
- Best Mid-tier: Claude Sonnet 4.6 ($3/$15) or GPT-5.2 ($1.75/$14)
- Premium: Claude Opus 4.6 — $5/$25 (now with 1M context) | GPT-5.2 Pro — $21/$168
- Free Tier: Google Gemini (free on most models)
Two years ago, running a flagship LLM cost $10 per million input tokens. Today you can get a better model for a quarter of that price — and a perfectly adequate one for a hundredth. The collapse in inference costs has reshaped what's economically feasible with AI, from side-project chatbots to enterprise-scale document processing pipelines that chew through millions of pages a month.
DeepSeek blew up the pricing floor. OpenAI responded with aggressive cuts across the GPT-5 family. Google dangled free tiers that actually work. Anthropic dropped Opus pricing by 67% and expanded its context window to 1M tokens. The result: choosing the wrong model for your workload can cost you 100x more than necessary for the same quality output.
This guide covers every major LLM API with current pricing, real cost examples, and practical advice on which model to pick. We track official rates from all providers and update weekly. If you want the full story on why costs cratered so fast, read our breakdown of how inference economics collapsed 150x in under two years.
Quick Answer: Which Model Should You Use?
Before diving into the full tables, here's what most developers actually need. The right model depends on what you're building, not what scores highest on benchmarks. A model that aces MMLU might still be the wrong pick if it doubles your API bill for marginal quality gains your users never notice.
- Cheapest option that works: Gemini 2.0 Flash-Lite at $0.075/$0.30 per million tokens. Hard to beat for simple classification, routing, and light extraction tasks.
- Best bang for the buck: DeepSeek V3.2 at $0.28/$0.42. Surprisingly capable for the price, with 90% cache discounts that make repeated-context workloads almost free.
- Best mid-tier all-rounder: Claude Sonnet 4.6 ($3/$15) or GPT-5.2 ($1.75/$14). Both handle complex tasks well. Sonnet follows nuanced instructions more faithfully; GPT-5.2 is cheaper and faster.
- When you need the absolute best: Claude Opus 4.6 ($5/$25) or GPT-5.2 Pro ($21/$168). Reserve these for tasks where accuracy directly impacts revenue or safety. Opus 4.6 now supports 1M token context, putting it in the same league as Gemini 2.5 Pro for long-document analysis — at a fraction of the per-token premium.
- Free and open: Llama 4 and Gemini's free tier cost nothing for prototyping. Start here if you're exploring and don't want to worry about bills.
The spread between cheap and premium now exceeds 1,000x. A request that costs $0.0001 on Mistral Nemo runs $0.10+ on GPT-5.2 Pro. That makes model selection one of the highest-leverage decisions in any AI product. Getting this wrong by even one tier can mean the difference between a profitable feature and one that bleeds cash every month.
All Provider Pricing (April 2026)
OpenAI
Source: openai.com/api/pricing
OpenAI runs the broadest lineup of any provider, stretching from the ultra-cheap GPT-5 Nano all the way up to the research-grade GPT-5.2 Pro. The March 2026 launch of GPT-5.4 added a new flagship tier that sits between GPT-5 and GPT-5.2 — strong reasoning at a competitive price. OpenAI also maintains its o-series reasoning models, with o3 and o4-mini now the recommended options as o1 and o1-pro get phased out.
The real story at OpenAI this year is market segmentation. GPT-5 Nano at $0.05/M input competes directly with Gemini Flash on price while living inside the OpenAI ecosystem — same API, same tooling, same billing. That matters for teams already invested in OpenAI's function calling format and structured output modes. Switching providers to save a few cents per million tokens rarely makes sense when you factor in engineering time for integration changes.
Between GPT-5 Nano and GPT-5.2 Pro, OpenAI offers six distinct price points. That granularity lets you match cost to capability with precision. A customer-support bot that handles FAQs can live on Nano at $0.05/M, while the same team's research agent burns through GPT-5.2 Pro tokens at $21/M — both on the same billing account, same SDK, same retry logic. No other provider makes that kind of vertical stacking so painless.
| Model | Input/M | Output/M | Cached Input/M | Context | Best For |
|---|---|---|---|---|---|
| GPT-5.2 Pro | $21.00 | $168.00 | $2.10 | 200K | Hardest reasoning tasks |
| GPT-5.2 | $1.75 | $14.00 | $0.175 | 200K | Coding, agents |
| GPT-5.4 | $2.50 | $10.00 | $0.25 | 200K | New flagship, balanced |
| GPT-5 | $1.25 | $10.00 | $0.125 | 128K | General flagship |
| GPT-5 Mini | $0.25 | $2.00 | $0.025 | 200K | Fast, affordable |
| GPT-5 Nano | $0.05 | $0.40 | $0.005 | 128K | High-volume simple tasks |
| o4-mini | $1.10 | $4.40 | $0.275 | 200K | Best value reasoning |
| o3 | $2.00 | $8.00 | $1.00 | 200K | Mid-tier reasoning |
| o3-pro | $20.00 | $80.00 | — | 200K | Strong reasoning |
| o1 | $15.00 | $60.00 | $7.50 | 200K | Legacy reasoning |
| GPT-4.1 | $2.00 | $8.00 | $0.20 | 1M | Previous gen |
| GPT-4.1 Mini | $0.40 | $1.60 | $0.04 | 1M | Previous gen budget |
| GPT-4.1 Nano | $0.10 | $0.40 | $0.01 | 1M | Previous gen fast |
OpenAI's Batch API gives 50% off all models for async workloads processed within 24 hours. Cached input tokens cost roughly 10% of the standard input price for the GPT-5 and GPT-4.1 families; the o-series discounts are shallower (o4-mini's cached rate is 25% of list, o3's is 50%). For teams processing large batches of documents overnight, stacking batch and caching can cut costs by 90%+ compared to synchronous, uncached calls. The GPT-4.1 family still offers a compelling 1M token context window at mid-range prices, which makes it a reasonable choice for teams that need long-context support without paying Gemini 2.5 Pro rates.
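To make the stacking concrete, here's a minimal sketch of the effective input cost. The rates are the ones quoted in this guide; the 80% cache hit rate is an assumption you'd measure in production.

```python
# Sketch: effective $/M input for OpenAI when stacking cache + batch discounts.

def effective_input_cost(list_price: float, cached_price: float,
                         cache_hit_rate: float, batch: bool) -> float:
    """Blend cached and uncached input pricing, then apply the 50% batch discount."""
    blended = cache_hit_rate * cached_price + (1 - cache_hit_rate) * list_price
    return blended * 0.5 if batch else blended

# GPT-5.2: $1.75/M list, $0.175/M cached, assuming an 80% cache hit rate
print(effective_input_cost(1.75, 0.175, cache_hit_rate=0.8, batch=True))
# -> 0.245, i.e. ~86% below the $1.75 list price
```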
For a full breakdown of every OpenAI model and tier, see our dedicated OpenAI API pricing guide.
Anthropic
Source: claude.com/pricing
Anthropic keeps things simple with a lean three-model lineup. The big news this spring: Opus 4.6 now ships with a 1M token context window, matching GPT-4.1 and approaching Gemini 2.5 Pro territory. That expansion makes Opus a legitimate choice for long-document analysis and multi-file codebase reasoning — use cases that previously required Gemini or chunking strategies. Opus 4.6 also dropped 67% from the previous Opus 4.1 pricing ($15/$75 down to $5/$25), making it competitive with GPT-5.2 on cost while maintaining Anthropic's edge in instruction-following and safety.
Claude Sonnet 4.6 remains the workhorse for most production applications — strong at coding, analysis, and structured output. Haiku 4.5 covers the budget tier for classification, chat routing, and extraction. Where Anthropic really differentiates is behavior consistency: Claude models tend to follow complex, multi-constraint instructions more reliably than competitors, which reduces the number of retries and manual overrides you need in production. That reliability gap is hard to measure on benchmarks but shows up clearly in error rates and support tickets.
Anthropic recently passed $30 billion in annualized revenue run-rate, a sign that plenty of teams have concluded Claude's pricing is worth the premium. The instruction-following advantage becomes more pronounced on agentic workflows where a model needs to stay on task across dozens of tool-use loops. A model that drifts off-plan on loop 12 doesn't just produce bad output — it wastes the tokens you spent on the first 11 loops too. For a deeper look at Anthropic's pricing tiers and when each model makes sense, see our Anthropic API pricing guide.
| Model | Input/M | Output/M | Cached Input/M | Context | Best For |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | 1M | Complex analysis, research, long docs |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 200K | Coding, balanced tasks |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K | Fast classification, chat |
Batch API saves another 50%. Prompt caching saves 90% on input tokens, and those discounts stack — cached batch input for Opus runs at roughly $0.25/M, which is 95% below the list price. Legacy Claude 3 Haiku ($0.25/$1.25) is deprecated and retiring in April 2026. If you're still running workloads on it, migrate to Haiku 4.5 before the cutoff.
Google Gemini
Source: ai.google.dev/pricing
Google has the widest range of pricing and the most generous free tier in the industry. Gemini 2.5 Pro's 2M token context window remains the largest commercially available option (tied with xAI's Grok 4 and Grok 4.1 Fast), and the tiered pricing means you only pay the higher rate when you exceed 200K tokens. The preview releases of Gemini 3.1 Pro and Gemini 3 Flash hint at where Google is heading — more capable models at the same price points.
Google's free tier deserves special attention because it's genuinely useful, not just a marketing gimmick. You get roughly 1,000 requests per day on Gemini 2.5 Flash, Flash-Lite, and 2.0 Flash — enough for internal tools, prototyping, and low-traffic production apps. Several startups have shipped their MVP entirely on Google's free tier for months before upgrading. That kind of zero-risk experimentation accelerates adoption in a way that $5 minimum commitments can't match. Gemini also stands apart on multimodal pricing: image, audio, and video input tokens are priced at the same rate as text, which makes it the default choice for workflows that mix modalities.
| Model | Input/M | Output/M | Cached Input/M | Context | Best For |
|---|---|---|---|---|---|
| Gemini 3.1 Pro (preview) | $2.00 (≤200K) / $4.00 | $12.00 (≤200K) / $18.00 | — | 200K+ | Next-gen flagship |
| Gemini 3 Flash (preview) | $0.50 | $3.00 | — | — | Fast next-gen |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $0.31 | 2M | Long documents, analysis |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $0.63 | 2M | Very long context |
| Gemini 2.5 Flash | $0.30 | $2.50 | $0.075 | 1M | Fast mid-tier |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | — | 1M | Budget current-gen |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.025 | 1M | Ultra cheap, proven |
| Gemini 2.0 Flash-Lite | $0.075 | $0.30 | — | 1M | Cheapest mainstream |
Google's tiered pricing on Gemini 2.5 Pro means that most developers — those working with documents under 200K tokens — never pay the premium rate. Context-sensitive pricing like this is something OpenAI and Anthropic haven't adopted yet, and it makes Gemini noticeably cheaper for the majority of real-world workloads. For more detail on every Google model and the free tier limits, see our Gemini API pricing guide.
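As a concrete illustration of the tiering, here's a small cost helper. It assumes the whole request is billed at the higher tier once the prompt exceeds 200K tokens; verify the exact tiering rule at ai.google.dev/pricing before relying on it.

```python
# Hypothetical cost helper for Gemini 2.5 Pro's tiered pricing, assuming the
# entire request is billed at the higher rate once input exceeds 200K tokens.

def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 1.25, 10.00   # $/M, <=200K tier
    else:
        in_rate, out_rate = 2.50, 15.00   # $/M, >200K tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gemini_25_pro_cost(100_000, 2_000))  # 0.145 -- matches the long-doc table below
print(gemini_25_pro_cost(500_000, 2_000))  # 1.28  -- premium tier kicks in
```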
DeepSeek
Source: api-docs.deepseek.com/pricing
DeepSeek shook the industry when it launched V3 at a fraction of competitors' prices in late 2025, and V3.2 continued the trend. Unified pricing for both chat and reasoning modes means you don't pay a premium for chain-of-thought — a meaningful advantage over providers that charge 2-4x more for reasoning-optimized models. The 90% cache discount makes DeepSeek particularly attractive for applications with long, stable system prompts.
For teams that prototyped on expensive models and now need to cut costs for production, DeepSeek is often the first place to look. Many engineering teams report that swapping a GPT-5 call for DeepSeek V3.2 on straightforward classification and extraction tasks produces nearly identical output quality at 4-5x lower cost. The catch is that DeepSeek's quality gap widens on tasks requiring deep multi-step reasoning or very precise instruction adherence — the same tasks where Claude and GPT-5.2 justify their premium. We covered the full story of how DeepSeek disrupted the pricing landscape in our piece on how a $6M model shattered the AI scaling myth.
| Model | Input/M | Output/M | Cached Input/M | Context | Best For |
|---|---|---|---|---|---|
| DeepSeek V3.2 (Chat) | $0.28 | $0.42 | $0.028 | 128K | General tasks, very cheap |
| DeepSeek V3.2 (Reasoner) | $0.28 | $0.42 | $0.028 | 128K | Reasoning, same price |
The main trade-off is reliability. DeepSeek has experienced periodic capacity constraints and higher latencies during peak hours, especially for users outside Asia. Some teams mitigate this by using DeepSeek as their primary provider and falling back to a secondary provider (typically Gemini Flash or GPT-5 Mini) when latency spikes. If you want the full picture on DeepSeek's strengths and limitations, our DeepSeek API pricing page goes deeper.
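A minimal sketch of that fallback pattern, assuming hypothetical `call_deepseek` and `call_gemini_flash` wrappers around each provider's SDK:

```python
# Primary/fallback routing on latency: try the cheap provider first, fall
# back when it times out or errors. Both call_* functions are hypothetical
# wrappers -- substitute your own SDK clients and tune the timeout.
import concurrent.futures

def call_deepseek(prompt: str) -> str:       # hypothetical SDK wrapper
    raise NotImplementedError

def call_gemini_flash(prompt: str) -> str:   # hypothetical SDK wrapper
    raise NotImplementedError

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def complete_with_fallback(prompt: str, timeout_s: float = 10.0) -> str:
    """Try the cheap primary first; fall back on timeout or provider error."""
    future = _pool.submit(call_deepseek, prompt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:                        # latency spike or API failure
        return call_gemini_flash(prompt)
```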
xAI (Grok)
Source: docs.x.ai/developers/models
xAI ships two models with a 2M token context window — the joint-largest available alongside Gemini 2.5 Pro. Grok 4 competes at the mid-to-premium tier, while Grok 4.1 Fast provides a budget option with the same massive context. New users get $25 in free credits, enough to run several thousand requests before paying.
| Model | Input/M | Output/M | Context | Best For |
|---|---|---|---|---|
| Grok 4 | $3.00 | $15.00 | 2M | Large context reasoning |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | Budget with huge context |
Grok 4.1 Fast is a sleeper pick for budget-conscious teams that need massive context windows. At $0.20/$0.50 per million tokens, you get a 2M context window for less than Gemini 2.5 Flash ($0.30/$2.50), with xAI's reasoning capabilities on top. The trade-off is a smaller ecosystem, fewer third-party integrations, and less community documentation compared to OpenAI or Anthropic. If your workload is specifically long-document analysis and you don't need a deep library of function-calling patterns, Grok 4.1 Fast is worth benchmarking.
Mistral
Source: mistral.ai/pricing
Mistral is the leading European AI lab, and its models are popular with companies that need EU-hosted inference for GDPR compliance. Mistral Large 3 competes with mid-tier offerings from OpenAI and Anthropic. Mistral Nemo, at $0.02/M for both input and output, is technically the cheapest named model on any major provider — though its capabilities are limited to lightweight tasks like simple classification and template filling.
| Model | Input/M | Output/M | Context | Best For |
|---|---|---|---|---|
| Mistral Large 3 | $2.00 | $6.00 | 128K | European hosting, GDPR |
| Mistral Medium 3 | $0.40 | $2.00 | 128K | Mid-tier tasks |
| Mistral Nemo | $0.02 | $0.02 | 128K | Lightweight tasks |
| Ministral 8B | $0.10 | $0.10 | 128K | Cheapest Mistral option |
Mistral's GDPR story is its real differentiator. For European companies processing customer data, the compliance overhead of using a US-hosted provider often outweighs any cost savings. Mistral Large 3 at $2/$6 is a reasonable price for models hosted entirely within the EU, with data residency guarantees that OpenAI and Anthropic don't match. If your legal team has strong opinions about where data lives, Mistral saves you from months of compliance paperwork.
Meta Llama (Open Weights — Self-Hosted)
| Model | API Cost | Context | Notes |
|---|---|---|---|
| Llama 4 | Free | 200K | Host yourself or use a provider |
| Llama 3.3 | Free | 128K | Proven, well-supported |
Llama models are free to download but you pay for compute. Typical hosted pricing through providers like Together, Fireworks, or Groq ranges from $0.05-$0.90/M tokens depending on model size and provider. For teams with existing GPU infrastructure, self-hosting Llama eliminates per-token costs entirely — you just pay for electricity and hardware amortization. The break-even point depends on your volume: at 10M+ tokens per day, self-hosting usually wins. Below that, hosted APIs are simpler and cheaper when you factor in engineering time.
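To sanity-check that break-even for your own numbers, a helper like the sketch below works; the $500/month infrastructure figure and the $0.90/M hosted rate in the usage line are illustrative assumptions, not benchmarks.

```python
# Back-of-envelope break-even: at what daily volume does flat-cost
# self-hosting beat per-token API pricing? Supply your own infra cost
# (GPU amortization + ops) and the hosted rate you'd otherwise pay.

def break_even_tokens_per_day(monthly_infra_cost: float,
                              api_price_per_m: float) -> float:
    """Daily token volume above which self-hosting is cheaper than the API."""
    return monthly_infra_cost / 30 / api_price_per_m * 1e6

# e.g. $500/mo of amortized GPU + ops vs. a $0.90/M hosted Llama endpoint
print(break_even_tokens_per_day(500, 0.90))   # ~18.5M tokens/day
```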
The open-weights ecosystem has matured significantly over the past year. Llama 4 brought 200K context and substantially improved reasoning, narrowing the gap with proprietary mid-tier models like GPT-5 Mini and Gemini 2.5 Flash. Fine-tuned Llama variants now power a meaningful share of production AI workloads, especially in regulated industries where data can't leave the organization's infrastructure. The open-source community around Llama also means a wealth of quantized variants, adapter tooling, and deployment scripts that lower the barrier to self-hosting.
Alibaba Qwen
Alibaba's Qwen family has become a serious contender in the global LLM market, especially after the Qwen 3.5 Plus release in early 2026. Pricing is aggressive — Qwen 3.5 Plus runs at roughly $1.20/M input tokens, undercutting most Western mid-tier models. Qwen models are popular with teams building apps for the Chinese market, but their multilingual capabilities have also attracted developers worldwide. The models are available through Alibaba Cloud's API and through several third-party hosting providers. For teams already on Alibaba Cloud infrastructure, the integration is seamless and there are additional volume discounts.
| Model | Input/M | Output/M | Context | Best For |
|---|---|---|---|---|
| Qwen 3.5 Plus | $1.20 | $4.80 | 128K | Multilingual, cost-effective |
| Qwen 3.5 Turbo | $0.30 | $0.90 | 128K | Fast, cheap tasks |
Price Ranking: Cheapest to Most Expensive
This table ranks every model by blended cost, assuming a 1:1 ratio of input to output tokens. Real-world ratios vary by use case — chatbots tend to produce more output than input, while classification tasks are heavily input-weighted. Adjust accordingly.
| Rank | Model | Input/M | Output/M | Blended $/M |
|---|---|---|---|---|
| 1 | Mistral Nemo | $0.02 | $0.02 | $0.02 |
| 2 | Ministral 8B | $0.10 | $0.10 | $0.10 |
| 3 | Gemini 2.0 Flash-Lite | $0.075 | $0.30 | $0.19 |
| 4 | GPT-5 Nano | $0.05 | $0.40 | $0.23 |
| 5 | Gemini 2.0 Flash | $0.10 | $0.40 | $0.25 |
| 6 | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.25 |
| 7 | DeepSeek V3.2 | $0.28 | $0.42 | $0.35 |
| 8 | Grok 4.1 Fast | $0.20 | $0.50 | $0.35 |
| 9 | Qwen 3.5 Turbo | $0.30 | $0.90 | $0.60 |
| 10 | GPT-5 Mini | $0.25 | $2.00 | $1.13 |
| 11 | Mistral Medium 3 | $0.40 | $2.00 | $1.20 |
| 12 | Gemini 2.5 Flash | $0.30 | $2.50 | $1.40 |
| 13 | o4-mini | $1.10 | $4.40 | $2.75 |
| 14 | Qwen 3.5 Plus | $1.20 | $4.80 | $3.00 |
| 15 | Mistral Large 3 | $2.00 | $6.00 | $4.00 |
| 16 | o3 / GPT-4.1 | $2.00 | $8.00 | $5.00 |
| 17 | Gemini 2.5 Pro | $1.25 | $10.00 | $5.63 |
| 18 | GPT-5 | $1.25 | $10.00 | $5.63 |
| 19 | GPT-5.4 | $2.50 | $10.00 | $6.25 |
| 20 | GPT-5.2 | $1.75 | $14.00 | $7.88 |
| 21 | Claude Sonnet 4.6 | $3.00 | $15.00 | $9.00 |
| 22 | Grok 4 | $3.00 | $15.00 | $9.00 |
| 23 | Claude Opus 4.6 | $5.00 | $25.00 | $15.00 |
| 24 | o3-pro | $20.00 | $80.00 | $50.00 |
| 25 | GPT-5.2 Pro | $21.00 | $168.00 | $94.50 |
| 26 | o1-pro | $150.00 | $600.00 | $375.00 |
The range is staggering: the cheapest usable model (Mistral Nemo at $0.02/M blended) is nearly 19,000x cheaper than o1-pro ($375/M). Even excluding the extremes, there's a 65x spread between GPT-5 Nano ($0.23/M blended) and Claude Opus 4.6 ($15/M blended). That spread creates real strategic decisions: the model you pick shapes your product's unit economics more than almost any other architectural choice.
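The blended column is a simple average, so it's easy to recompute for your own traffic mix. A sketch:

```python
# Reproduce the blended $/M column above and re-weight it for your own
# input:output ratio. A 1:1 mix gives the table's numbers; an input-heavy
# RAG-style 10:1 mix changes the ranking noticeably.

def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.5) -> float:
    return input_share * input_per_m + (1 - input_share) * output_per_m

print(blended_price(0.28, 0.42))                     # DeepSeek V3.2, 1:1 -> 0.35
print(blended_price(0.28, 0.42, input_share=10/11))  # 10:1 input-heavy  -> ~0.29
```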
Context Window Comparison
Context windows have expanded dramatically, and pricing often scales with them. Here's how the major models compare on maximum context size — a critical factor for document processing, codebase analysis, and long-conversation applications.
| Context Size | Models | Starting Input Price |
|---|---|---|
| 2M tokens | Gemini 2.5 Pro, Grok 4, Grok 4.1 Fast | $0.20/M |
| 1M tokens | Claude Opus 4.6, GPT-4.1, Gemini 2.5 Flash, Gemini 2.0 Flash, Gemini 2.0 Flash-Lite | $0.075/M |
| 200K tokens | GPT-5.4, GPT-5.2, GPT-5 Mini, Claude Sonnet 4.6, Claude Haiku 4.5, Llama 4, o3, o4-mini | $0.25/M |
| 128K tokens | GPT-5, GPT-5 Nano, DeepSeek V3.2, Mistral Large 3, Mistral Nemo, Llama 3.3, Qwen 3.5 Plus | $0.02/M |
Longer context windows aren't always better. Processing 1M+ tokens per request is expensive regardless of per-token price, and most models show quality degradation on information buried deep in the middle of very long contexts — a well-documented phenomenon known as "lost in the middle." Use the context you need, not the maximum available. For most applications — chatbots, code generation, summarization — 128K is more than enough. Long context shines for specific use cases like analyzing legal contracts, processing entire codebases, or maintaining very long conversation histories.
Real Cost Examples
Abstract per-million-token prices are hard to reason about. Here's what common workloads actually cost, using typical token counts measured from production applications.
Summarizing a 10-page document (~4,000 input tokens, ~500 output tokens)
| Model | Cost per doc | Cost for 1,000 docs |
|---|---|---|
| GPT-5 Nano | $0.0004 | $0.40 |
| Gemini 2.0 Flash | $0.0006 | $0.60 |
| DeepSeek V3.2 | $0.0013 | $1.33 |
| Claude Haiku 4.5 | $0.0065 | $6.50 |
| GPT-5.2 | $0.0140 | $14.00 |
| Claude Opus 4.6 | $0.0325 | $32.50 |
At the budget end, you can summarize 1,000 documents for under a dollar. Even Opus 4.6, the most expensive option here, costs just $32.50 for a thousand documents — a task that would have run several hundred dollars with GPT-4 Turbo in early 2024.
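These per-document numbers are straight per-token arithmetic on list prices, with no caching or batch discounts applied. A sketch of the math:

```python
# How the per-document costs above are derived from list prices.

def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# Summarization workload: ~4,000 input tokens, ~500 output tokens
print(request_cost(4_000, 500, 0.05, 0.40))    # GPT-5 Nano      -> $0.0004
print(request_cost(4_000, 500, 5.00, 25.00))   # Claude Opus 4.6 -> $0.0325
```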
Chatbot conversation (avg ~800 input tokens, ~400 output tokens per turn)
| Model | Cost per turn | Monthly (60K turns) |
|---|---|---|
| Gemini 2.0 Flash | $0.00024 | $14/mo |
| DeepSeek V3.2 | $0.00039 | $23/mo |
| GPT-5 Mini | $0.001 | $60/mo |
| Claude Haiku 4.5 | $0.0028 | $168/mo |
| Claude Sonnet 4.6 | $0.0084 | $504/mo |
The gap between $14/month and $504/month is the difference between a viable side project and a serious infrastructure expense. For high-volume consumer chatbots, the budget models make the business model possible. A chatbot handling 2,000 turns a day can run its entire inference stack on a budget model for the cost of a single SaaS subscription.
Code generation (avg ~2,000 input tokens, ~1,500 output tokens per request)
| Model | Cost per request | Monthly (500 requests/day) |
|---|---|---|
| GPT-5 Nano | $0.0007 | $10.50/mo |
| DeepSeek V3.2 | $0.0012 | $18.00/mo |
| GPT-5.2 | $0.0245 | $367/mo |
| Claude Sonnet 4.6 | $0.0285 | $427/mo |
| Claude Opus 4.6 | $0.0475 | $712/mo |
Code generation is output-heavy, which amplifies the cost gap between models. GPT-5 Nano handles boilerplate and simple scripts. For complex multi-file edits and architectural reasoning, Sonnet 4.6 and GPT-5.2 justify their price with meaningfully better output. For a deeper look at coding-specific costs and tools, see our Cursor vs Windsurf vs Claude Code comparison.
RAG pipeline (retrieval-augmented generation: ~8,000 input tokens, ~800 output tokens)
| Model | Cost per query | 50K queries/month |
|---|---|---|
| DeepSeek V3.2 | $0.0026 | $128/mo |
| GPT-5 Mini | $0.0036 | $180/mo |
| Gemini 2.5 Flash | $0.0044 | $220/mo |
| Gemini 2.5 Pro | $0.018 | $900/mo |
| Claude Sonnet 4.6 | $0.036 | $1,800/mo |
RAG pipelines are input-heavy because you stuff retrieved context into every request. This makes input price and caching discounts critical. DeepSeek's 90% cache discount on system prompts can cut costs dramatically — a cached RAG pipeline on DeepSeek V3.2 runs at roughly $0.0005 per query when most of the context is cached. Production RAG systems that serve high-traffic search products should treat caching architecture as a first-class concern, not an afterthought. The difference between a naive implementation and a cache-optimized one is often 10x on the monthly bill.
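A sketch of that cached-vs-uncached arithmetic, using DeepSeek V3.2's rates from the table above (the fully cached 8,000-token prefix is an assumption about your retrieval setup):

```python
# Per-query RAG cost split into cached prefix, fresh input, and output.

def rag_query_cost(cached_in: int, fresh_in: int, out_tokens: int) -> float:
    CACHED, FRESH, OUT = 0.028, 0.28, 0.42    # DeepSeek V3.2, $/M
    return (cached_in * CACHED + fresh_in * FRESH + out_tokens * OUT) / 1e6

print(rag_query_cost(8_000, 0, 800))    # fully cached prefix -> ~$0.00056
print(rag_query_cost(0, 8_000, 800))    # no caching          -> ~$0.00258
```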
Long-document analysis (~100,000 input tokens, ~2,000 output tokens)
| Model | Cost per doc | 100 docs/month |
|---|---|---|
| Gemini 2.5 Pro | $0.145 | $14.50/mo |
| GPT-5.4 | $0.270 | $27.00/mo |
| Grok 4 | $0.330 | $33.00/mo |
| Claude Opus 4.6 | $0.550 | $55.00/mo |
Long-document workloads — legal contracts, research papers, full codebase reviews — highlight the importance of context window pricing. Gemini 2.5 Pro leads here because its $1.25/M input rate applies to the first 200K tokens. Claude Opus 4.6's new 1M context window makes it viable for these tasks without chunking, which preserves quality at the cost of a higher per-token price.
Best Model by Use Case
High-volume chatbots and customer support
Pick: Gemini 2.0 Flash or DeepSeek V3.2
At $0.10-$0.28/M input, these handle simple Q&A and routing at pennies per thousand conversations. The quality works for FAQ responses, order status lookups, and first-level triage. Route the 5% of hard queries to Sonnet or GPT-5.2 as a fallback and keep overall costs low. This tiered approach — cheap model for most requests, expensive model for edge cases — has become the standard architecture for production chatbots. Teams that run everything through a single premium model are leaving money on the table.
Coding assistants and code generation
Pick: GPT-5.2 or Claude Sonnet 4.6
Both excel at code. GPT-5.2 is slightly cheaper ($1.75 vs $3.00 input). Claude Sonnet tends to follow complex instructions more precisely, especially for multi-step refactoring and tasks requiring large codebase context. For budget coding, DeepSeek V3.2 is surprisingly capable at $0.28/M — worth benchmarking against your specific use case before defaulting to a pricier model. For complex architectural decisions across entire repositories, Opus 4.6 with its 1M context window can reason about codebases that previously required chunking.
Document summarization and extraction
Pick: Gemini 2.5 Pro
The 2M context window means you can process entire documents without chunking, avoiding the accuracy loss that comes with splitting and recombining. At $1.25/M input (under 200K tokens), it's cheaper than Claude or GPT for long-context work. Gemini 2.5 Flash at $0.30/M is a solid alternative when you don't need the full 2M window.
Research and complex reasoning
Pick: Claude Opus 4.6 or GPT-5.2 Pro
For tasks where accuracy justifies the cost — legal analysis, scientific review, complex multi-step reasoning — these are the top-tier choices. Opus 4.6 at $5/$25 is dramatically cheaper than GPT-5.2 Pro at $21/$168 and often matches it on quality. Start with Opus and only escalate to GPT-5.2 Pro if you need that last increment of reasoning capability. For a detailed head-to-head comparison of these flagship models, check our GPT-5 vs Claude vs Gemini comparison.
Prototyping and experimentation
Pick: Gemini free tier or Llama 4 (self-hosted)
Gemini gives roughly 1,000 free requests per day. Llama 4 costs nothing if you have the hardware or can use a free-tier provider. Both eliminate cost as a barrier during development. Build your prototype on a free model, then upgrade to a paid model only after you've validated the product works.
Classification, tagging, and routing
Pick: GPT-5 Nano or Ministral 8B
Simple decision-making tasks don't need large models. At $0.05-$0.10/M input, you can classify millions of items for under $10. These models are fast, cheap, and accurate enough for binary decisions, sentiment analysis, and intent classification.
AI agent workflows
Pick: GPT-5.4 or Claude Sonnet 4.6
Agentic use cases — where a model calls tools, reasons over results, and takes multi-step actions — burn through tokens quickly. Each agent loop generates both input and output tokens, often across 5-20 iterations. A single complex agent task can consume 50K-200K tokens before producing a final result. GPT-5.4 at $2.50/$10 offers a strong balance of reasoning and cost for agentic workflows. Claude Sonnet 4.6's instruction-following strength helps agents stay on track across longer chains without going off-script. For a full breakdown of agent costs, see our AI agent pricing guide.
Cost Optimization Tips
Picking the right model is step one. The techniques below can cut your costs by another 50-95% on top of that. We covered these strategies in depth in our guide to saving 90% on inference costs, but here's the practical summary.
1. Prompt caching (saves 75-90%)
Every major provider now offers prompt caching. If your system prompt or few-shot examples stay the same across requests, cached tokens cost a fraction of the base price:
| Provider | Cache Savings | Cached Input Cost (flagship) |
|---|---|---|
| OpenAI | 90% off | $0.175/M (GPT-5.2) |
| Anthropic | 90% off | $0.50/M (Opus 4.6) |
| Google | 75% off | $0.31/M (Gemini 2.5 Pro) |
| DeepSeek | 90% off | $0.028/M (V3.2) |
With a 2,000-token system prompt sent 100K times, you're billing 200M system-prompt tokens: roughly $350 uncached on GPT-5.2, versus $35 cached. That's $315 saved per 100K requests on the system prompt alone. For applications with long, stable context (RAG pipelines, multi-turn conversations with persistent instructions), caching is the single biggest cost lever available.
2. Batch API (saves 50%)
OpenAI's Batch API processes requests asynchronously within 24 hours at half price. Anthropic offers the same deal. This works well for:
- Nightly data processing and ETL pipelines
- Bulk content generation and localization
- Evaluation and testing pipelines
- Any workload that doesn't need real-time responses
If you can tolerate 24-hour latency, batch processing should be your default for non-interactive workloads. The savings compound significantly when combined with caching — batch requests still benefit from cached tokens.
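For OpenAI, the flow is: write one request per line to a .jsonl file, upload it, then create a batch with a 24-hour completion window. A minimal sketch with the OpenAI Python SDK; the model identifier is illustrative, so use whatever your account exposes.

```python
# Minimal OpenAI Batch API sketch: upload a .jsonl of requests, create a
# batch with a 24h window. Billed at 50% of list price per the table above.
import json
from openai import OpenAI

client = OpenAI()

# One JSON object per line, each a self-contained chat completion request.
with open("requests.jsonl", "w") as f:
    for i, doc in enumerate(["doc one...", "doc two..."]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-5-mini",   # illustrative model id
                     "messages": [{"role": "user",
                                   "content": f"Summarize: {doc}"}]},
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```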
3. Model routing (saves 60-80%)
Don't send every request to your best model. Route by complexity:
Simple query → GPT-5 Nano ($0.05/M)
Medium query → GPT-5 Mini ($0.25/M)
Hard query → GPT-5.2 ($1.75/M)

If 70% of your traffic is simple, 20% medium, and 10% hard, your effective cost drops from $1.75/M to about $0.26/M, an 85% reduction. You can build a simple router using a cheap classifier model, or use heuristics like query length and keyword matching, as in the sketch below. Several startups now offer model-routing middleware that handles this automatically.
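Here's a crude heuristic router of that shape. The thresholds and keyword markers are assumptions to tune against your own traffic; production systems often replace them with a cheap classifier model.

```python
# Heuristic complexity router: cheap checks route most traffic to the
# cheapest tier, escalating only queries that look long or analytical.

def pick_model(query: str) -> str:
    q = query.lower()
    hard = ("prove", "refactor", "architecture", "step by step", "debug")
    if len(query) > 600 or any(w in q for w in hard):
        return "gpt-5.2"       # hard: long, analytical, or multi-step
    if len(query) > 120 or "explain" in q or "compare" in q:
        return "gpt-5-mini"    # medium: needs some reasoning
    return "gpt-5-nano"        # simple: lookups, routing, short Q&A

print(pick_model("Where is my order?"))                      # gpt-5-nano
print(pick_model("Refactor this module to remove globals"))  # gpt-5.2
```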
4. Output token management
Output tokens cost 4-8x more than input tokens across most providers. This asymmetry means controlling output length has an outsized impact on your bill. Practical ways to reduce output costs:
- Ask for structured JSON instead of verbose prose
- Set explicit max_tokens limits on every request
- Request bullet points instead of paragraphs
- Include "be concise" or "respond in under 100 words" in your system prompt (it measurably works)
- For extraction tasks, return only the extracted fields, not explanations
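A request shape that applies several of these at once (generic chat-completions style; parameter names vary by SDK, and some expose max_output_tokens or max_completion_tokens instead):

```python
# Output-cost controls in one request: a hard token ceiling plus a system
# prompt that demands terse, structured JSON instead of prose.

request = {
    "model": "gpt-5-mini",          # illustrative model id
    "max_tokens": 150,              # hard ceiling on billed output tokens
    "messages": [
        {"role": "system",
         "content": 'Respond only with JSON: {"sentiment": "...", "topics": []}. '
                    "No explanations, no prose."},
        {"role": "user",
         "content": "Review: The battery died after two days."},
    ],
}
```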
5. Cache-aware prompt design (DeepSeek and Anthropic)
Structure your prompts so stable context (system instructions, few-shot examples, reference documents) comes first, and variable content (the user query) comes last. This maximizes cache hit rates because providers cache from the beginning of the prompt. A well-structured prompt can achieve 80%+ cache hit rates in production, turning $0.28/M into $0.028/M on DeepSeek.
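A sketch of that ordering. The point is that the prefix stays byte-identical across requests so the provider's cache can reuse it; only the final user message changes.

```python
# Cache-friendly message ordering: stable content first, variable content last.

STABLE_PREFIX = [
    {"role": "system", "content": "You are a contracts analyst. Rules: ..."},
    {"role": "user", "content": "Example clause: ...\nExample analysis: ..."},
    {"role": "user", "content": "Reference document:\n..."},
]

def build_messages(user_query: str) -> list[dict]:
    # The only varying part goes at the end, after the cacheable prefix.
    return STABLE_PREFIX + [{"role": "user", "content": user_query}]
```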
6. Stack discounts
Combine prompt caching + batch API + model routing for maximum savings. Example with Anthropic:
- Base Opus 4.6: $5.00/M input
- With caching: $0.50/M input (90% off)
- With batch: $0.25/M input (additional 50% off)
- Effective cost: 95% cheaper than list price
At $0.25/M input, Opus 4.6 becomes cheaper than many providers' mid-tier models at list price. The premium models aren't as expensive as they look once you apply all available discounts.
How Prices Have Changed: 2024 to 2026
The speed of price drops over the past two years has been remarkable. GPT-4 Turbo launched in late 2023 at $10/$30 per million tokens. Less than three years later, GPT-5.4 delivers substantially better performance at $2.50/$10. That's a 75% drop in per-token cost paired with a significant leap in capability — the kind of deflation that reshapes entire industries.
| Period | OpenAI Flagship | Anthropic Flagship | Google Flagship |
|---|---|---|---|
| Early 2024 | GPT-4 Turbo: $10/$30 | Claude 3 Opus: $15/$75 | Gemini 1.5 Pro: $7/$21 |
| Late 2024 | GPT-4o: $2.50/$10 | Claude 3.5 Sonnet: $3/$15 | Gemini 1.5 Flash: $0.075/$0.30 |
| Early 2025 | o3-mini: $1.10/$4.40 | Claude 3.5 Haiku: $1/$5 | Gemini 2.0 Flash: $0.10/$0.40 |
| April 2026 | GPT-5.4: $2.50/$10 | Claude Opus 4.6: $5/$25 | Gemini 2.5 Pro: $1.25/$10 |
The pattern is clear: each generation delivers more capability at the same or lower price. Anthropic's Opus went from $15/$75 to $5/$25 — a 67% drop — while also expanding from 200K to 1M context. Google pushed Flash-tier pricing below $0.10/M, creating a viable free-to-cheap tier that didn't exist two years ago. The mid-tier sweet spot has shifted from $3-5/M to $0.25-1.50/M, opening up use cases that weren't economically viable even 18 months ago.
Competitive pressure from DeepSeek and open-source models like Llama has been a major accelerant. When a capable model appears at $0.28/M, it forces every other provider to justify their premium with measurable quality advantages. The open-source ecosystem has also driven down hosted inference prices — providers like Together and Fireworks compete aggressively on Llama hosting, pushing margins toward zero on commodity models. We explored this dynamic in depth in our analysis of how inference costs are collapsing.
For a broader look at the companies driving these price wars — from Anthropic's $30B run-rate to Google's TPU investments — see our coverage of the AI capital rush and Anthropic's valuation.
Frequently Asked Questions
What is the cheapest LLM API in 2026?
Gemini 2.0 Flash-Lite at $0.075/$0.30 per million tokens is the cheapest mainstream option from a major provider. Mistral Nemo costs just $0.02/M tokens for both input and output, making it the absolute cheapest named model — though its capabilities are limited to simple tasks. For a dedicated comparison of budget options, see our cheapest LLM API guide.
Which LLM has the best price-to-performance ratio?
DeepSeek V3.2 offers the strongest value at $0.28/$0.42 per million tokens. It unifies chat and reasoning into one model at one price, meaning you get chain-of-thought capability without the 2-4x markup that other providers charge for reasoning models.
How much does GPT-5.4 cost?
GPT-5.4 costs $2.50/$10.00 per million input/output tokens. Cached input drops to $0.25/M. It slots between GPT-5 ($1.25/$10) and GPT-5.2 ($1.75/$14) in capability, with better reasoning than GPT-5 at a modest premium.
Is Claude cheaper than GPT?
It depends on the tier. Claude Haiku 4.5 at $1/$5 is more expensive than GPT-5 Nano ($0.05/$0.40) or GPT-5 Mini ($0.25/$2.00). Claude Sonnet 4.6 at $3/$15 costs more than GPT-5.2 ($1.75/$14). Claude Opus 4.6 at $5/$25 is dramatically cheaper than GPT-5.2 Pro ($21/$168). At the premium tier, Claude offers better value; at the budget tier, OpenAI wins on price.
Does Google offer free LLM API access?
Yes. Google offers free input/output tokens on most Gemini models (2.5 Flash, Flash-Lite, 2.0 Flash, etc.) with roughly 1,000 requests per day. This is the most generous free tier of any major provider and is sufficient for prototyping, hobby projects, and low-traffic applications.
How can I save money on LLM API costs?
The three highest-impact techniques: (1) Prompt caching saves 75-90% on repeated context like system prompts and few-shot examples. (2) Batch API saves 50% for non-real-time workloads. (3) Model routing — sending simple queries to cheap models and hard queries to expensive models — typically saves 60-80%. Stack all three for up to 95% total savings.
Should I self-host an open-source model instead?
Self-hosting makes sense if you process enough volume to justify dedicated GPU infrastructure, need full data control, or have specialized fine-tuning requirements. For most teams, API pricing is cheaper than self-hosting until you hit roughly 10-50M tokens per day, depending on the model size and your hardware costs. Below that threshold, APIs are simpler and more cost-effective.
Which model has the largest context window?
Gemini 2.5 Pro and Grok 4 share the lead at 2M tokens. Claude Opus 4.6 and GPT-4.1 offer 1M token context. Most other flagship models (GPT-5.4, Claude Sonnet 4.6, o3) support 200K tokens. DeepSeek V3.2 and Mistral models cap at 128K. Choose based on your actual needs — most applications work fine with 128K context, and paying for 2M tokens when you only use 10K is wasteful.
Cost Calculator Resources
- PricePerToken — Compare 300+ models side by side
- LLM Pricing — 72+ models with filtering
- CostGoat — Calculator with usage projections
The Bottom Line
LLM API prices dropped roughly 80% across the board from 2024 to 2026. The gap between "cheap" and "premium" now exceeds 1,000x (Mistral Nemo at $0.02/M vs o1-pro at $375/M blended). For most production applications, models in the $0.10-$3.00/M range get the job done. Save the expensive models for tasks where quality directly impacts revenue, safety, or compliance.
The biggest shift this year is that premium models are becoming accessible. Opus 4.6 with 1M context at $5/$25 — after caching and batch discounts — costs less than mid-tier models did a year ago. GPT-5.4 brings flagship reasoning to $2.50/$10. These aren't budget compromises; they're the best models available, at prices that work for production.
Combine prompt caching, batch processing, and model routing to squeeze another 90%+ out of your bill. The cost of intelligence is falling fast — the real expense is choosing the wrong model for the job.
More AI Resources
- AI Trends 2026 — What's next in AI
- Best AI Coding Tools — Dev tools comparison
- AI Companies Landscape — Provider directory
- AI Agent Pricing — Agent costs breakdown
Prices sourced from official provider websites as of April 2026. LLM pricing changes fast — verify current rates at OpenAI, Anthropic, Google, DeepSeek, xAI, and Mistral before committing.