LLM API Pricing Reference: Quick Comparison & Best-Use Guide
Evergreen developer API pricing table with SWE-bench scores, Capy-Score cost/quality index, best-for notes, and links to StackCapybara model reviews.
This is StackCapybara’s evergreen API pricing reference — expanded with benchmark columns, our Capy-Score cost/quality index, and per-model profiles. Rates are per 1 million tokens on direct developer APIs.
For long-form routing recipes see LLM API Pricing vs. Quality. For consumer plans see subscription pricing.
Pricing verified as of 2026-06-14. Re-check vendor docs before production contracts. Benchmarks cite our flagship research sheet where noted.
What is Capy-Score?
Capy-Score (0–99) is StackCapybara’s cost/quality index for API models. Higher means more verified capability per dollar for a typical agent workload (we assume ~30% input / 70% output tokens).
- Quality input: published SWE-bench Pro (45%), SWE-bench Verified (20%), GPQA Diamond (20%), Terminal-Bench (15%) — reweighted when a model lacks verified rows.
- Cost input: blended $/M from listed input + output API rates.
- Not a ranking guarantee — a high Capy-Score on Flash tiers reflects value for volume tasks, not repo-scale SWE. Route hard jobs to Opus or GPT-5.5 Pro regardless of score.
- Models without enough verified benchmarks show "—" in place of a score until we publish a dedicated review.
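The two inputs described above can be sketched in Python. This is an illustration under stated assumptions, not the published Capy-Score formula: the page does not say how quality and cost are combined into the final 0–99 index, so the sketch stops at the two inputs. The function names are hypothetical, and "reweighted" is assumed to mean renormalizing the weights of whichever benchmarks are available.

```python
# Sketch of the two Capy-Score inputs. How they combine into the final
# 0-99 index is not specified on this page, so that step is omitted.

# Benchmark weights as listed above.
QUALITY_WEIGHTS = {
    "swe_pro": 0.45,
    "swe_verified": 0.20,
    "gpqa": 0.20,
    "terminal_bench": 0.15,
}


def blended_cost_per_million(input_rate, output_rate, input_share=0.30):
    """Blend list rates ($ per 1M tokens) at ~30% input / 70% output."""
    return input_share * input_rate + (1.0 - input_share) * output_rate


def quality_composite(scores):
    """Weighted mean of the available benchmark scores (0-100 scale).

    Assumption: missing benchmarks (None) are dropped and the remaining
    weights are renormalized to sum to 1. Returns None if nothing is verified.
    """
    available = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(QUALITY_WEIGHTS[name] for name in available)
    if total_weight == 0:
        return None
    weighted = sum(QUALITY_WEIGHTS[name] * s for name, s in available.items())
    return weighted / total_weight


# Claude Opus 4.8 row: $5 / $25, all four benchmarks present.
opus_cost = blended_cost_per_million(5.0, 25.0)  # about 19.0 $/M
opus_quality = quality_composite(
    {"swe_pro": 69.2, "swe_verified": 88.6, "gpqa": 93.6, "terminal_bench": 74.6}
)

# GPT-5.5 row: SWE Pro missing, so the other three weights are renormalized.
gpt_quality = quality_composite(
    {"swe_pro": None, "swe_verified": 85.7, "gpqa": 93.6, "terminal_bench": 82.7}
)
```

Under the renormalization assumption, a model missing its SWE Pro row is scored only on the benchmarks it has, which is one reason composites for partially benchmarked models should be compared with care.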
API pricing & benchmarks
| Model (API) | In / Out per 1M | SWE Pro | SWE Verified | GPQA | Terminal-Bench | Capy-Score |
|---|---|---|---|---|---|---|
| DeepSeek-V4 Flash · deepseek-v4-flash | $0.14 / $0.28 · Up to ~90% off cache hits | — | — | — | — | — |
| DeepSeek-V4 Pro · deepseek-v4-pro | $1.74 / $3.48 · Up to ~90% off cache hits | — | — | — | — | — |
| Kimi-K2.7-Code · kimi-k2.7-code | $0.95 / $4 · Preserved Thinking (no extra replay cost) | — | — | — | — | — |
| Grok 4.3 · grok-4.3 | $1.25 / $2.5 · Pay-as-you-go | — | — | — | — | — |
| Gemini 3.5 Flash · gemini-3.5-flash | $1.5 / $9 · $0.15 / 1M input (90% off) | 55.1% | — | 92.2% | 76.2% | 45 |
| Claude 4.5/4.6 Sonnet · claude-sonnet-4-6 | $3 / $15 · $0.30 / 1M input (90% off reads) | — | — | 93% | — | 36 |
| Claude Opus 4.8 · claude-opus-4-8 | $5 / $25 · $0.50 / 1M input (90% off reads) | 69.2% | 88.6% | 93.6% | 74.6% | 18 |
| GPT-5.5 (Standard) · gpt-5.5 | $5 / $30 · $0.50 / 1M input (90% off) | — | 85.7% | 93.6% | 82.7% | 17 |
| GPT-5.5 Pro (Reasoning) · gpt-5.5-pro | $30 / $180 · $3.00 / 1M input (90% off) | — | 85.7% | 93.6% | 82.7% | 5 |
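To turn the per-1M rates and discount notes into a workload estimate, a small helper like the following may be useful. It is a minimal sketch: `estimate_cost_usd` and its parameters are hypothetical names, the cached share of input is something you would measure yourself, and the sketch ignores any cache-write fees or minimums a vendor may charge.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_rate, output_rate,
                      cached_fraction=0.0, cache_discount=0.90):
    """Estimate spend in USD from list rates given in $ per 1M tokens.

    cached_fraction: share of input tokens billed at the discounted rate
        (an assumption about your workload, not a vendor guarantee).
    cache_discount: 0.90 mirrors the "90% off" notes in the table.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    discounted_rate = input_rate * (1.0 - cache_discount)
    input_cost = (uncached * input_rate + cached * discounted_rate) / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost + output_cost


# Claude Opus 4.8 list rates: $5 input / $25 output per 1M tokens.
no_cache = estimate_cost_usd(10_000_000, 2_000_000, 5.0, 25.0)
half_cached = estimate_cost_usd(10_000_000, 2_000_000, 5.0, 25.0,
                                cached_fraction=0.5)
```

With these inputs, 10M input and 2M output tokens come to about $100 uncached and about $77.50 when half the input is billed at the discounted rate. Output tokens are unaffected by the discount, so output-heavy workloads see a smaller saving.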
Per-model profiles
Click a model name in the table to jump to its profile. Dedicated single-model reviews are linked when published.
DeepSeek-V4 Flash
The default volume tier when you need millions of cheap API calls. Pair with a router that escalates hard tasks to Pro or a flagship model.
- API ID: deepseek-v4-flash
- Pricing: $0.14 input / $0.28 output per 1M · Up to ~90% off cache hits
- Context: 128K in / 8K out
- Best for: High-volume extraction, classification, and cheap always-on agent loops under 128K context.
- Benchmarks: No public SWE-bench Pro / GPQA sheet in our verified research set — treat as budget tier; Capy-Score withheld.
- Capy-Score: — (insufficient verified benchmarks)
- Primary review: DeepSeek-V4 API review
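The router pairing suggested for this tier could look roughly like the sketch below. Everything structural here is an assumption for illustration: the three-rung ladder, the `difficulty` label (which your own classifier or heuristics would have to supply), and reading "128K" as 128,000 tokens. Only the API IDs and context sizes come from this page.

```python
# Escalation ladder, cheapest first: (API ID, max input tokens).
# Assumption: "128K" is read as 128_000 tokens and "1M" as 1_000_000.
LADDER = [
    ("deepseek-v4-flash", 128_000),
    ("deepseek-v4-pro", 128_000),
    ("claude-opus-4-8", 1_000_000),
]

# Lowest rung allowed for each (hypothetical) difficulty label.
MIN_RUNG = {"easy": 0, "medium": 1, "hard": 2}


def route(prompt_tokens, difficulty="easy"):
    """Return the cheapest allowed model whose input context fits the prompt."""
    start = MIN_RUNG[difficulty]
    for api_id, max_input_tokens in LADDER[start:]:
        if prompt_tokens <= max_input_tokens:
            return api_id
    raise ValueError("prompt exceeds every tier's input context")
```

For example, `route(2_000)` stays on Flash, `route(2_000, "medium")` escalates to Pro, and `route(200_000)` skips both 128K tiers because the prompt does not fit. A production router would likely also retry on a higher rung when a cheap model's output fails validation.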
DeepSeek-V4 Pro
Step up from Flash when tool use and syntax quality matter but you are not ready to pay flagship per-token rates.
- API ID: deepseek-v4-pro
- Pricing: $1.74 input / $3.48 output per 1M · Up to ~90% off cache hits
- Context: 128K in / 16K out
- Best for: Stateless coding automation, tool-call chains, and one-shot CI review passes at ~60% lower cost than Sonnet-class APIs.
- Benchmarks: Vendor claims strong MoE coding; SWE-bench Pro not in our verified sheet yet.
- Capy-Score: — (insufficient verified benchmarks)
- Primary review: DeepSeek-V4 API review (Pro tier)
Kimi-K2.7-Code
Optimized for agent loops, not single-shot answers. Best when sessions run 10+ tool rounds on the same repo.
- API ID: kimi-k2.7-code
- Pricing: $0.95 input / $4 output per 1M · Preserved Thinking (no extra replay cost)
- Context: 256K in / 16K out
- Best for: Long multi-turn terminal/coding sessions where preserved chain-of-thought cuts output-token waste.
- Benchmarks: Moonshot publishes loop-efficiency claims; we will add a SWE-bench row once the vendor sheet is verified.
- Capy-Score: — (insufficient verified benchmarks)
- Primary review: Kimi Code & Kimi-K2.7-Code review
Grok 4.3
A useful alternative when you want API diversity outside OpenAI / Google / Anthropic.
- API ID: grok-4.3
- Pricing: $1.25 input / $2.5 output per 1M · Pay-as-you-go
- Context: 128K in / 8K out
- Best for: Mid-cost general API workloads and X-ecosystem integrations where real-time social context matters.
- Benchmarks: Benchmark sheet not yet verified in StackCapybara research bundle.
- Capy-Score: — (insufficient verified benchmarks)
- Primary review: Grok 4.3 API review
Gemini 3.5 Flash
Not a peer flagship — a fast, cheap router tier. Excellent GPQA for the price; SWE Pro mid-tier.
- API ID: gemini-3.5-flash
- Pricing: $1.5 input / $9 output per 1M · $0.15 / 1M input (90% off)
- Context: 1.04M in / 64K out
- Best for: Speed-tier default: huge context ingestion, multimodal parsing, and search-grounded verification.
- Benchmarks: SWE-bench Verified unverified in our sheet; Pro and GPQA cited in flagship research.
- Capy-Score: 45 / 99
- Primary review: Gemini 3.5 Flash API review
Claude 4.5/4.6 Sonnet
The pragmatic Anthropic tier for interactive coding. Terminal workflow review: Claude Code.
- API ID: claude-sonnet-4-6
- Pricing: $3 input / $15 output per 1M · $0.30 / 1M input (90% off reads)
- Context: 1M in / 16K out
- Best for: Fast IDE-style edits, conversational refactors, and daily builder workflows with strong coding accuracy.
- Benchmarks: GPQA estimated from Sonnet-class vendor range; use Opus row for verified SWE Pro ceiling.
- Capy-Score: 36 / 99
- Primary review: Claude Sonnet API review
Claude Opus 4.8
Holds the highest SWE-bench Pro score (69.2%) in our verified set. Pay for accuracy when a bad diff is expensive.
- API ID: claude-opus-4-8
- Pricing: $5 input / $25 output per 1M · $0.50 / 1M input (90% off reads)
- Context: 1M in / 128K out
- Best for: Repo-scale SWE, multi-file architecture refactors, and highest-stakes code reasoning.
- Benchmarks: Verified in StackCapybara flagship research (May 2026).
- Capy-Score: 18 / 99
- Primary review: Claude Opus 4.8 API review
GPT-5.5 (Standard)
Strong Terminal-Bench and GPQA scores. Escalate to Pro only for the hardest shell/OS tasks.
- API ID: gpt-5.5
- Pricing: $5 input / $30 output per 1M · $0.50 / 1M input (90% off)
- Context: 1M in / 128K out
- Best for: Balanced general API default, math-capable agents, and mixed tool-calling pipelines.
- Benchmarks: SWE Verified reported as 82.6–88.7% in research (the table shows the 85.7% midpoint); SWE Pro unverified.
- Capy-Score: 17 / 99
- Primary review: GPT-5.5 Standard API review
GPT-5.5 Pro (Reasoning)
Reserve for migrations, infra repair, and OSWorld-class tasks — not everyday chat.
- API ID: gpt-5.5-pro
- Pricing: $30 input / $180 output per 1M · $3.00 / 1M input (90% off)
- Context: 1M in / 128K out
- Best for: DevOps automation, terminal execution guards, and desktop/OS agents where failure cost dominates token cost.
- Benchmarks: Quality similar to Standard on published suites; pricing is the differentiator.
- Capy-Score: 5 / 99
- Primary review: GPT-5.5 Pro API review
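One way to make "failure cost dominates token cost" concrete is a break-even calculation. The sketch below assumes a simple model, expected cost = token cost + failure rate × cost of one failure, and the failure rates in the example are purely hypothetical placeholders, not measured values. The function names are illustrative.

```python
def task_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Token cost in USD for one task; rates are $ per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def break_even_failure_cost(cheap_cost, premium_cost,
                            cheap_fail_rate, premium_fail_rate):
    """Cost of one failure (USD) above which the premium model is cheaper
    in expectation, under: expected = token cost + fail rate * failure cost.
    """
    if premium_fail_rate >= cheap_fail_rate:
        # No reliability gain, so the premium tier never pays off here.
        return float("inf")
    return (premium_cost - cheap_cost) / (cheap_fail_rate - premium_fail_rate)


# One task of 100K input / 20K output tokens at list rates from this page.
standard = task_cost(100_000, 20_000, 5.0, 30.0)    # GPT-5.5 (Standard)
pro = task_cost(100_000, 20_000, 30.0, 180.0)       # GPT-5.5 Pro

# Hypothetical failure rates: 10% on Standard, 5% on Pro.
threshold = break_even_failure_cost(standard, pro, 0.10, 0.05)
```

With those placeholder rates, the task costs about $1.10 on Standard and $6.60 on Pro, so Pro only wins once a single failure costs more than roughly $110. If the two tiers fail equally often on your workload, the function returns infinity, which matches the note above that pricing is the main differentiator on published suites.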
Review coverage
Dedicated API reviews (linked from the table above):
- DeepSeek-V4 Flash & Pro
- Kimi-K2.7-Code
- Grok 4.3
- Gemini 3.5 Flash
- Claude Sonnet
- Claude Opus 4.8
- GPT-5.5 Standard
- GPT-5.5 Pro
Related guides
- LLM API Pricing vs. Quality — task routing & agent-stack recipes
- GPT-5.5 vs Gemini Flash vs Claude Opus — flagship benchmark deep dive
- Best AI coding agents 2026
- Subscription pricing — Plus / Pro / Advanced plans
Official pricing sources: DeepSeek · Moonshot/Kimi · xAI Grok · Google Gemini · Anthropic · OpenAI