LLM API Pricing Reference: Quick Comparison & Best-Use Guide

This is StackCapybara’s evergreen API pricing reference — expanded with benchmark columns, our Capy-Score cost/quality index, and per-model profiles. Rates are per 1 million tokens on direct developer APIs.

For long-form routing recipes see LLM API Pricing vs. Quality. For consumer plans see subscription pricing.

Pricing verified as of 2026-06-14. Re-check vendor docs before production contracts. Benchmarks cite our flagship research sheet where noted.

What is Capy-Score?

Capy-Score (0–99) is StackCapybara’s cost/quality index for API models. Higher means more verified capability per dollar for a typical agent workload (we assume ~30% input / 70% output tokens).

  • Quality input: published SWE-bench Pro (45%), SWE-bench Verified (20%), GPQA Diamond (20%), Terminal-Bench (15%) — reweighted when a model lacks verified rows.
  • Cost input: blended $/M from listed input + output API rates.
  • Not a ranking guarantee — a high Capy-Score on Flash tiers reflects value for volume tasks, not repo-scale SWE. Route hard jobs to Opus or GPT-5.5 Pro regardless of score.
  • Models without enough verified benchmarks show no Capy-Score until we publish a dedicated review.
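
The two inputs above can be sketched in a few lines of Python. The weights and the 30/70 token mix come from the definition; the reweighting rule (drop missing benchmarks and rescale the remaining weights to sum to 1) is an assumption, since the exact method is not stated here, and the step that combines quality and cost into the final 0–99 score is not published on this page, so it is left out. The function names are illustrative.

```python
# Weights and token mix are taken from the Capy-Score definition above.
QUALITY_WEIGHTS = {
    "swe_pro": 0.45,
    "swe_verified": 0.20,
    "gpqa": 0.20,
    "terminal_bench": 0.15,
}
INPUT_SHARE, OUTPUT_SHARE = 0.30, 0.70


def blended_cost(input_rate: float, output_rate: float) -> float:
    """Blended $ per 1M tokens at the assumed 30/70 input/output mix."""
    return INPUT_SHARE * input_rate + OUTPUT_SHARE * output_rate


def quality_composite(scores: dict[str, float]) -> float:
    """Weighted mean of the available benchmark scores, in percent.

    ASSUMPTION: benchmarks without a verified score are dropped and the
    remaining weights are rescaled proportionally.
    """
    available = {k: w for k, w in QUALITY_WEIGHTS.items() if k in scores}
    if not available:
        raise ValueError("no verified benchmarks available")
    total_weight = sum(available.values())
    return sum(scores[k] * w for k, w in available.items()) / total_weight


# Claude Opus 4.8 row: $5 / $25 per 1M, all four benchmarks verified.
opus_cost = blended_cost(5.0, 25.0)
opus_quality = quality_composite(
    {"swe_pro": 69.2, "swe_verified": 88.6, "gpqa": 93.6, "terminal_bench": 74.6}
)
print(f"{opus_cost:.2f} {opus_quality:.2f}")  # 19.00 78.77
```

Under these assumptions Opus blends to $19.00 per 1M tokens with a quality composite of 78.77, which is consistent with a strong benchmark row still earning a low cost/quality score.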

API pricing & benchmarks

Benchmark scores are listed in column order (SWE Pro, SWE Verified, GPQA, Terminal-Bench). A row with fewer than four scores has no verified result for the remaining benchmarks.

| Model | API ID | In / Out per 1M | Cached input / billing note | Benchmark scores | Capy-Score |
|---|---|---|---|---|---|
| DeepSeek-V4 Flash | deepseek-v4-flash | $0.14 / $0.28 | Up to ~90% off cache hits | n/a | n/a |
| DeepSeek-V4 Pro | deepseek-v4-pro | $1.74 / $3.48 | Up to ~90% off cache hits | n/a | n/a |
| Kimi-K2.7-Code | kimi-k2.7-code | $0.95 / $4 | Preserved Thinking (no extra replay cost) | n/a | n/a |
| Grok 4.3 | grok-4.3 | $1.25 / $2.50 | Pay-as-you-go | n/a | n/a |
| Gemini 3.5 Flash | gemini-3.5-flash | $1.50 / $9 | $0.15 / 1M input (90% off) | 55.1%, 92.2%, 76.2% | 45 |
| Claude 4.5/4.6 Sonnet | claude-sonnet-4-6 | $3 / $15 | $0.30 / 1M input (90% off reads) | 93% | 36 |
| Claude Opus 4.8 | claude-opus-4-8 | $5 / $25 | $0.50 / 1M input (90% off reads) | 69.2%, 88.6%, 93.6%, 74.6% | 18 |
| GPT-5.5 (Standard) | gpt-5.5 | $5 / $30 | $0.50 / 1M input (90% off) | 85.7%, 93.6%, 82.7% | 17 |
| GPT-5.5 Pro (Reasoning) | gpt-5.5-pro | $30 / $180 | $3.00 / 1M input (90% off) | 85.7%, 93.6%, 82.7% | 5 |
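
To see what the discounted input rates mean for a single request, here is a minimal cost sketch. It reads the "90% off" figures as the price of input tokens served from cache, and it ignores cache-write surcharges, minimum cacheable lengths, and any other vendor-specific billing rules, so treat it as rough arithmetic rather than an invoice. The function and its parameters are hypothetical.

```python
def request_cost(
    input_tokens: int,
    cached_tokens: int,
    output_tokens: int,
    input_rate: float,
    cached_rate: float,
    output_rate: float,
) -> float:
    """Estimated cost in dollars for one request; rates are $ per 1M tokens.

    ASSUMPTION: cached tokens are a subset of the input tokens and are billed
    at `cached_rate`; everything else is billed at the list rates.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * input_rate
        + cached_tokens * cached_rate
        + output_tokens * output_rate
    )
    return total / 1_000_000


# claude-sonnet-4-6 at $3 / $15 with $0.30 cached input:
# 200K input tokens (150K of them cached) and 20K output tokens.
with_cache = request_cost(200_000, 150_000, 20_000, 3.00, 0.30, 15.00)
no_cache = request_cost(200_000, 0, 20_000, 3.00, 0.30, 15.00)
print(f"{with_cache:.3f} vs {no_cache:.3f}")  # 0.495 vs 0.900
```

In this example caching nearly halves the request cost, but output tokens still account for $0.30 of either total, so the discount matters most for input-heavy workloads.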

Per-model profiles

Click a model name in the table to jump to its profile. Dedicated single-model reviews are linked when published.

DeepSeek-V4 Flash

The default volume tier when you need millions of cheap API calls. Pair with a router that escalates hard tasks to Pro or a flagship model.

  • API ID: deepseek-v4-flash
  • Pricing: $0.14 input / $0.28 output per 1M · Up to ~90% off cache hits
  • Context: 128K in / 8K out
  • Best for: High-volume extraction, classification, and cheap always-on agent loops under 128K context.
  • Benchmarks: No public SWE-bench Pro / GPQA sheet in our verified research set — treat as budget tier; Capy-Score withheld.
  • Capy-Score: — (insufficient verified benchmarks)
  • Primary review: DeepSeek-V4 API review
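
The escalation pattern described for this tier can be sketched as a small routing function. The difficulty labels, the function name, and the choice of claude-opus-4-8 as the flagship target are assumptions for illustration; how a task gets its label is up to the caller. The 128K input limit comes from the DeepSeek profiles, and no limit is checked for the flagship because its context window is not listed on this page.

```python
FLASH = "deepseek-v4-flash"
PRO = "deepseek-v4-pro"
FLAGSHIP = "claude-opus-4-8"  # ASSUMPTION: one possible flagship choice

DEEPSEEK_MAX_INPUT_TOKENS = 128_000  # "128K in" for both DeepSeek tiers


def pick_model(difficulty: str, input_tokens: int) -> str:
    """Return an API model ID for a task.

    `difficulty` is a hypothetical caller-supplied label:
    "routine", "moderate", or "hard".
    """
    if difficulty == "hard":
        # Hard jobs go to the flagship regardless of cost.
        return FLAGSHIP
    if difficulty not in ("routine", "moderate"):
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    if input_tokens > DEEPSEEK_MAX_INPUT_TOKENS:
        raise ValueError("input exceeds the 128K limit of the DeepSeek tiers")
    return FLASH if difficulty == "routine" else PRO


print(pick_model("routine", 12_000))   # deepseek-v4-flash
print(pick_model("moderate", 90_000))  # deepseek-v4-pro
print(pick_model("hard", 90_000))      # claude-opus-4-8
```

Raising on oversized inputs instead of silently escalating keeps expensive flagship calls an explicit decision by the caller.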

DeepSeek-V4 Pro

Step up from Flash when tool use and syntax quality matter but you are not ready to pay flagship per-token rates.

  • API ID: deepseek-v4-pro
  • Pricing: $1.74 input / $3.48 output per 1M · Up to ~90% off cache hits
  • Context: 128K in / 16K out
  • Best for: Stateless coding automation, tool-call chains, and one-shot CI review passes at roughly 74% lower blended cost than claude-sonnet-4-6 at the listed rates ($2.96 vs. $11.40 per 1M at a 30/70 input/output mix).
  • Benchmarks: Vendor claims strong MoE coding; SWE-bench Pro not in our verified sheet yet.
  • Capy-Score: — (insufficient verified benchmarks)
  • Primary review: DeepSeek-V4 API review (Pro tier)

Kimi-K2.7-Code

Optimized for agent loops, not single-shot answers. Best when sessions run 10+ tool rounds on the same repo.

  • API ID: kimi-k2.7-code
  • Pricing: $0.95 input / $4 output per 1M · Preserved Thinking (no extra replay cost)
  • Context: 256K in / 16K out
  • Best for: Long multi-turn terminal/coding sessions where preserved chain-of-thought cuts output-token waste.
  • Benchmarks: Moonshot publishes loop-efficiency claims; add SWE-bench row when vendor sheet is verified.
  • Capy-Score: — (insufficient verified benchmarks)
  • Primary review: Kimi Code & Kimi-K2.7-Code review

Grok 4.3

Useful third-rail option when you want API diversity outside OpenAI / Google / Anthropic.

  • API ID: grok-4.3
  • Pricing: $1.25 input / $2.5 output per 1M · Pay-as-you-go
  • Context: 128K in / 8K out
  • Best for: Mid-cost general API workloads and X-ecosystem integrations where real-time social context matters.
  • Benchmarks: Benchmark sheet not yet verified in StackCapybara research bundle.
  • Capy-Score: — (insufficient verified benchmarks)
  • Primary review: Grok 4.3 API review

Gemini 3.5 Flash

Not a peer flagship — a fast, cheap router tier. Excellent GPQA for the price; SWE Pro mid-tier.

Claude 4.5/4.6 Sonnet

The pragmatic Anthropic tier for interactive coding. Terminal workflow review: Claude Code.

Claude Opus 4.8

Record SWE-bench Pro in our verified set. Pay for accuracy when a bad diff is expensive.

GPT-5.5 (Standard)

Strong Terminal-Bench and GPQA scores. Escalate to Pro only for the hardest shell/OS tasks.

GPT-5.5 Pro (Reasoning)

Reserve for migrations, infra repair, and OSWorld-class tasks — not everyday chat.

Official pricing sources: DeepSeek · Moonshot/Kimi · xAI Grok · Google Gemini · Anthropic · OpenAI