LLM API Pricing Reference: Quick Comparison & Best-Use Guide

This is StackCapybara’s evergreen API pricing reference — expanded with benchmark columns, our Capy-Score cost/quality index, and per-model profiles. Rates are per 1 million tokens on direct developer APIs.

For long-form routing recipes see LLM API Pricing vs. Quality. For consumer plans see subscription pricing.

Pricing verified as of 2026-06-14. Re-check vendor docs before production contracts. Benchmarks cite our flagship research sheet where noted.

What is Capy-Score?

Capy-Score (0–99) is StackCapybara’s cost/quality index for API models. Higher means more verified capability per dollar for a typical agent workload (we assume ~30% input / 70% output tokens).

Quality input: published SWE-bench Pro (45%), SWE-bench Verified (20%), GPQA Diamond (20%), Terminal-Bench (15%) — reweighted when a model lacks verified rows.
Cost input: blended $/M from listed input + output API rates.
Not a ranking guarantee — a high Capy-Score on Flash tiers reflects value for volume tasks, not repo-scale SWE. Route hard jobs to Opus or GPT-5.5 Pro regardless of score.
Models without enough verified benchmarks show "—" in the Capy-Score column until we publish a dedicated review.
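
The two inputs described above can be sketched in a few lines. This is a minimal illustration, not StackCapybara's scoring code: the function names are hypothetical, the reweighting is assumed to be a proportional renormalization over the benchmarks a model actually has, and the final mapping from these inputs to the 0–99 score is not specified on this page, so it is not reproduced here.

```python
from typing import Mapping, Optional

# Benchmark weights as listed above (percent of the quality input).
WEIGHTS = {"swe_pro": 45, "swe_verified": 20, "gpqa": 20, "terminal_bench": 15}


def blended_cost(input_rate: float, output_rate: float,
                 input_share: float = 0.30) -> float:
    """Blended $/1M tokens under the assumed 30% input / 70% output mix."""
    return input_share * input_rate + (1.0 - input_share) * output_rate


def weighted_quality(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted benchmark average over the benchmarks that have a score.

    Assumption: missing rows are handled by renormalizing the remaining
    weights so they sum to 100%. Returns None when no benchmark is present.
    """
    present = {name: value for name, value in scores.items() if value is not None}
    total_weight = sum(WEIGHTS[name] for name in present)
    if total_weight == 0:
        return None
    return sum(WEIGHTS[name] * value for name, value in present.items()) / total_weight


# Rows taken from the table on this page.
opus = {"swe_pro": 69.2, "swe_verified": 88.6, "gpqa": 93.6, "terminal_bench": 74.6}
flash = {"swe_pro": 55.1, "swe_verified": None, "gpqa": 92.2, "terminal_bench": 76.2}

print(f"claude-opus-4-8:  ${blended_cost(5, 25):.2f}/1M, quality {weighted_quality(opus):.2f}")
print(f"gemini-3.5-flash: ${blended_cost(1.5, 9):.2f}/1M, quality {weighted_quality(flash):.2f}")
```

Under these assumptions Opus comes out at roughly $19.00 blended with a quality input near 78.8, and Gemini 3.5 Flash at $6.75 with a quality input near 68.3, which is consistent with Flash scoring higher on a per-dollar index despite lower raw benchmarks.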

API pricing & benchmarks

Model (API)	In / Out per 1M	Cache / billing note	SWE Pro	SWE Verified	GPQA	Terminal-Bench	Capy-Score
DeepSeek-V4 Flash `deepseek-v4-flash`	$0.14 / $0.28	Up to ~90% off cache hits	—	—	—	—	—
DeepSeek-V4 Pro `deepseek-v4-pro`	$1.74 / $3.48	Up to ~90% off cache hits	—	—	—	—	—
Kimi-K2.7-Code `kimi-k2.7-code`	$0.95 / $4	Preserved Thinking (no extra replay cost)	—	—	—	—	—
Grok 4.3 `grok-4.3`	$1.25 / $2.5	Pay-as-you-go	—	—	—	—	—
Gemini 3.5 Flash `gemini-3.5-flash`	$1.5 / $9	$0.15 / 1M input (90% off)	55.1%	—	92.2%	76.2%	45
Claude 4.5/4.6 Sonnet `claude-sonnet-4-6`	$3 / $15	$0.30 / 1M input (90% off reads)	—	—	93%	—	36
Claude Opus 4.8 `claude-opus-4-8`	$5 / $25	$0.50 / 1M input (90% off reads)	69.2%	88.6%	93.6%	74.6%	18
GPT-5.5 (Standard) `gpt-5.5`	$5 / $30	$0.50 / 1M input (90% off)	—	85.7%	93.6%	82.7%	17
GPT-5.5 Pro (Reasoning) `gpt-5.5-pro`	$30 / $180	$3.00 / 1M input (90% off)	—	85.7%	93.6%	82.7%	5
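
To turn the per-1M rates and discounted cached-input rates in the table into a per-request figure, a rough estimator like the one below may help. It is a sketch under simplifying assumptions: `request_cost` and its parameters are hypothetical names, cached tokens are billed at the listed discounted input rate, and vendor-specific details such as cache-write surcharges or minimum cacheable sizes are ignored.

```python
from typing import Optional


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float,
                 cached_rate: Optional[float] = None,
                 cache_hit_share: float = 0.0) -> float:
    """Estimated USD cost of one request; all rates are $ per 1M tokens.

    cache_hit_share is the fraction of input tokens served from cache
    (an assumption you supply; it is not something the API rate card states).
    """
    if not 0.0 <= cache_hit_share <= 1.0:
        raise ValueError("cache_hit_share must be between 0 and 1")
    if cached_rate is None:
        cached_rate = input_rate  # no cache discount available
    cached_tokens = input_tokens * cache_hit_share
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * input_rate
             + cached_tokens * cached_rate
             + output_tokens * output_rate)
    return total / 1_000_000


# claude-opus-4-8 rates from the table: $5 in / $25 out, $0.50 cached input.
no_cache = request_cost(200_000, 4_000, 5.0, 25.0)
with_cache = request_cost(200_000, 4_000, 5.0, 25.0,
                          cached_rate=0.5, cache_hit_share=0.8)
print(f"no cache: ${no_cache:.2f}  80% cache hits: ${with_cache:.2f}")
```

For this example (200K input tokens, 4K output tokens), the estimate drops from about $1.10 to about $0.38 per request when 80% of the input is served from cache, which shows why cache behavior can matter more than the headline input rate on long-context agent loops.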

Per-model profiles

Click a model name in the table to jump to its profile. Dedicated single-model reviews are linked when published.

DeepSeek-V4 Flash

The default volume tier when you need millions of cheap API calls. Pair with a router that escalates hard tasks to Pro or a flagship model.

API ID: deepseek-v4-flash
Pricing: $0.14 input / $0.28 output per 1M · Up to ~90% off cache hits
Context: 128K in / 8K out
Best for: High-volume extraction, classification, and cheap always-on agent loops under 128K context.
Benchmarks: No public SWE-bench Pro / GPQA sheet in our verified research set — treat as budget tier; Capy-Score withheld.
Capy-Score: — (insufficient verified benchmarks)
Primary review: DeepSeek-V4 API review
Also mentioned in:
- LLM API Pricing vs. Quality guide
- This reference page
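
The escalation pattern recommended for this tier (start on Flash, move hard tasks to Pro or a flagship) could look roughly like the sketch below. Everything beyond the API IDs and context limits is an assumption for illustration: the `Tier` type, the `route` function, the integer `difficulty` scale, and reading "K" as 1,000 tokens are all hypothetical, and a production router would normally derive difficulty from a classifier or from failed attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    api_id: str
    max_input: int   # input context limit, tokens
    max_output: int  # output limit, tokens


# Cheapest first. Limits come from the profiles on this page,
# with "K" read as 1,000 tokens (assumption).
LADDER = [
    Tier("deepseek-v4-flash", 128_000, 8_000),
    Tier("deepseek-v4-pro", 128_000, 16_000),
    Tier("claude-opus-4-8", 1_000_000, 128_000),
]


def route(input_tokens: int, output_tokens: int, difficulty: int = 0) -> str:
    """Pick the cheapest tier at or above `difficulty` that fits the request.

    difficulty is a hypothetical index into LADDER: 0 = volume task,
    1 = tool use / syntax-sensitive, 2 = repo-scale or high-stakes.
    """
    if not 0 <= difficulty < len(LADDER):
        raise ValueError("difficulty must index into LADDER")
    for tier in LADDER[difficulty:]:
        if input_tokens <= tier.max_input and output_tokens <= tier.max_output:
            return tier.api_id
    raise ValueError("no tier fits this request")


print(route(50_000, 2_000))                # small, easy request
print(route(50_000, 12_000))               # output exceeds the Flash limit
print(route(300_000, 2_000))               # input exceeds both DeepSeek limits
print(route(10_000, 1_000, difficulty=2))  # hard task goes straight to flagship
```

The ladder is deliberately ordered by price so that a request only escalates when either its difficulty or its size forces it to, which keeps the bulk of traffic on the cheapest tier.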

DeepSeek-V4 Pro

Step up from Flash when tool use and syntax quality matter but you are not ready to pay flagship per-token rates.

API ID: deepseek-v4-pro
Pricing: $1.74 input / $3.48 output per 1M · Up to ~90% off cache hits
Context: 128K in / 16K out
Best for: Stateless coding automation, tool-call chains, and one-shot CI review passes at roughly 74% lower blended cost than Sonnet-class APIs ($2.96 vs. $11.40 per 1M at the 30/70 mix used on this page).
Benchmarks: Vendor claims strong MoE coding; SWE-bench Pro not in our verified sheet yet.
Capy-Score: — (insufficient verified benchmarks)
Primary review: DeepSeek-V4 API review (Pro tier)
Also mentioned in:
- LLM API Pricing vs. Quality guide
- This reference page

Kimi-K2.7-Code

Optimized for agent loops, not single-shot answers. Best when sessions run 10+ tool rounds on the same repo.

API ID: kimi-k2.7-code
Pricing: $0.95 input / $4 output per 1M · Preserved Thinking (no extra replay cost)
Context: 256K in / 16K out
Best for: Long multi-turn terminal/coding sessions where preserved chain-of-thought cuts output-token waste.
Benchmarks: Moonshot publishes loop-efficiency claims; a SWE-bench row will be added once the vendor sheet is verified.
Capy-Score: — (insufficient verified benchmarks)
Primary review: Kimi Code & Kimi-K2.7-Code review

Grok 4.3

Useful alternative when you want API diversity outside OpenAI / Google / Anthropic.

API ID: grok-4.3
Pricing: $1.25 input / $2.5 output per 1M · Pay-as-you-go
Context: 128K in / 8K out
Best for: Mid-cost general API workloads and X-ecosystem integrations where real-time social context matters.
Benchmarks: Benchmark sheet not yet verified in StackCapybara research bundle.
Capy-Score: — (insufficient verified benchmarks)
Primary review: Grok 4.3 API review
Also mentioned in:
- Automation pricing guide
- This reference page

Gemini 3.5 Flash

Not a peer flagship — a fast, cheap router tier. Excellent GPQA for the price; SWE Pro mid-tier.

API ID: gemini-3.5-flash
Pricing: $1.5 input / $9 output per 1M · $0.15 / 1M input (90% off)
Context: 1.04M in / 64K out
Best for: Speed-tier default: huge context ingestion, multimodal parsing, and search-grounded verification.
Benchmarks: No verified SWE-bench Verified score in our sheet; SWE-bench Pro, GPQA, and Terminal-Bench figures are cited from our flagship research.
Capy-Score: 45 / 99
Primary review: Gemini 3.5 Flash API review

Claude 4.5/4.6 Sonnet

The pragmatic Anthropic tier for interactive coding. For the terminal workflow, see our Claude Code review.

API ID: claude-sonnet-4-6
Pricing: $3 input / $15 output per 1M · $0.30 / 1M input (90% off reads)
Context: 1M in / 16K out
Best for: Fast IDE-style edits, conversational refactors, and daily builder workflows with strong coding accuracy.
Benchmarks: GPQA estimated from Sonnet-class vendor range; use Opus row for verified SWE Pro ceiling.
Capy-Score: 36 / 99
Primary review: Claude Sonnet API review

Claude Opus 4.8

Highest SWE-bench Pro score in our verified set (69.2%). Pay for accuracy when a bad diff is expensive.

API ID: claude-opus-4-8
Pricing: $5 input / $25 output per 1M · $0.50 / 1M input (90% off reads)
Context: 1M in / 128K out
Best for: Repo-scale SWE, multi-file architecture refactors, and highest-stakes code reasoning.
Benchmarks: Verified in StackCapybara flagship research (May 2026).
Capy-Score: 18 / 99
Primary review: Claude Opus 4.8 API review

GPT-5.5 (Standard)

Strong Terminal-Bench and GPQA scores. Escalate to Pro only for the hardest shell/OS tasks.

API ID: gpt-5.5
Pricing: $5 input / $30 output per 1M · $0.50 / 1M input (90% off)
Context: 1M in / 128K out
Best for: Balanced general API default, math-capable agents, and mixed tool-calling pipelines.
Benchmarks: SWE-bench Verified reported as 82.6–88.7% in our research (the table uses the 85.7% midpoint); SWE-bench Pro unverified.
Capy-Score: 17 / 99
Primary review: GPT-5.5 Standard API review

GPT-5.5 Pro (Reasoning)

Reserve for migrations, infra repair, and OSWorld-class tasks — not everyday chat.

API ID: gpt-5.5-pro
Pricing: $30 input / $180 output per 1M · $3.00 / 1M input (90% off)
Context: 1M in / 128K out
Best for: DevOps automation, terminal execution guards, and desktop/OS agents where failure cost dominates token cost.
Benchmarks: Quality similar to Standard on published suites; pricing is the differentiator.
Capy-Score: 5 / 99
Primary review: GPT-5.5 Pro API review

Related guides

LLM API Pricing vs. Quality — task routing & agent-stack recipes
GPT-5.5 vs Gemini Flash vs Claude Opus — flagship benchmark deep dive
Best AI coding agents 2026
Subscription pricing — Plus / Pro / Advanced plans

Official pricing sources: DeepSeek · Moonshot/Kimi · xAI Grok · Google Gemini · Anthropic · OpenAI