GPT-5.5 API Review (2026): Pricing, Benchmarks & Developer Guide
Best For: Complex code generation, structured extraction, high-value segmentation logic, and user-facing agents that need sub-2s time-to-first-token.
* Affiliate Disclosure: We may earn a commission at no cost to you.
Bottom Line
GPT-5.5 is OpenAI’s premium standard API tier — 1M context, stronger tool dispatch than 5.4, and top Terminal-Bench scores. Use for quality-critical work; escalate to Pro only when errors are costly.
GPT-5.5 API: Quick Summary
OpenAI’s GPT-5.5 is the company’s flagship multimodal model as of mid-2026, representing a significant step forward from the GPT-4 family. Available under the model ID gpt-5-5 (standard) and gpt-5-5-pro for enhanced reasoning tasks, GPT-5.5 delivers state-of-the-art performance across text, vision, tool use, and structured output generation.
Access GPT-5.5 through:
- OpenAI API — direct REST or SDK access at api.openai.com
- ChatGPT Plus / Pro — consumer-facing, rate-limited chat interface
- Microsoft Azure OpenAI Service — enterprise deployment with data residency and compliance controls
- OpenAI Assistants API v2 — managed infrastructure with Threads, Runs, file search, and code interpreter
For developers building production AI features — agents, RAG pipelines, structured data extraction, multimodal applications — GPT-5.5 is the current benchmark. This review covers everything: pricing, context limits, benchmarks, code examples, and how GPT-5.5 stacks up against GPT-4o, Claude Sonnet 4.6, and o3.
GPT-5.5 Pricing (API)
OpenAI prices usage per token, with rates quoted per 1 million tokens. Here is the full breakdown as of mid-2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| gpt-5-5 | $2.00 | $8.00 | Standard; vision included |
| gpt-5-5 (cached input) | $1.00 | $8.00 | Automatic prompt cache discount |
| gpt-5-5-16k | $3.00 | $12.00 | Extended context / output tasks |
| gpt-5-5 Batch API | $1.00 | $4.00 | 50% discount; async, up to 24hr turnaround |
| gpt-4o | $0.50 | $1.50 | Best-value workhorse |
| o3 | $10.00 | $30.00 | Advanced reasoning; 5x more expensive |
Vision pricing: Images are billed by tile (512×512 px each). A typical 1024×1024 image costs roughly 170 additional tokens. There is no separate vision surcharge — it uses the same token rates as text.
Free tier: New OpenAI API accounts receive $5 in free credits, valid for 90 days. Sufficient for exploratory testing but not sustained development.
Rate limits by tier:
- Tier 1 (new accounts, less than $50 cumulative spend): 500 RPM, 30,000 TPM, 10,000 RPD
- Tier 2 ($50+ spend): 5,000 RPM, 450,000 TPM
- Tier 3 ($500+ spend): 10,000 RPM, 800,000 TPM
- Enterprise: negotiated limits; SLA commitments available
Cost optimization tips: Use the Batch API for any workload that tolerates up to 24-hour latency (analytics, offline enrichment, bulk classification). Use prompt caching by placing your system prompt and repeated context at the very start of every API call — OpenAI automatically applies the 50% cache discount when a prefix matches a cached version.
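To make the trade-off concrete, here is a minimal sketch that compares the cost of one job under the standard, cached-input, and Batch API rates from the pricing table above. The function name `job_cost` is illustrative, and the "cached" mode assumes every input token is a cache hit, which real workloads rarely achieve.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "standard": {"input": 2.00, "output": 8.00},
    "cached": {"input": 1.00, "output": 8.00},  # assumes 100% cache hits
    "batch": {"input": 1.00, "output": 4.00},
}


def job_cost(mode: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a job under one of the pricing modes."""
    rate = PRICES[mode]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Example job: 10M input tokens, 2M output tokens.
    for mode in PRICES:
        print(f"{mode}: ${job_cost(mode, 10_000_000, 2_000_000):.2f}")
```

For that example job the three modes come to $36.00, $26.00, and $18.00 respectively, so the Batch API halves the bill whenever the latency is acceptable.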
Context Window and Limits
GPT-5.5 supports a 128,000-token context window, equivalent to roughly 300 pages of text. This makes it capable of processing:
- Long documents (legal contracts, research papers, financial reports)
- Entire codebases or large single files
- Multi-turn conversation histories spanning dozens of exchanges
- Complex RAG pipelines where many retrieved chunks are injected into a single prompt
Max output tokens:
- Standard (gpt-5-5): 16,384 tokens per response (~12,000 words)
- Extended output (gpt-5-5-pro or with max_tokens override on eligible accounts): 32,768 tokens
Practical note on context quality: Large context windows are only useful if the model actually retrieves information from the far ends reliably. GPT-5.5 performs well across its full context but — like all large-context models — shows slightly degraded retrieval accuracy for facts buried in the middle of very long prompts (the “lost in the middle” effect). Mitigate this by placing the most critical instructions at the start and end of your prompt.
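One way to apply that mitigation is to repeat the key instructions after the long material. The sketch below is only an illustration of the layout; the function name `build_long_prompt` and the "Reminder:" wording are assumptions, not anything OpenAI prescribes.

```python
def build_long_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Place critical instructions at both the start and the end of a long prompt."""
    parts = [
        instructions,                   # start: seen before the long context
        *documents,                     # middle: bulk reference material
        f"Reminder: {instructions}",    # end: restated next to the question
        question,
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_long_prompt(
        "Answer only from the documents provided.",
        ["Document A text...", "Document B text..."],
        "What is the termination clause?",
    )
    print(prompt)
```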
Token counting: Use OpenAI’s tiktoken library to estimate token counts before production deployment. GPT-5.5 uses the same cl100k_base tokenizer as GPT-4. On average, 1 token is approximately 0.75 words in English prose.
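When tiktoken is not available, the 0.75 words-per-token rule of thumb gives a rough pre-flight check. The sketch below uses that heuristic together with the context and output limits quoted in this section; the helper names are illustrative, and the estimate can be well off for code or non-English text.

```python
import math

CONTEXT_WINDOW = 128_000   # tokens, as stated in this section
MAX_OUTPUT = 16_384        # standard max output tokens, as stated in this section
WORDS_PER_TOKEN = 0.75     # rule of thumb for English prose


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(text: str, reserved_output: int = MAX_OUTPUT) -> bool:
    """Check that the prompt plus the reserved output budget fits the window."""
    return estimate_tokens(text) + reserved_output <= CONTEXT_WINDOW


if __name__ == "__main__":
    sample = "word " * 1000
    print(estimate_tokens(sample), fits_in_context(sample))
```

Reserving the output budget up front avoids the case where a prompt fits but leaves no room for the response.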
GPT-5.5 Benchmark Performance
Benchmark results for GPT-5.5, compared to leading alternatives as of mid-2026:
| Benchmark | GPT-5.5 | GPT-4o | Claude Sonnet 4.6 | o3 |
|---|---|---|---|---|
| MMLU (knowledge) | 89.4% | 86.4% | 88.7% | 91.2% |
| HumanEval (coding) | 91.0% | 87.1% | 89.3% | 94.8% |
| MATH-500 | 94.8% | 76.6% | 90.1% | 97.9% |
| GPQA Diamond (science) | 78.0% | 53.6% | 75.4% | 87.7% |
| LMSYS Arena ELO (mid-2026) | Top-5 | Top-15 | Top-5 | Top-3 |
Where GPT-5.5 leads:
- Instruction following: GPT-5.5 reliably adheres to detailed, multi-constraint system prompts — a significant improvement over GPT-4-turbo and competitive with o3 for non-reasoning tasks
- Tool use and function calling: Highest reliability in class for structured function call generation, parallel tool invocation, and schema adherence
- Structured output: Guaranteed valid JSON output mode, schema-constrained responses, and logprob-calibrated confidence
- Coding: Outperforms GPT-4o by about 4 points on HumanEval (91.0% vs 87.1%); excellent at debugging, refactoring, and explaining existing code
Where competitors have an edge:
- Mathematical reasoning (o3): For theorem proving, multi-step numerical problems, and scientific reasoning chains, o3 remains materially better
- Long-context retrieval (Claude Sonnet 4.6): Claude’s architecture retrieves facts more reliably from the middle of very long documents
- Creative writing (Claude Sonnet 4.6): More nuanced voice and stylistic range for creative or narrative tasks
Multimodal Capabilities
GPT-5.5 is natively multimodal for text and images. Additional modalities (audio, structured data) are handled via companion endpoints.
Vision
Pass images as base64-encoded strings or URLs in the messages array. Supported formats: JPEG, PNG, GIF (static), WebP — up to 20MB per image and up to 10 images per request in the standard tier.
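As a rough sketch of the base64 route, the helper below builds a user message with a text part and an image part using the Chat Completions `image_url` content format with a data URL. The function name `image_message` is illustrative, and the size check simply applies the 20MB limit stated above.

```python
import base64

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20MB per-image limit stated above


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message containing a text part and a base64-encoded image part."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 20MB per-image limit")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read them from an image file.
    message = image_message("Describe this chart.", b"\x89PNG placeholder")
    print(message["content"][1]["image_url"]["url"][:40])
```

The returned dict can be placed directly in the `messages` list of a chat completion request.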
Use cases covered well by GPT-5.5 vision:
- Image description and alt-text generation
- Chart and graph data extraction
- Document OCR and form parsing
- Screenshot analysis (UI debugging, QA automation)
- Visual comparison (before/after, product vs. reference)
- Medical image interpretation (informational — not a medical device)
What it cannot do: Native video understanding is not supported in GPT-5.5. For video analysis, use Google Gemini 2.5 Pro (native video frames) or extract key frames yourself before sending to GPT-5.5.
Audio
Audio is handled by companion endpoints, not the main chat completions API:
- Speech-to-text: POST /v1/audio/transcriptions using the Whisper model. Supports MP3, MP4, MPEG, MPGA, M4A, WAV, WebM. Max file: 25MB. Excellent multilingual accuracy.
- Text-to-speech: POST /v1/audio/speech using tts-1 (optimized for speed) or tts-1-hd (higher fidelity). Six voice options. Outputs MP3, Opus, AAC, or FLAC.
- Real-time audio (Realtime API): WebSocket-based streaming audio I/O for conversational voice agents. Low-latency but higher cost — evaluate against your latency requirements.
Tool Use and Function Calling
GPT-5.5’s function calling capability is arguably its strongest differentiator over competing models. It provides several layers of structured output control:
Function Calling Modes
- Auto: Model decides when to call a function vs. respond in text. Best for general-purpose agents.
- Required: Model must call a function on every turn. Use when you always need structured output.
- None: Disables function calling entirely for a specific request.
- Specific function: tool_choice: {"type": "function", "function": {"name": "my_function"}} forces a specific function call.
Parallel Function Calls
GPT-5.5 can call multiple functions in a single model turn — for example, querying a weather API and a database simultaneously. This reduces round-trips in agent loops and can cut latency by 40-60% compared to sequential single-function calls.
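On the client side, parallel calls only save time if your code also runs them concurrently and answers every call. The following sketch assumes tool calls arrive as plain dicts mirroring the shape of the SDK's `tool_calls` objects (`id`, `function.name`, `function.arguments`); the handlers `get_weather` and `lookup_order` are hypothetical stand-ins for real integrations.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def get_weather(location: str) -> dict:
    """Hypothetical handler; a real one would call a weather API."""
    return {"location": location, "temperature_c": 22}


def lookup_order(order_id: str) -> dict:
    """Hypothetical handler; a real one would query a database."""
    return {"order_id": order_id, "status": "shipped"}


HANDLERS = {"get_weather": get_weather, "lookup_order": lookup_order}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Run every requested tool concurrently and build one tool message per call."""

    def run_one(call: dict) -> dict:
        handler = HANDLERS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(handler(**args)),
        }

    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so replies line up with the requests.
        return list(pool.map(run_one, tool_calls))


if __name__ == "__main__":
    calls = [
        {"id": "call_1", "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}},
        {"id": "call_2", "function": {"name": "lookup_order", "arguments": '{"order_id": "A17"}'}},
    ]
    for message in run_tool_calls(calls):
        print(message)
```

Each resulting message carries the matching `tool_call_id`, which is what lets the model pair results with the calls it made.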
Structured Outputs (JSON Schema Mode)
Beyond JSON mode (which forces syntactically valid JSON), GPT-5.5 supports schema-constrained structured outputs: you provide a JSON Schema object and the model’s response is constrained to match it. This removes most post-processing validation, which is critical for production pipelines, but your code should still handle refusals and responses truncated by the output token limit.
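If you prefer to pass a raw JSON Schema rather than a Pydantic model, the request uses a `response_format` of type `json_schema`. The sketch below builds that payload; the helper name `json_schema_format` and the example schema are illustrative, and the layout follows the Chat Completions Structured Outputs format as I understand it.

```python
def json_schema_format(name: str, schema: dict) -> dict:
    """Build a response_format payload requesting strict schema adherence."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# Illustrative schema for a product review extraction task.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"},
        "key_points": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment", "score", "key_points"],
    "additionalProperties": False,
}

if __name__ == "__main__":
    print(json_schema_format("product_review", REVIEW_SCHEMA))
```

Strict mode generally expects every property to be listed under `required` and `additionalProperties` set to false, which is why the example schema does both.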
Assistants API v2
For complex agentic workflows, the Assistants API provides managed infrastructure:
- Threads: Persistent conversation histories without client-side state management
- Runs: Async execution with status polling
- File Search: Managed vector store plus retrieval, no separate embedding pipeline needed
- Code Interpreter: Sandboxed Python execution, file upload/download, chart generation
The Assistants API is the right choice when you want OpenAI to manage state and retrieval. The raw Chat Completions API gives you more control and is typically cheaper for simpler use cases.
Streaming and API Features
GPT-5.5 supports the full range of OpenAI’s advanced API features:
Streaming (Server-Sent Events)
Set stream: true to receive tokens as they are generated. This dramatically improves perceived latency for user-facing applications — users see text appear progressively rather than waiting for the full response. Streaming is fully supported alongside function calls: you receive partial delta objects as the model generates function arguments.
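Because function arguments arrive as string fragments, a streaming client has to stitch them together before parsing. This is a minimal sketch that assumes each streamed delta has been converted to a plain dict with an optional `tool_calls` list whose entries carry an `index`, an optional `id`, and partial `function` fields; the function name `accumulate_tool_calls` is illustrative.

```python
import json


def accumulate_tool_calls(deltas: list[dict]) -> list[dict]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: dict[int, dict] = {}
    for delta in deltas:
        for part in delta.get("tool_calls") or []:
            call = calls.setdefault(part["index"], {"id": None, "name": "", "arguments": ""})
            if part.get("id"):
                call["id"] = part["id"]
            function = part.get("function") or {}
            call["name"] += function.get("name") or ""
            call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]


if __name__ == "__main__":
    # Simulated deltas: the first carries the id and name, later ones carry argument text.
    stream = [
        {"tool_calls": [{"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}}]},
        {"tool_calls": [{"index": 0, "function": {"arguments": '{"location": '}}]},
        {"tool_calls": [{"index": 0, "function": {"arguments": '"Tokyo"}'}}]},
        {"content": None},
    ]
    call = accumulate_tool_calls(stream)[0]
    print(call["name"], json.loads(call["arguments"]))
```

Parse the arguments with `json.loads` only after the stream has finished, since intermediate fragments are not valid JSON on their own.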
Logprobs
Set logprobs: true and top_logprobs: N (max 20) to receive the log-probability of each generated token, plus the top N alternatives. Use cases: calibrated confidence scoring, detecting uncertain model outputs, classification tasks where you need probability distributions rather than argmax.
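For the classification use case, log-probabilities can be turned into a distribution over your labels by exponentiating and renormalizing. The sketch below assumes the top alternatives for the answer token have already been flattened into a token-to-logprob dict, which is a simplification of the actual response structure; `label_probabilities` is an illustrative name.

```python
import math


def label_probabilities(top_logprobs: dict[str, float], labels: list[str]) -> dict[str, float]:
    """Convert token log-probabilities into a normalized distribution over known labels."""
    raw = {
        label: math.exp(top_logprobs[label])
        for label in labels
        if label in top_logprobs
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("none of the labels appear in top_logprobs")
    # Renormalize so probability mass on non-label tokens is discarded.
    return {label: p / total for label, p in raw.items()}


if __name__ == "__main__":
    observed = {"positive": math.log(0.6), "negative": math.log(0.2), "the": math.log(0.2)}
    print(label_probabilities(observed, ["positive", "negative"]))
```

A low maximum probability after renormalization is a simple signal for routing an item to human review.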
Seed Parameter
Pass a seed integer for approximately reproducible outputs. The model will attempt to return the same result for the same prompt plus seed combination. Note: reproducibility is not guaranteed (model updates can change outputs even with the same seed). Use the system_fingerprint field in the response to detect when the underlying model version has changed.
JSON Mode
Set response_format: {"type": "json_object"} to guarantee syntactically valid JSON output. Pair with explicit JSON structure instructions in your system prompt. For schema-level guarantees, use the Structured Outputs feature instead.
Quick Start Code Examples
Python — Basic Chat Completion
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-5-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
Python — Streaming Response
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-5-5",
    messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
    stream=True,
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # newline at end
JavaScript / Node.js — Streaming with OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from process.env

async function streamResponse() {
  const stream = await client.chat.completions.create({
    model: "gpt-5-5",
    messages: [
      { role: "system", content: "You are a concise technical writer." },
      { role: "user", content: "Explain the event loop in Node.js." },
    ],
    stream: true,
  });

  // Write each text fragment to stdout as it arrives
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content || "";
    process.stdout.write(text);
  }
  console.log();
}

streamResponse();
Python — Structured Output (JSON Schema)
from openai import OpenAI
from pydantic import BaseModel
from typing import List

client = OpenAI()


class ProductReview(BaseModel):
    sentiment: str  # "positive" | "negative" | "neutral"
    score: float  # 1.0 to 5.0
    key_points: List[str]
    summary: str


response = client.beta.chat.completions.parse(
    model="gpt-5-5",
    messages=[
        {"role": "system", "content": "Extract structured review data from customer feedback."},
        {"role": "user", "content": "This laptop is fantastic. Battery lasts 12 hours, keyboard is great, but the fan is loud. 4 out of 5 stars."},
    ],
    response_format=ProductReview,
)

review = response.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}")
print(f"Key points: {review.key_points}")
Function Calling Example
This example shows a complete function-calling workflow: defining a tool, making the API call, and handling the tool result.
import json
from openai import OpenAI

client = OpenAI()

# Step 1: Define your function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country, e.g. London, UK",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather like in Tokyo right now?"}]

# Step 2: Send to GPT-5.5 with tools available
response = client.chat.completions.create(
    model="gpt-5-5",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
response_message = response.choices[0].message

# Step 3: Check if the model wants to call a function
if response_message.tool_calls:
    # Keep the assistant turn that requested the calls in the history
    messages.append(response_message)

    # Step 4: Execute your actual function for every requested call
    # (the model may request more than one; each needs its own reply)
    for tool_call in response_message.tool_calls:
        function_args = json.loads(tool_call.function.arguments)
        # (In production, call your real weather API here)
        weather_result = {
            "location": function_args["location"],
            "temperature": "22 degrees C",
            "condition": "Partly cloudy",
            "humidity": "65%",
        }

        # Step 5: Send the result back to the model
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(weather_result),
        })

    final_response = client.chat.completions.create(
        model="gpt-5-5",
        messages=messages,
    )
    print(final_response.choices[0].message.content)
    # Example output: The current weather in Tokyo is 22 degrees C and partly cloudy, with 65% humidity.
else:
    # The model answered directly without calling a tool
    print(response_message.content)
GPT-5.5 vs GPT-4o: When to Upgrade
GPT-4o remains an excellent model, and at $0.50/$1.50 per million tokens it is four times cheaper than GPT-5.5 on input and more than five times cheaper on output. The upgrade decision should be driven by what the performance difference is worth to your application.
Stick with GPT-4o when:
- High-volume, simple tasks: Classification, sentiment analysis, basic summarization, FAQ answering — tasks where GPT-4o already achieves near-perfect accuracy
- Cost-sensitive pipelines: If you’re processing millions of documents per day, the 4x input cost difference compounds fast. At 100M input tokens/day: GPT-4o costs ~$50; GPT-5.5 costs ~$200
- Latency-tolerant batch work: The Batch API at 50% discount makes GPT-4o ($0.25/1M) extremely competitive for async workloads
- Existing pipelines that work: If your GPT-4o pipeline already meets quality thresholds, switching for its own sake is cost without benefit
Upgrade to GPT-5.5 when:
- Complex multi-step reasoning: Tasks requiring the model to hold multiple constraints in mind, resolve conflicts, or reason through ambiguous instructions
- Production tool use / agents: GPT-5.5’s function calling reliability is measurably better — fewer malformed JSON outputs, better argument extraction, more consistent parallel tool use
- Nuanced instruction following: Long, complex system prompts with many constraints and edge cases
- User-facing quality matters: When output quality directly impacts user experience and retention — content generation, AI assistants, code review tools
- Schema-constrained structured output: The Structured Outputs feature (guaranteed schema adherence) is most robust on GPT-5.5
Hybrid strategy: Many production systems route by task complexity: GPT-4o for simple queries (fast, cheap), GPT-5.5 for complex ones. The routing decision is typically made in your own application layer, for example by a lightweight classifier that runs before the API call.
GPT-5.5 vs Claude Sonnet 4.6 API
Both models sit in the same price tier (~$2-3/1M input tokens) and offer 128k-200k context windows, making this comparison highly relevant for developers choosing a primary API provider.
| Capability | GPT-5.5 | Claude Sonnet 4.6 |
|---|---|---|
| Input price | $2.00/1M | $3.00/1M |
| Output price | $8.00/1M | $15.00/1M |
| Context window | 128k tokens | 200k tokens |
| Max output | 16,384 tokens | 8,192 tokens |
| Function calling | Excellent | Very good |
| Structured output (schema) | Native guaranteed | Via tool use |
| Long-context retrieval | Very good | Excellent |
| Creative writing | Very good | Excellent |
| Coding (HumanEval) | 91.0% | 89.3% |
| Ecosystem / integrations | Extensive (Azure, LangChain, etc.) | Growing |
GPT-5.5 wins on:
- Tool use reliability and function calling precision
- Native structured output with schema enforcement
- Breadth of ecosystem integrations (Azure OpenAI, LangChain, LlamaIndex, AutoGPT)
- Higher max output token limit (16k vs 8k)
- Lower cost at standard tiers
Claude Sonnet 4.6 wins on:
- Longer context window (200k vs 128k) — matters for very large documents
- More reliable retrieval from the middle of long contexts
- Creative writing quality — voice, nuance, narrative range
- Instruction following on ambiguous, open-ended creative tasks
- Extended thinking mode for complex reasoning (comparable to o3 at lower cost)
Developer recommendation: If your application is primarily tool-use, agent, or structured-output driven — GPT-5.5. If your application is primarily long-document processing, creative generation, or requires the largest possible context — Claude Sonnet 4.6. For most teams, it’s worth benchmarking both on your specific pipeline before committing.
GPT-5.5 vs o3 (Reasoning Model)
o3 is OpenAI’s dedicated reasoning model, priced at $10/1M input / $30/1M output: 5x more expensive than GPT-5.5 on input and 3.75x on output. The performance premium is real but highly task-dependent.
When o3 is worth the premium:
- Mathematical reasoning: Theorem proving, symbolic computation, complex numerical problems (o3 scores 97.9% on MATH-500 vs GPT-5.5’s 94.8%)
- Multi-step logical chains: Tasks requiring the model to build and verify a chain of reasoning across many steps without losing track
- Scientific reasoning: GPQA Diamond score of 87.7% vs GPT-5.5’s 78.0% — nearly a 10-point gap on graduate-level science questions
- Code competition problems: Competitive programming, algorithm design, formal verification
When GPT-5.5 is the better choice:
- Everything else: For the vast majority of production use cases (content generation, document processing, tool use agents, data extraction), GPT-5.5 delivers 90%+ of o3’s quality at 20% of the input cost and about 27% of the output cost
- Latency-sensitive applications: o3 is slower due to extended chain-of-thought reasoning; GPT-5.5 is significantly faster for real-time applications
- High-volume pipelines: At 5x the price, o3 is prohibitive at scale unless the task genuinely requires its reasoning depth
Smart routing: A common pattern is to classify incoming queries by complexity in your application layer, route simple/medium to GPT-5.5, and route hard reasoning tasks to o3. This achieves near-o3 quality at much lower average cost.
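A routing layer can start as a few lines of application code. The sketch below is a deliberately naive keyword heuristic, not a recommended classifier; the hint list and the function name `pick_model` are assumptions, and a production router would more likely use a small model or learned classifier.

```python
# Hypothetical phrases that suggest a query needs deep multi-step reasoning.
REASONING_HINTS = (
    "prove",
    "theorem",
    "derive",
    "formal verification",
    "competitive programming",
)


def pick_model(query: str) -> str:
    """Route hard reasoning queries to o3 and everything else to GPT-5.5."""
    text = query.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "o3"
    return "gpt-5-5"


if __name__ == "__main__":
    for q in ["Prove that sqrt(2) is irrational.", "Summarize this support ticket."]:
        print(pick_model(q), "<-", q)
```

Logging the routing decision alongside quality metrics makes it possible to tune the heuristic against real traffic.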
Rate Limits and Tiers
Understanding rate limits is critical for production architecture. OpenAI uses a tiered system based on cumulative account spend:
| Tier | Requirement | RPM | TPM | RPD |
|---|---|---|---|---|
| Tier 1 | New accounts / less than $50 spend | 500 | 30,000 | 10,000 |
| Tier 2 | $50+ spend, 7+ days | 5,000 | 450,000 | Unlimited |
| Tier 3 | $500+ spend, 14+ days | 10,000 | 800,000 | Unlimited |
| Tier 4 | $5,000+ spend, 30+ days | 15,000 | 2,000,000 | Unlimited |
| Enterprise | Custom agreement | Negotiated | Negotiated | Negotiated |
RPM = requests per minute, TPM = tokens per minute, RPD = requests per day
Architectural implications:
- Tier 1 at 500 RPM: Fine for development and low-traffic production. If you need more throughput immediately, consider the Batch API (no RPM limit for async workloads)
- Caching strategies: At high TPM, repeated identical prompts waste quota and money. Implement application-level caching for deterministic tasks
- Exponential backoff: Always implement retry logic with exponential backoff for 429 (rate limit) and 500/503 (transient server) errors
- TPM vs RPM: For large-context requests (50k+ tokens each), TPM is usually the binding constraint, not RPM. Design your batching strategy accordingly
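The retry advice above can be sketched as a small wrapper. This version uses exponential backoff with full jitter; `ApiError` is a hypothetical exception type standing in for whatever your HTTP client or SDK raises, and the retryable status set follows the list above.

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 503}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying retryable ApiErrors with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ApiError as exc:
            if exc.status not in RETRYABLE_STATUSES or attempt == max_retries:
                raise
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling to avoid retry storms.
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ApiError(429)
        return "ok"

    print(call_with_backoff(flaky, sleep=lambda seconds: None), attempts["count"])
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.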
Caching and Cost Optimization
Production use of GPT-5.5 at scale requires deliberate cost management. Here are the highest-impact strategies:
1. Prompt Caching (Automatic — 50% Discount)
OpenAI automatically caches prompt prefixes and applies a 50% discount when a cached prefix is reused. To maximize cache hits:
- Put static content first: System prompt, fixed instructions, reference documents — all at the top of your messages array
- Put dynamic content last: User input, session-specific context — at the end
- Do not rotate system prompts unnecessarily: Each unique system prompt is cached separately; constant variation defeats caching
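The ordering rules above, plus the effect of the cache hit rate on price, can be sketched as follows. The helper names are illustrative, the prices come from the pricing table, and `cached_fraction` is an assumed input you would measure from your own usage data.

```python
STANDARD_INPUT = 2.00  # USD per 1M input tokens (pricing table)
CACHED_INPUT = 1.00    # USD per 1M cached input tokens (pricing table)


def build_messages(static_context: str, user_input: str) -> list[dict]:
    """Put static content first so successive requests share an identical prefix."""
    return [
        {"role": "system", "content": static_context},  # stable, cacheable prefix
        {"role": "user", "content": user_input},         # dynamic content last
    ]


def effective_input_price(cached_fraction: float) -> float:
    """Blend standard and cached input prices by the share of cached tokens."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return STANDARD_INPUT * (1 - cached_fraction) + CACHED_INPUT * cached_fraction


if __name__ == "__main__":
    print(build_messages("You are a contracts analyst. <reference docs>", "Summarize clause 4."))
    print(f"${effective_input_price(0.8):.2f} per 1M input tokens at 80% cached")
```

At an 80% cached share the effective input price works out to $1.20 per 1M tokens, compared with $2.00 uncached.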
2. Batch API (50% Discount, Async)
For any workload that does not require real-time responses — analytics, offline document processing, bulk content generation, model evaluation — the Batch API halves your costs. Submit jobs via POST /v1/batches, poll for completion, retrieve results. Turnaround: up to 24 hours (typically much faster for small batches).
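Batch jobs are submitted as a JSONL file with one request per line. The sketch below builds those lines in the format the Batch API uses for chat completion requests as I understand it; the `custom_id` naming scheme and the function name `batch_lines` are illustrative, and uploading the file and creating the batch are left out.

```python
import json


def batch_lines(prompts: list[str], model: str = "gpt-5-5", max_tokens: int = 256) -> str:
    """Build JSONL content with one chat completion request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


if __name__ == "__main__":
    print(batch_lines(["Classify: great product", "Classify: arrived broken"]))
```

Results come back in arbitrary order, so the `custom_id` is the only reliable way to join outputs to inputs.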
3. Token Budgeting
- Use max_tokens to cap output length — unbounded outputs on long-context tasks can generate expensive surprises
- Trim conversation history in multi-turn applications — you rarely need every turn for context; use summarization to compress older history
- For classification tasks, use logprobs and extract the answer from token probabilities rather than generating a full text response
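History trimming can be sketched as keeping the system message plus the most recent turns that fit a token budget. This example uses the rough 0.75 words-per-token heuristic instead of a real tokenizer, and the function names are illustrative; it drops older turns rather than summarizing them.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate using the 0.75 words-per-token rule of thumb."""
    return math.ceil(len(text.split()) / 0.75)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages plus the newest turns that fit within the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for message in reversed(turns):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "be brief"},
        {"role": "user", "content": "one two three"},
        {"role": "assistant", "content": "four five six"},
        {"role": "user", "content": "seven eight nine"},
    ]
    for message in trim_history(history, budget=11):
        print(message)
```

With a budget of 11 estimated tokens, the example keeps the system message and the last two turns and drops the oldest user turn.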
4. Model Tiering
Route by complexity: GPT-4o mini for trivial tasks (translation, basic classification), GPT-4o for medium tasks, GPT-5.5 only for tasks that genuinely benefit from its additional capabilities. A well-tiered routing strategy can cut average cost by 60-80% with minimal quality loss.
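The savings claim is easy to sanity-check with a blended price. The sketch below uses only the two input prices listed in this article (GPT-4o mini is omitted because no price is given for it), and the traffic shares are hypothetical.

```python
# USD per 1M input tokens, from the pricing table in this article.
INPUT_PRICE = {"gpt-4o": 0.50, "gpt-5-5": 2.00}


def blended_input_price(traffic_share: dict[str, float]) -> float:
    """Average input price per 1M tokens given each model's share of traffic."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(INPUT_PRICE[model] * share for model, share in traffic_share.items())


def savings_vs_all_premium(traffic_share: dict[str, float]) -> float:
    """Fractional saving compared with sending all traffic to GPT-5.5."""
    return 1 - blended_input_price(traffic_share) / INPUT_PRICE["gpt-5-5"]


if __name__ == "__main__":
    split = {"gpt-4o": 0.8, "gpt-5-5": 0.2}  # hypothetical routing split
    print(f"${blended_input_price(split):.2f} per 1M input tokens")
    print(f"{savings_vs_all_premium(split):.0%} saved versus all GPT-5.5")
```

With 80% of input tokens routed to GPT-4o, the blended price is $0.80 per 1M and the saving is 60%, so reaching the upper end of the quoted range requires either a cheaper tier for trivial tasks or an even larger share of simple traffic.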
5. tiktoken for Pre-flight Cost Estimation
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def estimate_cost(prompt: str, expected_output_tokens: int = 500) -> dict:
    """Estimate the USD cost of one request at standard gpt-5-5 rates."""
    input_tokens = len(enc.encode(prompt))
    input_cost = (input_tokens / 1_000_000) * 2.00  # $2/1M input
    output_cost = (expected_output_tokens / 1_000_000) * 8.00  # $8/1M output
    return {
        "input_tokens": input_tokens,
        "estimated_output_tokens": expected_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6),
    }


# Placeholder text; replace with the document you plan to send
long_document = "Full text of the report goes here."
print(estimate_cost("Summarize this 50-page report..." + long_document))
Azure OpenAI: Enterprise Deployment
GPT-5.5 is available on Azure OpenAI Service for enterprise customers who need compliance, data governance, or existing Azure infrastructure integration.
Why Azure OpenAI:
- Data residency: Deploy in specific Azure regions (US, EU, Asia) to meet data sovereignty requirements
- Network isolation: Private endpoints and VNet integration — model traffic never leaves your network perimeter
- Contractual data privacy: Microsoft’s enterprise agreement guarantees your data is not used for OpenAI model training
- SLA commitments: 99.9% uptime SLA backed by Microsoft — versus OpenAI’s best-effort reliability
- Unified billing: Costs flow through existing Azure billing and committed-use discounts
- Compliance certifications: SOC 2 Type II, ISO 27001, HIPAA-eligible, FedRAMP (select regions)
API compatibility:
Azure OpenAI is largely API-compatible with OpenAI’s API. The main differences: authentication uses Azure AD tokens or API keys scoped to your deployment, and the endpoint URL follows the pattern https://your-resource.openai.azure.com/openai/deployments/deployment-name/chat/completions?api-version=2024-02-01.
The OpenAI Python SDK supports Azure natively:
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="your-azure-api-key",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="your-gpt-5-5-deployment-name",  # Azure uses deployment names
    messages=[{"role": "user", "content": "Hello from Azure!"}],
)
When to choose Azure vs OpenAI direct: If your organization already uses Azure, has strict data governance requirements, or is in a regulated industry (healthcare, finance, government) — Azure OpenAI is the right choice. For startups and individual developers, OpenAI direct is faster to set up and has the latest model versions first.
Verdict
GPT-5.5 is OpenAI’s most capable general-purpose API model for 2026 and sits at the center of the most important capability cluster for production AI development: reliable tool use, schema-constrained structured outputs, and instruction-following precision.
What you get for $2/1M input tokens:
- The most reliable function-calling implementation in class
- Guaranteed schema adherence via Structured Outputs
- 128k context window suitable for nearly all production use cases
- Best-in-class HumanEval coding performance at this price tier
- Multimodal vision at no extra cost
- A mature ecosystem — LangChain, LlamaIndex, AutoGPT, Semantic Kernel all have first-class GPT-5.5 support
The trade-offs are clear: If you need mathematical reasoning at the frontier, o3 is worth the 5x premium. If you need the largest context window with the best long-document retrieval, Claude Sonnet 4.6 200k is worth considering. If you are running high-volume simple tasks, GPT-4o at $0.50/1M input is four times cheaper.
But for the developer building a real product — a code assistant, a customer support agent, a document intelligence pipeline, a multi-tool AI workflow — GPT-5.5 is the model to start with and the benchmark to beat.
Rating: 4.5 / 5 — Deducted half a point for cost relative to GPT-4o (the upgrade decision requires careful task analysis) and for the lack of native video understanding.
Target Audience
Ideal for: Complex code generation, structured extraction, high-value segmentation logic, and user-facing agents that need sub-2s time-to-first-token.