Bottom Line

GPT-5.5 is OpenAI’s premium standard API tier — 1M context, stronger tool dispatch than 5.4, and top Terminal-Bench scores. Use for quality-critical work; escalate to Pro only when errors are costly.

GPT-5.5 API: Quick Summary

OpenAI’s GPT-5.5 is the company’s flagship multimodal model as of mid-2026, representing a significant step forward from the GPT-4 family. Available under the model ID gpt-5-5 (standard) and gpt-5-5-pro for enhanced reasoning tasks, GPT-5.5 delivers state-of-the-art performance across text, vision, tool use, and structured output generation.

Access GPT-5.5 through:

OpenAI API — direct REST or SDK access at api.openai.com
ChatGPT Plus / Pro — consumer-facing, rate-limited chat interface
Microsoft Azure OpenAI Service — enterprise deployment with data residency and compliance controls
OpenAI Assistants API v2 — managed infrastructure with Threads, Runs, file search, and code interpreter

For developers building production AI features — agents, RAG pipelines, structured data extraction, multimodal applications — GPT-5.5 is the current benchmark. This review covers everything: pricing, context limits, benchmarks, code examples, and how GPT-5.5 stacks up against GPT-4o, Claude Sonnet 4.6, and o3.

GPT-5.5 Pricing (API)

OpenAI bills per token, with prices quoted per 1 million tokens. Here is the full breakdown as of mid-2026:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
gpt-5-5	$2.00	$8.00	Standard; vision included
gpt-5-5 (cached input)	$1.00	$8.00	Automatic prompt cache discount
gpt-5-5-16k	$3.00	$12.00	Extended context / output tasks
gpt-5-5 Batch API	$1.00	$4.00	50% discount; async, up to 24hr turnaround
gpt-4o	$0.50	$1.50	Best-value workhorse
o3	$10.00	$30.00	Advanced reasoning; 5x more expensive

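Vision pricing: Images are billed by tile (512×512 px each) at the same token rates as text, so there is no separate vision surcharge. Under the accounting OpenAI has documented for its GPT-4-class vision models (85 base tokens plus 170 per tile), a 1024×1024 image in high-detail mode covers four tiles and costs roughly 765 tokens, so budget per tile (about 170 tokens) rather than per image.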

Free tier: New OpenAI API accounts receive $5 in free credits, valid for 90 days. Sufficient for exploratory testing but not sustained development.

Rate limits by tier:

Tier 1 (new accounts, less than $50 cumulative spend): 500 RPM, 30,000 TPM, 10,000 RPD
Tier 2 ($50+ spend): 5,000 RPM, 450,000 TPM
Tier 3 ($500+ spend): 10,000 RPM, 800,000 TPM
Enterprise: negotiated limits; SLA commitments available

Cost optimization tips: Use the Batch API for any workload that tolerates up to 24-hour latency (analytics, offline enrichment, bulk classification). Use prompt caching by placing your system prompt and repeated context at the very start of every API call — OpenAI automatically applies the 50% cache discount when a prefix matches a cached version.

Context Window and Limits

GPT-5.5 supports a 128,000-token context window, equivalent to roughly 300 pages of text. This makes it capable of processing:

Long documents (legal contracts, research papers, financial reports)
Entire codebases or large single files
Multi-turn conversation histories spanning dozens of exchanges
Complex RAG pipelines where many retrieved chunks are injected into a single prompt

Max output tokens:

Standard (gpt-5-5): 16,384 tokens per response (~12,000 words)
Extended output (gpt-5-5-pro or with max_tokens override on eligible accounts): 32,768 tokens

Practical note on context quality: A large context window is only useful if the model retrieves information reliably from every part of it. GPT-5.5 performs well across its full context but, like all large-context models, shows slightly degraded retrieval accuracy for facts buried in the middle of very long prompts (the “lost in the middle” effect). Mitigate this by placing the most critical instructions at the start and end of your prompt.
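
One way to apply that mitigation is to assemble the prompt so the instructions appear before the documents and are repeated after them. The sketch below is illustrative only: the function name, the "[Document N]" labels, and the sample contracts are assumptions, not part of any OpenAI API.

```python
def build_long_context_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Place instructions before the documents and repeat them, with the question, after."""
    doc_block = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{instructions}\n\n"
        f"{doc_block}\n\n"
        f"Reminder of the instructions: {instructions}\n"
        f"Question: {question}"
    )


prompt = build_long_context_prompt(
    instructions="Answer only from the documents and cite the document number.",
    documents=["Contract A: payment is due in 30 days.", "Contract B: payment is due in 45 days."],
    question="Which contract has the longer payment term?",
)
print(prompt)
```

Repeating the instructions costs a few extra tokens per request, which is usually negligible next to the documents themselves.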

Token counting: Use OpenAI’s tiktoken library to estimate token counts before production deployment. GPT-5.5 uses the same cl100k_base tokenizer as GPT-4. On average, 1 token is approximately 0.75 words in English prose.

GPT-5.5 Benchmark Performance

Benchmark results for GPT-5.5, compared to leading alternatives as of mid-2026:

Benchmark	GPT-5.5	GPT-4o	Claude Sonnet 4.6	o3
MMLU (knowledge)	89.4%	86.4%	88.7%	91.2%
HumanEval (coding)	91.0%	87.1%	89.3%	94.8%
MATH-500	94.8%	76.6%	90.1%	97.9%
GPQA Diamond (science)	78.0%	53.6%	75.4%	87.7%
LMSYS Arena ELO (mid-2026)	Top-5	Top-15	Top-5	Top-3

Where GPT-5.5 leads:

Instruction following: GPT-5.5 reliably adheres to detailed, multi-constraint system prompts — a significant improvement over GPT-4-turbo and competitive with o3 for non-reasoning tasks
Tool use and function calling: Highest reliability in class for structured function call generation, parallel tool invocation, and schema adherence
Structured output: Guaranteed valid JSON output mode, schema-constrained responses, and logprob-calibrated confidence
Coding: Outperforms GPT-4o by about 4 points on HumanEval (91.0% vs 87.1%); excellent at debugging, refactoring, and explaining existing code

Where competitors have an edge:

Mathematical reasoning (o3): For theorem proving, multi-step numerical problems, and scientific reasoning chains, o3 remains materially better
Long-context retrieval (Claude Sonnet 4.6): Claude’s architecture retrieves facts more reliably from the middle of very long documents
Creative writing (Claude Sonnet 4.6): More nuanced voice and stylistic range for creative or narrative tasks

Multimodal Capabilities

GPT-5.5 is natively multimodal for text and images. Additional modalities (audio, structured data) are handled via companion endpoints.

Vision

Pass images as base64-encoded strings or URLs in the messages array. Supported formats: JPEG, PNG, GIF (static), WebP — up to 20MB per image and up to 10 images per request in the standard tier.
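
As a rough illustration of the base64 route, the helper below builds a user message with one inline image. It assumes the content-part layout that Chat Completions uses for earlier vision models ("text" and "image_url" parts with a data URL) carries over unchanged to gpt-5-5; the helper name and placeholder bytes are hypothetical.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that carries text plus one inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real file, e.g. open("chart.png", "rb").read()
message = image_message("Extract the data points from this chart.", b"\x89PNG placeholder")
print(message["content"][1]["image_url"]["url"][:40])
```

The returned dictionary can be placed directly in the messages list of a chat completion request. For images already hosted online, the same "url" field would hold the plain HTTPS URL instead of a data URL.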

Use cases covered well by GPT-5.5 vision:

Image description and alt-text generation
Chart and graph data extraction
Document OCR and form parsing
Screenshot analysis (UI debugging, QA automation)
Visual comparison (before/after, product vs. reference)
Medical image interpretation (informational — not a medical device)

What it cannot do: Native video understanding is not supported in GPT-5.5. For video analysis, use Google Gemini 2.5 Pro (native video frames) or extract key frames yourself before sending to GPT-5.5.

Audio

Audio is handled by companion endpoints, not the main chat completions API:

Speech-to-text: POST /v1/audio/transcriptions using the Whisper model. Supports MP3, MP4, MPEG, MPGA, M4A, WAV, WebM. Max file: 25MB. Excellent multilingual accuracy.
Text-to-speech: POST /v1/audio/speech using tts-1 (optimized for speed) or tts-1-hd (higher fidelity). Six voice options. Outputs MP3, Opus, AAC, or FLAC.
Real-time audio (Realtime API): WebSocket-based streaming audio I/O for conversational voice agents. Low-latency but higher cost — evaluate against your latency requirements.

Tool Use and Function Calling

GPT-5.5’s function calling capability is arguably its strongest differentiator over competing models. It provides several layers of structured output control:

Function Calling Modes

Auto: Model decides when to call a function vs. respond in text. Best for general-purpose agents.
Required: Model must call a function on every turn. Use when you always need structured output.
None: Disables function calling entirely for a specific request.
Specific function: tool_choice: {"type": "function", "function": {"name": "my_function"}} forces a specific function call.

Parallel Function Calls

GPT-5.5 can call multiple functions in a single model turn — for example, querying a weather API and a database simultaneously. This reduces round-trips in agent loops and can cut latency by 40-60% compared to sequential single-function calls.
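
On the client side, parallel calls only save time if the tools themselves run concurrently. The following is a minimal sketch under stated assumptions: tool calls are shown as plain dictionaries (with the SDK you would read tool_call.id, tool_call.function.name, and tool_call.function.arguments), and get_weather and lookup_order are hypothetical local tools.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical local tools; real ones would call a weather API and a database.
def get_weather(location: str) -> dict:
    return {"location": location, "condition": "Partly cloudy"}


def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"get_weather": get_weather, "lookup_order": lookup_order}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute every requested tool concurrently and return one 'tool' message per call."""

    def run_one(call: dict) -> dict:
        result = TOOLS[call["name"]](**json.loads(call["arguments"]))
        return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}

    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so results line up with the requested calls
        return list(pool.map(run_one, tool_calls))


tool_messages = run_tool_calls([
    {"id": "call_1", "name": "get_weather", "arguments": '{"location": "Tokyo, Japan"}'},
    {"id": "call_2", "name": "lookup_order", "arguments": '{"order_id": "A-100"}'},
])
print(tool_messages)
```

Threads suit I/O-bound tools such as HTTP requests or database queries; CPU-bound tools would need a different executor.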

Structured Outputs (JSON Schema Mode)

Beyond JSON mode (which forces syntactically valid JSON), GPT-5.5 supports schema-constrained structured outputs: you provide a JSON Schema object and the model’s response is constrained to match the schema exactly. That removes most post-processing validation, which is critical for production pipelines, though you should still handle refusals and responses cut off by the output token limit.
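
For readers who prefer a raw schema over a Pydantic model, the sketch below shows what the request parameter might look like. It assumes the response_format layout used by Structured Outputs on earlier OpenAI models ("json_schema" with "strict": True) applies to gpt-5-5 as well; the invoice schema and the parse_invoice helper are hypothetical, and the small key check is a cheap safeguard rather than something the API requires.

```python
import json

# Hypothetical schema for an invoice-extraction task.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}

# Passed as response_format=... in the chat completion request
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
}


def parse_invoice(raw_content: str) -> dict:
    """Parse the model's reply and fail loudly if a required key is missing."""
    data = json.loads(raw_content)
    missing = [key for key in invoice_schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# A hand-written reply stands in for response.choices[0].message.content
print(parse_invoice('{"vendor": "Acme", "total": 129.5, "currency": "USD"}'))
```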

Assistants API v2

For complex agentic workflows, the Assistants API provides managed infrastructure:

Threads: Persistent conversation histories without client-side state management
Runs: Async execution with status polling
File Search: Managed vector store plus retrieval, no separate embedding pipeline needed
Code Interpreter: Sandboxed Python execution, file upload/download, chart generation

The Assistants API is the right choice when you want OpenAI to manage state and retrieval. The raw Chat Completions API gives you more control and is typically cheaper for simpler use cases.

Streaming and API Features

GPT-5.5 supports the full range of OpenAI’s advanced API features:

Streaming (Server-Sent Events)

Set stream: true to receive tokens as they are generated. This dramatically improves perceived latency for user-facing applications — users see text appear progressively rather than waiting for the full response. Streaming is fully supported alongside function calls: you receive partial delta objects as the model generates function arguments.
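
When a function call is streamed, its arguments arrive as string fragments that must be joined before they can be parsed. The sketch below shows one way to do that. It assumes the delta shape used by the current OpenAI Python SDK (delta.tool_calls entries with index, id, function.name, and function.arguments) and uses SimpleNamespace objects to simulate the stream, so no API call is made.

```python
from types import SimpleNamespace as NS


def collect_tool_calls(deltas) -> list[dict]:
    """Merge streamed tool-call fragments into complete calls, keyed by their index."""
    calls: dict[int, dict] = {}
    for delta in deltas:
        for fragment in delta.tool_calls or []:
            call = calls.setdefault(fragment.index, {"id": None, "name": "", "arguments": ""})
            if fragment.id:
                call["id"] = fragment.id
            if fragment.function.name:
                call["name"] += fragment.function.name
            if fragment.function.arguments:
                call["arguments"] += fragment.function.arguments
    return [calls[i] for i in sorted(calls)]


# Simulated deltas; with the SDK these would be chunk.choices[0].delta for each chunk.
simulated = [
    NS(tool_calls=[NS(index=0, id="call_1", function=NS(name="get_weather", arguments=""))]),
    NS(tool_calls=[NS(index=0, id=None, function=NS(name=None, arguments='{"location": '))]),
    NS(tool_calls=[NS(index=0, id=None, function=NS(name=None, arguments='"Tokyo, Japan"}'))]),
    NS(tool_calls=None),
]
streamed_calls = collect_tool_calls(simulated)
print(streamed_calls)
```

Only parse the arguments with json.loads once the stream has finished, since intermediate fragments are not valid JSON on their own.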

Logprobs

Set logprobs: true and top_logprobs: N (max 20) to receive the log-probability of each generated token, plus the top N alternatives. Use cases: calibrated confidence scoring, detecting uncertain model outputs, classification tasks where you need probability distributions rather than argmax.
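
To turn log-probabilities into a confidence score for a classifier, exponentiate them and normalize over the labels you care about. The values below are hypothetical; with the SDK, the equivalent data would come from response.choices[0].logprobs.content[0].top_logprobs, where each entry has a token and a logprob.

```python
import math


def label_confidence(top_logprobs: dict[str, float], labels: list[str]) -> dict[str, float]:
    """Convert log-probabilities for candidate labels into normalized probabilities."""
    raw = {label: math.exp(top_logprobs[label]) for label in labels if label in top_logprobs}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("none of the labels appear in top_logprobs")
    return {label: prob / total for label, prob in raw.items()}


# Hypothetical top_logprobs for the first generated token of a sentiment classifier.
first_token = {"positive": -0.105, "negative": -2.996, "neutral": -3.912}
confidence = label_confidence(first_token, ["positive", "negative", "neutral"])
print(confidence)
```

This works cleanly only when each label is a single token; multi-token labels would need their per-token log-probabilities summed first.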

Seed Parameter

Pass a seed integer for approximately reproducible outputs. The model will attempt to return the same result for the same prompt plus seed combination. Note: reproducibility is not guaranteed (model updates can change outputs even with the same seed). Use the system_fingerprint field in the response to detect when the underlying model version has changed.

JSON Mode

Set response_format: {"type": "json_object"} to guarantee syntactically valid JSON output. Pair with explicit JSON structure instructions in your system prompt. For schema-level guarantees, use the Structured Outputs feature instead.

Quick Start Code Examples

Python — Basic Chat Completion

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-5-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
    max_tokens=1024,
    temperature=0.7
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Python — Streaming Response

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-5-5",
    messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content is not None:  # skip chunks without choices
        print(chunk.choices[0].delta.content, end="", flush=True)

print()  # newline at end

JavaScript / Node.js — Streaming with OpenAI SDK

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from process.env

async function streamResponse() {
  const stream = await client.chat.completions.create({
    model: "gpt-5-5",
    messages: [
      { role: "system", content: "You are a concise technical writer." },
      { role: "user", content: "Explain the event loop in Node.js." }
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content || "";
    process.stdout.write(text);
  }
  console.log();
}

streamResponse();

Python — Structured Output (JSON Schema)

from openai import OpenAI
from pydantic import BaseModel
from typing import List

client = OpenAI()

class ProductReview(BaseModel):
    sentiment: str  # "positive" | "negative" | "neutral"
    score: float    # 1.0 to 5.0
    key_points: List[str]
    summary: str

response = client.beta.chat.completions.parse(
    model="gpt-5-5",
    messages=[
        {"role": "system", "content": "Extract structured review data from customer feedback."},
        {"role": "user", "content": "This laptop is fantastic. Battery lasts 12 hours, keyboard is great, but the fan is loud. 4 out of 5 stars."}
    ],
    response_format=ProductReview,
)

review = response.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}")
print(f"Key points: {review.key_points}")

Function Calling Example

This example shows a complete function-calling workflow: defining a tool, making the API call, and handling the tool result.

import json
from openai import OpenAI

client = OpenAI()

# Step 1: Define your function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country, e.g. London, UK"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather like in Tokyo right now?"}]

# Step 2: Send to GPT-5.5 with tools available
response = client.chat.completions.create(
    model="gpt-5-5",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

response_message = response.choices[0].message

# Step 3: Check if the model wants to call a function
if response_message.tool_calls:
    # Append the assistant turn once; every tool call in it needs its own reply
    messages.append(response_message)

    for tool_call in response_message.tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        # Step 4: Execute your actual function
        # (In production, dispatch on function_name and call your real weather API here)
        weather_result = {
            "location": function_args["location"],
            "temperature": "22 degrees C",
            "condition": "Partly cloudy",
            "humidity": "65%"
        }

        # Step 5: Send the result back to the model, matched by tool_call_id
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(weather_result)
        })

    final_response = client.chat.completions.create(
        model="gpt-5-5",
        messages=messages
    )

    print(final_response.choices[0].message.content)
    # Example output (wording will vary): The current weather in Tokyo is 22 degrees C and partly cloudy, with 65% humidity.
else:
    # The model answered directly without requesting a tool
    print(response_message.content)

GPT-5.5 vs GPT-4o: When to Upgrade

GPT-4o remains an excellent model, and at $0.50/$1.50 per million tokens it costs a quarter of GPT-5.5’s input price and less than a fifth of its output price. The upgrade decision should be driven by what the performance difference is worth to your application.

Stick with GPT-4o when:

High-volume, simple tasks: Classification, sentiment analysis, basic summarization, FAQ answering — tasks where GPT-4o already achieves near-perfect accuracy
Cost-sensitive pipelines: If you’re processing millions of documents per day, the 4x input-price difference compounds fast. At 100M input tokens/day, GPT-4o costs ~$50 and GPT-5.5 costs ~$200, before output tokens
Latency-tolerant batch work: The Batch API at 50% discount makes GPT-4o ($0.25/1M input) extremely competitive for async workloads
Existing pipelines that work: If your GPT-4o pipeline already meets quality thresholds, switching for its own sake is cost without benefit
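
The daily-cost comparison above can be reproduced with a few lines of arithmetic. The sketch uses only the per-token prices quoted in this article; the 10M output tokens in the second figure are a hypothetical volume added to show how output pricing widens the gap.

```python
# Per-1M-token prices quoted in this article (USD): (input, output)
PRICES = {
    "gpt-4o": (0.50, 1.50),
    "gpt-5-5": (2.00, 8.00),
}


def daily_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Return the USD cost of one day's traffic for the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model in PRICES:
    input_only = daily_cost(model, 100_000_000)
    with_output = daily_cost(model, 100_000_000, 10_000_000)  # hypothetical output volume
    print(f"{model}: ${input_only:.2f} input only, ${with_output:.2f} with 10M output tokens")
```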

Upgrade to GPT-5.5 when:

Complex multi-step reasoning: Tasks requiring the model to hold multiple constraints in mind, resolve conflicts, or reason through ambiguous instructions
Production tool use / agents: GPT-5.5’s function calling reliability is measurably better — fewer malformed JSON outputs, better argument extraction, more consistent parallel tool use
Nuanced instruction following: Long, complex system prompts with many constraints and edge cases
User-facing quality matters: When output quality directly impacts user experience and retention — content generation, AI assistants, code review tools
Schema-constrained structured output: The Structured Outputs feature (guaranteed schema adherence) is most robust on GPT-5.5

Hybrid strategy: Many production systems route by task complexity — GPT-4o for simple queries (fast, cheap), GPT-5.5 for complex ones. The OpenAI Responses API supports model routing based on query characteristics.
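
Routing by complexity can also be done entirely in application code, before any request is sent. The heuristic below is a deliberately simple, hypothetical sketch (keyword hints plus a length threshold); it does not rely on any server-side routing feature, and production systems often replace it with a small classifier.

```python
# Hypothetical heuristic; real systems often use a small classifier or past quality metrics.
COMPLEX_HINTS = ("step by step", "refactor", "compare", "reconcile")


def pick_model(query: str, needs_tools: bool = False, max_simple_words: int = 40) -> str:
    """Return 'gpt-5-5' for tool-using or complex-looking queries, else 'gpt-4o'."""
    text = query.lower()
    looks_complex = len(text.split()) > max_simple_words or any(h in text for h in COMPLEX_HINTS)
    return "gpt-5-5" if needs_tools or looks_complex else "gpt-4o"


print(pick_model("What are your opening hours?"))
print(pick_model("Compare these two contracts and reconcile the payment terms."))
print(pick_model("Book a table for two.", needs_tools=True))
```

The returned string can be passed straight to the model parameter of the request, which keeps the routing decision easy to log and audit.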

GPT-5.5 vs Claude Sonnet 4.6 API

Both models sit in the same price tier (~$2-3/1M input tokens) and offer 128k-200k context windows, making this comparison highly relevant for developers choosing a primary API provider.

Capability	GPT-5.5	Claude Sonnet 4.6
Input price	$2.00/1M	$3.00/1M
Output price	$8.00/1M	$15.00/1M
Context window	128k tokens	200k tokens
Max output	16,384 tokens	8,192 tokens
Function calling	Excellent	Very good
Structured output (schema)	Native guaranteed	Via tool use
Long-context retrieval	Very good	Excellent
Creative writing	Very good	Excellent
Coding (HumanEval)	91.0%	89.3%
Ecosystem / integrations	Extensive (Azure, LangChain, etc.)	Growing

GPT-5.5 wins on:

Tool use reliability and function calling precision
Native structured output with schema enforcement
Breadth of ecosystem integrations (Azure OpenAI, LangChain, LlamaIndex, AutoGPT)
Higher max output token limit (16k vs 8k)
Lower cost at standard tiers

Claude Sonnet 4.6 wins on:

Longer context window (200k vs 128k) — matters for very large documents
More reliable retrieval from the middle of long contexts
Creative writing quality — voice, nuance, narrative range
Instruction following on ambiguous, open-ended creative tasks
Extended thinking mode for complex reasoning (comparable to o3 at lower cost)

Developer recommendation: If your application is primarily tool-use, agent, or structured-output driven — GPT-5.5. If your application is primarily long-document processing, creative generation, or requires the largest possible context — Claude Sonnet 4.6. For most teams, it’s worth benchmarking both on your specific pipeline before committing.

GPT-5.5 vs o3 (Reasoning Model)

o3 is OpenAI’s dedicated reasoning model, priced at $10/1M input / $30/1M output: 5x GPT-5.5’s input price and 3.75x its output price. The performance premium is real but highly task-dependent.

When o3 is worth the premium:

Mathematical reasoning: Theorem proving, symbolic computation, complex numerical problems (o3 scores 97.9% on MATH-500 vs GPT-5.5’s 94.8%)
Multi-step logical chains: Tasks requiring the model to build and verify a chain of reasoning across many steps without losing track
Scientific reasoning: GPQA Diamond score of 87.7% vs GPT-5.5’s 78.0% — nearly a 10-point gap on graduate-level science questions
Code competition problems: Competitive programming, algorithm design, formal verification

When GPT-5.5 is the better choice:

Everything else: For the vast majority of production use cases (content generation, document processing, tool use agents, data extraction), GPT-5.5 delivers 90%+ of o3’s quality at 20-27% of the cost, depending on the input/output mix
Latency-sensitive applications: o3 is slower due to extended chain-of-thought reasoning; GPT-5.5 is significantly faster for real-time applications
High-volume pipelines: At 5x the price, o3 is prohibitive at scale unless the task genuinely requires its reasoning depth

Smart routing: The OpenAI Responses API supports routing between models. A common pattern: classify incoming queries by complexity, route simple/medium to GPT-5.5, route hard reasoning tasks to o3. This achieves near-o3 quality at much lower average cost.

Rate Limits and Tiers

Understanding rate limits is critical for production architecture. OpenAI uses a tiered system based on cumulative account spend:

Tier	Requirement	RPM	TPM	RPD
Tier 1	New accounts / less than $50 spend	500	30,000	10,000
Tier 2	$50+ spend, 7+ days	5,000	450,000	Unlimited
Tier 3	$500+ spend, 14+ days	10,000	800,000	Unlimited
Tier 4	$5,000+ spend, 30+ days	15,000	2,000,000	Unlimited
Enterprise	Custom agreement	Negotiated	Negotiated	Negotiated

RPM = requests per minute, TPM = tokens per minute, RPD = requests per day

Architectural implications:

Tier 1 at 500 RPM: Fine for development and low-traffic production. If you need more throughput immediately, consider the Batch API (no RPM limit for async workloads)
Caching strategies: At high TPM, repeated identical prompts waste quota and money. Implement application-level caching for deterministic tasks
Exponential backoff: Always implement retry logic with exponential backoff for 429 (rate limit) and 500/503 (transient server) errors
TPM vs RPM: For large-context requests (50k+ tokens each), TPM is usually the binding constraint, not RPM. Design your batching strategy accordingly
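
The retry advice above can be sketched as a small wrapper. This is a minimal illustration, not a drop-in client: TransientError is a hypothetical stand-in for whatever exceptions your SDK raises on 429 or 5xx responses, and the sleep function is injectable so the demo runs instantly.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a 429 or 5xx error; map your SDK's exceptions onto it."""


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(), retrying on TransientError with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries


# Demo: a fake request that fails twice, then succeeds; sleeps are recorded, not performed.
attempts = {"count": 0}
delays = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("429 Too Many Requests")
    return "ok"


result = with_backoff(flaky_request, sleep=delays.append)
print(result, attempts["count"], len(delays))
```

Capping max_attempts matters as much as the backoff itself: unbounded retries against a rate limit only prolong the outage for every other caller sharing the quota.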

Caching and Cost Optimization

Production use of GPT-5.5 at scale requires deliberate cost management. Here are the highest-impact strategies:

1. Prompt Caching (Automatic — 50% Discount)

OpenAI automatically caches prompt prefixes and applies a 50% discount when a cached prefix is reused. To maximize cache hits:

Put static content first: System prompt, fixed instructions, reference documents — all at the top of your messages array
Put dynamic content last: User input, session-specific context — at the end
Do not rotate system prompts unnecessarily: Each unique system prompt is cached separately; constant variation defeats caching
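
A message builder that follows these three rules might look like the sketch below. The prompt text, policy text, and function name are hypothetical; the point is only that the first message is byte-identical across requests while per-request content comes last.

```python
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme. Follow the policy below."  # hypothetical
STATIC_POLICY = "Refunds are available within 30 days of purchase."  # hypothetical reference text


def build_messages(user_input: str, session_context: str = "") -> list[dict]:
    """Keep the byte-identical static prefix first and per-request content last."""
    messages = [
        {"role": "system", "content": f"{STATIC_SYSTEM_PROMPT}\n\n{STATIC_POLICY}"},
    ]
    if session_context:
        messages.append({"role": "user", "content": f"Context for this session: {session_context}"})
    messages.append({"role": "user", "content": user_input})
    return messages


first = build_messages("Can I return my order?", session_context="Order placed 10 days ago")
second = build_messages("Where is my invoice?")
print(first[0] == second[0])
```

Avoid interpolating timestamps, user names, or request IDs into the system message, since any change to the prefix produces a different cache entry.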

2. Batch API (50% Discount, Async)

For any workload that does not require real-time responses — analytics, offline document processing, bulk content generation, model evaluation — the Batch API halves your costs. Submit jobs via POST /v1/batches, poll for completion, retrieve results. Turnaround: up to 24 hours (typically much faster for small batches).
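
Batch jobs are submitted as a JSONL file with one request per line. The sketch below builds such a file in memory, assuming the line layout the Batch API has used for earlier models (custom_id, method, url, body) still applies; the prompts and IDs are hypothetical.

```python
import json


def build_batch_lines(prompts: dict[str, str], model: str = "gpt-5-5", max_tokens: int = 256) -> str:
    """Serialize one chat-completion request per line in the Batch API's JSONL layout."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # your own key for matching results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


jsonl = build_batch_lines({
    "review-1": "Classify the sentiment: 'Great battery life.'",
    "review-2": "Classify the sentiment: 'The fan is loud.'",
})
print(jsonl)
```

From there, the usual flow with the Python SDK is to write the string to a file, upload it with client.files.create(..., purpose="batch"), and pass the returned file ID to client.batches.create with a 24-hour completion window. Results are not guaranteed to come back in input order, which is why custom_id matters.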

3. Token Budgeting

Use max_tokens to cap output length — unbounded outputs on long-context tasks can generate expensive surprises
Trim conversation history in multi-turn applications — you rarely need every turn for context; use summarization to compress older history
For classification tasks, use logprobs and extract the answer from token probabilities rather than generating a full text response
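
History trimming can be as simple as keeping the system message plus the newest turns that fit a budget. The sketch below is an assumption-laden illustration: rough_tokens uses the 0.75 words-per-token rule of thumb rather than a real tokenizer, and older turns are dropped rather than summarized.

```python
def rough_tokens(text: str) -> int:
    """Crude estimate from the ~0.75 words-per-token rule of thumb; use tiktoken for real counts."""
    return max(1, round(len(text.split()) / 0.75))


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message plus the most recent turns that fit within the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(rough_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = rough_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]


history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Tell me about the history of the Roman aqueducts in detail."},
    {"role": "assistant", "content": "The aqueducts were built over several centuries to supply cities."},
    {"role": "user", "content": "How long was the longest one?"},
]
trimmed = trim_history(history, budget=30)
print([m["content"] for m in trimmed])
```

With a budget of 30 estimated tokens, the oldest user turn is dropped and the remaining three messages are kept in their original order.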

4. Model Tiering

Route by complexity: GPT-4o mini for trivial tasks (translation, basic classification), GPT-4o for medium tasks, GPT-5.5 only for tasks that genuinely benefit from its additional capabilities. A well-tiered routing strategy can cut average cost by 60-80% with minimal quality loss.
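
The saving from tiering depends almost entirely on how much traffic the cheaper model can absorb. A quick way to check is to compute the blended input price for a given split, as sketched below using the prices quoted in this article. The 80/20 split is hypothetical, and GPT-4o mini is left out because its price is not listed here.

```python
# Input prices per 1M tokens quoted in this article (USD).
INPUT_PRICE = {"gpt-4o": 0.50, "gpt-5-5": 2.00}


def blended_input_price(traffic_share: dict[str, float]) -> float:
    """Average input price per 1M tokens for a given split of traffic across models."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(INPUT_PRICE[model] * share for model, share in traffic_share.items())


# Hypothetical split: 80% of tokens go to gpt-4o, 20% to gpt-5-5.
blended = blended_input_price({"gpt-4o": 0.8, "gpt-5-5": 0.2})
saving = 1 - blended / INPUT_PRICE["gpt-5-5"]
print(f"${blended:.2f} per 1M input tokens, {saving:.0%} below all-gpt-5-5")
```

Under this split the blended price is $0.80 per 1M input tokens, a 60% reduction compared with sending everything to gpt-5-5.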

5. tiktoken for Pre-flight Cost Estimation

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_cost(prompt: str, expected_output_tokens: int = 500) -> dict:
    input_tokens = len(enc.encode(prompt))
    input_cost = (input_tokens / 1_000_000) * 2.00   # $2/1M input
    output_cost = (expected_output_tokens / 1_000_000) * 8.00  # $8/1M output
    return {
        "input_tokens": input_tokens,
        "estimated_output_tokens": expected_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 6)
    }

long_document = "..."  # placeholder: load the real report text here
print(estimate_cost("Summarize this 50-page report..." + long_document))

Azure OpenAI: Enterprise Deployment

GPT-5.5 is available on Azure OpenAI Service for enterprise customers who need compliance, data governance, or existing Azure infrastructure integration.

Why Azure OpenAI:

Data residency: Deploy in specific Azure regions (US, EU, Asia) to meet data sovereignty requirements
Network isolation: Private endpoints and VNet integration — model traffic never leaves your network perimeter
Contractual data privacy: Microsoft’s enterprise agreement guarantees your data is not used for OpenAI model training
SLA commitments: 99.9% uptime SLA backed by Microsoft — versus OpenAI’s best-effort reliability
Unified billing: Costs flow through existing Azure billing and committed-use discounts
Compliance certifications: SOC 2 Type II, ISO 27001, HIPAA-eligible, FedRAMP (select regions)

API compatibility:

Azure OpenAI is largely API-compatible with OpenAI’s API. The main differences: authentication uses Azure AD tokens or API keys scoped to your deployment, and the endpoint URL follows the pattern https://your-resource.openai.azure.com/openai/deployments/deployment-name/chat/completions?api-version=2024-02-01.

The OpenAI Python SDK supports Azure natively:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="your-azure-api-key",
    api_version="2024-02-01"
)

response = client.chat.completions.create(
    model="your-gpt-5-5-deployment-name",  # Azure uses deployment names
    messages=[{"role": "user", "content": "Hello from Azure!"}]
)

When to choose Azure vs OpenAI direct: If your organization already uses Azure, has strict data governance requirements, or is in a regulated industry (healthcare, finance, government) — Azure OpenAI is the right choice. For startups and individual developers, OpenAI direct is faster to set up and has the latest model versions first.

Verdict

GPT-5.5 is OpenAI’s most capable general-purpose API model for 2026. Its strengths are the ones that matter most for production AI development: reliable tool use, schema-constrained structured outputs, and precise instruction following.

What you get for $2/1M input tokens:

The most reliable function-calling implementation in class
Guaranteed schema adherence via Structured Outputs
128k context window suitable for nearly all production use cases
Best-in-class HumanEval coding performance at this price tier
Multimodal vision at no extra cost
A mature ecosystem — LangChain, LlamaIndex, AutoGPT, Semantic Kernel all have first-class GPT-5.5 support

The trade-offs are clear: If you need mathematical reasoning at the frontier, o3 is worth the 5x premium. If you need the largest context window with the best long-document retrieval, Claude Sonnet 4.6 200k is worth considering. If you are running high-volume simple tasks, GPT-4o at $0.50/1M input is four times cheaper.

But for the developer building a real product — a code assistant, a customer support agent, a document intelligence pipeline, a multi-tool AI workflow — GPT-5.5 is the model to start with and the benchmark to beat.

Rating: 4.5 / 5 — Deducted half a point for cost relative to GPT-4o (the upgrade decision requires careful task analysis) and for the lack of native video understanding.

Target Audience

Ideal for: Complex code generation, structured extraction, high-value segmentation logic, and user-facing agents that need sub-2s time-to-first-token.