Field Guide

Codex Models, Token Usage, and Best-Use Scenarios

Updated 2026-05-13 · Re-check OpenAI pricing before high-cost runs.

The simple rule: do not make the biggest model carry every log file, typo fix, and CSS tweak. Use the strongest model where judgment matters, then hand clear execution to cheaper tiers.

Disclosure: StackCapybara may earn a commission if you click some tool links. This does not affect our recommendations. We focus on tested workflows, practical fit, pricing reality, and clear limitations.

Quick answer: choose by risk

| Scenario | Model | Why |
| --- | --- | --- |
| High-risk decision | GPT-5.5 | Best for architecture, hard debugging, and final go/no-go review. |
| Bounded coding task | GPT-5.4 or GPT-5.3-Codex | Strong implementation quality when scope is clear. |
| Small fix / logs / QA | GPT-5.4 mini | Fast and cost-efficient for reversible, scoped work. |
| Extraction / tagging / routing | GPT-5.4 nano | High-volume structured prep, not risky coding. |

Capybara rule of thumb: the heavier the context load, the stricter the task boundary needs to be.

Model routing cards

GPT-5.5

Best for: risky architecture, incident diagnosis, final review.

Avoid for: repetitive extraction or tiny cosmetic fixes.

StackCapybara use: release-gate review before promotion.

GPT-5.4

Best for: important implementation with clear acceptance criteria.

Avoid for: high-ambiguity root-cause investigation.

StackCapybara use: medium feature implementation after design is set.

GPT-5.3-Codex

Best for: Codex-focused coding sessions under budget pressure.

Avoid for: broad strategy or uncertain cross-system failures.

StackCapybara use: bounded code tasks with known file scope.

GPT-5.4 mini

Best for: scoped fixes, log summaries, verification passes.

Avoid for: architecture-level judgment calls.

StackCapybara use: targeted CSS/PHP bug repairs and QA notes.

GPT-5.4 nano

Best for: extraction, tagging, and routing prep.

Avoid for: security-sensitive coding and production-risk reasoning.

StackCapybara use: metadata extraction from article drafts.

Task routing cards

WordPress plugin architecture

Use: GPT-5.5

Why: hidden side effects cost more than token savings.

Security-sensitive PHP

Use: GPT-5.5

Why: stronger risk detection and safer review quality.

CSS/layout bug

Use: GPT-5.4 mini

Why: scoped and reversible with lower spend.

Log summarization

Use: GPT-5.4 mini

Why: good signal compression without premium-token burn.

Content extraction

Use: GPT-5.4 nano

Why: high-volume structured work at low cost.

Final review

Use: GPT-5.5

Why: final gate is where quality matters most.

OpenAI pricing pressure

Use this to understand cost pressure, not as live billing advice. Re-check OpenAI pricing before high-cost runs.

GPT-5.5

Input: $5.00 / 1M

Cached input: $0.50 / 1M

Output: $30.00 / 1M

Use for: high-risk judgment and final review

GPT-5.4

Input: $2.50 / 1M

Cached input: $0.25 / 1M

Output: $15.00 / 1M

Use for: bounded implementation

GPT-5.3-Codex

Input: $1.75 / 1M

Cached input: $0.175 / 1M

Output: $14.00 / 1M

Use for: Codex coding passes with clear scope

GPT-5.4 mini

Input: $0.75 / 1M

Cached input: $0.075 / 1M

Output: $4.50 / 1M

Use for: scoped fixes and verification

GPT-5.4 nano

Input: $0.20 / 1M

Cached input: $0.02 / 1M

Output: $1.25 / 1M

Use for: extraction and classification

Legacy Codex models

GPT-5.2-Codex: $1.75 input / $0.175 cached / $14.00 output

GPT-5.1-Codex: $1.25 input / $10.00 output

GPT-5.1-Codex mini: $0.25 input / $2.00 output

Use for: fallback/reference only

Model notes: what changes in practice

GPT-5.5

Use when: risk is high and failure is expensive.

Avoid when: the task is mechanical and narrow.

Example: final review of a WordPress patch touching data behavior.

GPT-5.4

Use when: implementation is clear and bounded.

Avoid when: root cause is still unknown.

Example: implementing approved feature slices in known files.

GPT-5.3-Codex

Use when: coding depth is needed but budget is tighter.

Avoid when: judgment-heavy direction decisions are unresolved.

Example: medium-size coding pass after planning is complete.

GPT-5.4 mini

Use when: fix scope and rollback are straightforward.

Avoid when: architecture and risk boundaries are vague.

Example: layout regression repair with targeted QA validation.

GPT-5.4 nano

Use when: output is structured extraction or classification.

Avoid when: coding safety or incident analysis is required.

Example: extracting metadata from draft content sets.

Legacy models: GPT-5.2-Codex and GPT-5.1-Codex are mostly useful as fallback/reference models if newer options are unavailable.

Token cost examples: what changes with task size

Small patch

Size: 15K input + 3K output

Cheapest sensible model: GPT-5.4 mini

Upgrade when: patch touches risky data/auth/redirect behavior.

Takeaway: for small tasks, risk matters more than cents.

Medium feature

Size: 80K input + 15K output

Cheapest sensible model: GPT-5.4 or GPT-5.3-Codex

Upgrade when: root cause or scope is still unclear.

Takeaway: clarifying scope saves more money than downgrading the model.

Heavy repo task

Size: 250K input + 50K output

Cheapest sensible model: GPT-5.3-Codex (when risk is controlled)

Upgrade when: outcome affects release gating or rollback strategy.

Takeaway: isolate context before you scale model spend.
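One rough way to enforce that isolation before a heavy run is to estimate the input-token load of the files you plan to attach. A minimal sketch, assuming the common 4-characters-per-token rule of thumb (a heuristic, not an exact tokenizer) and the 250K heavy-task input size from this guide:

```python
def estimated_tokens(texts) -> int:
    """Rough input-token estimate: ~4 characters per token (heuristic only)."""
    return sum(len(t) for t in texts) // 4

def fits_budget(texts, budget: int = 250_000) -> bool:
    """True if the planned context fits the heavy-repo-task input budget."""
    return estimated_tokens(texts) <= budget

# A hypothetical planned context of one repeated PHP snippet:
files = ["<?php echo 'hi'; ?>" * 100]
print(estimated_tokens(files))  # → 475 (1,900 chars / 4)
print(fits_budget(files))       # → True
```

If the estimate blows the budget, trim the file list before picking a bigger model, not after.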

API-style example costs:

| Model | Small patch | Medium feature | Heavy repo task |
| --- | --- | --- | --- |
| GPT-5.5 | $0.165 | $0.850 | $2.750 |
| GPT-5.4 | $0.083 | $0.425 | $1.375 |
| GPT-5.4 mini | $0.025 | $0.128 | $0.413 |
| GPT-5.4 nano | $0.007 | $0.035 | $0.113 |
| GPT-5.3-Codex | $0.068 | $0.350 | $1.138 |
| GPT-5.1-Codex mini | $0.010 | $0.050 | $0.163 |

Long-context warning: the long-context example costs $5.700 on GPT-5.5 and $2.850 on GPT-5.4. Reserve long context for tasks that truly need whole-system visibility.
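The per-task figures above are straightforward arithmetic on the pricing table; a minimal sketch, with prices copied from this guide (assuming no cache hits; re-check OpenAI pricing before relying on these numbers):

```python
# (input $/1M tokens, output $/1M tokens), as quoted in this guide
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.3-Codex": (1.75, 14.00),
    "GPT-5.4 mini": (0.75, 4.50),
    "GPT-5.4 nano": (0.20, 1.25),
}

def estimate(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one uncached run."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Small patch: 15K input + 3K output
print(round(estimate("GPT-5.4 mini", 15_000, 3_000), 3))  # → 0.025
# Heavy repo task: 250K input + 50K output
print(round(estimate("GPT-5.5", 250_000, 50_000), 3))     # → 2.75
```

Run the two examples against the table above to sanity-check any model swap before a high-cost session.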

What wastes Codex allowance

  1. Vague prompts: open-ended requests create expensive exploration loops.
  2. Too much repo context: unrelated files burn input tokens before execution starts.
  3. Long-running chats: stale context increases cost turn by turn.
  4. Repeated failed loops: retries without diagnosis burn output fast.
  5. Huge command output: full dumps cost more than targeted slices.
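Item 5 is the easiest to automate: pre-filter command output before it enters the agent's context instead of pasting a full dump. A minimal sketch, where the error pattern and line cap are arbitrary choices you should tune for your logs:

```python
import re

def slice_log(text: str, pattern: str = r"error|fatal", max_lines: int = 40) -> str:
    """Keep only matching lines (case-insensitive), capped at max_lines,
    so the agent sees signal instead of the whole dump."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = [line for line in text.splitlines() if rx.search(line)]
    if len(hits) > max_lines:
        hits = hits[-max_lines:]  # keep only the most recent matches
    return "\n".join(hits)

log = "INFO boot\nERROR db timeout\nINFO retry\nFATAL db down\n"
print(slice_log(log))  # prints only the ERROR and FATAL lines
```

A 10,000-line dump collapsed to 40 relevant lines costs a fraction of the input tokens and usually debugs faster.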

Copy-paste prompts for coding-agent workflow

Choose model
Classify this task before acting.

Project: [project name]
Environment: dev unless explicitly stated otherwise
Task: [paste task]

Return:
1. Risk class A/B/C/D/E
2. Recommended model
3. Why this model is enough
4. Required context
5. Context to exclude
6. Stop condition
Reduce context
Before editing, reduce context.

Identify:
1. Exact files likely relevant
2. Commands needed to confirm the issue
3. Minimum useful log slices
4. Files that should not be read
5. Whether this needs GPT-5.5 or can be handled by GPT-5.4 mini

Do not edit files in this step.
Safe implementation handoff
You are working on dev only.

Objective:
[one clear objective]

Allowed scope:
[files/directories]

Forbidden:
- Do not touch production
- Do not modify unrelated templates
- Do not install plugins
- Do not make broad rewrites

Preflight:
- confirm pwd/site root
- run git status --short
- confirm backup or create one
- identify files before editing

Implementation:
- make the smallest patch that solves the objective
- if PHP changes, run php -l on every changed PHP file

Verification:
- test exact URL(s)
- include desktop/mobile checks if UI changed
- report commands, changed files, URLs tested, blockers, rollback notes, and final git status

Stop condition:
- stop after this task; do not continue into adjacent improvements.

Copy-paste setup: risk ratings for your agent docs

Risk rating system

A - High risk

Covers: security, database, auth, redirects, production, payments, affiliate tracking.

Default model: GPT-5.5.

Gate: backup, rollback notes, final review.

B - Medium risk

Covers: multi-file feature, uncertain bug, plugin/theme interaction.

Default model: GPT-5.3-Codex or GPT-5.4 with clear scope.

Gate: escalate if root cause is unclear.

C - Low risk

Covers: known file, reversible CSS/PHP/content fix.

Default model: GPT-5.4 mini.

Gate: targeted verification.

D - Prep work

Covers: extraction, tagging, formatting, summaries.

Default model: GPT-5.4 nano or mini.

Gate: no code edits unless explicitly approved.

E - Final gate

Covers: pre-release review, production promotion, rollback decision.

Default model: GPT-5.5.

Gate: final release decision checklist.
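If your agent docs are machine-readable, the rating table above can double as a routing map. A minimal sketch using the default models and the class-B escalation gate from this guide (adapt the names to your own stack):

```python
# Default model per risk class, taken from the rating table in this guide.
DEFAULT_MODEL = {
    "A": "GPT-5.5",        # high risk: security, prod, payments
    "B": "GPT-5.3-Codex",  # medium risk: bounded multi-file work
    "C": "GPT-5.4 mini",   # low risk: known file, reversible fix
    "D": "GPT-5.4 nano",   # prep work: no code edits unless approved
    "E": "GPT-5.5",        # final gate: release decisions
}

def route(risk_class: str, root_cause_known: bool = True) -> str:
    """Pick a default model for a rated task; escalate unclear class-B work."""
    if risk_class == "B" and not root_cause_known:
        return "GPT-5.5"  # gate from the table: escalate if root cause is unclear
    return DEFAULT_MODEL[risk_class]

print(route("C"))                           # → GPT-5.4 mini
print(route("B", root_cause_known=False))   # → GPT-5.5
```

The point is not the five lines of code; it is that the escalation rule lives in one place instead of in each prompt.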

Before Codex starts

  • Project
  • Environment
  • Site root
  • Target URL
  • Allowed files
  • Forbidden files/actions
  • Backup path
  • Risk rating
  • Model and effort
  • Verification commands
  • Stop condition
Raw Markdown template:
## Agent Task Risk Rating

A - High risk:
Security, database, auth, redirects, production, payments, affiliate tracking.
Default model: GPT-5.5.
Requires backup, rollback notes, and final review.

B - Medium risk:
Multi-file feature, uncertain bug, plugin/theme interaction.
Default model: GPT-5.3-Codex or GPT-5.4 with clear scope.
Escalate if root cause is unclear.

C - Low risk:
Known file, reversible CSS/PHP/content fix.
Default model: GPT-5.4 mini.
Requires targeted verification.

D - Prep work:
Extraction, tagging, formatting, summaries.
Default model: GPT-5.4 nano or mini.
No code edits unless explicitly approved.

E - Final gate:
Pre-release review, production promotion, rollback decision.
Default model: GPT-5.5.

## Before Codex Starts

- Project:
- Environment:
- Site root:
- Target URL:
- Allowed files:
- Forbidden files/actions:
- Backup path:
- Risk rating:
- Model and effort:
- Verification commands:
- Stop condition:

Future article idea: a dedicated coding-agent safety runbook with risk ratings, backup rules, and model routing examples.

For broader context, see Best AI Stack for Building a WordPress Affiliate Site and Review Methodology.

Final recommendation

  • GPT-5.5 for high-risk judgment and final review.
  • GPT-5.4 / GPT-5.3-Codex for bounded implementation.
  • GPT-5.4 mini for scoped fixes, logs, and QA.
  • GPT-5.4 nano for extraction/routing prep only.

Cut waste by tightening scope first, then selecting model tier.

If your task starts getting bigger mid-run, pause, re-rate risk, and reset scope before continuing.