Anand Chowdhary

I like that GPT‑5 isn’t just another bigger model

I like that GPT‑5 isn’t just another “bigger model.” It’s a system: a router that switches between a fast main model and a deeper reasoning model, with test‑time compute as a knob - and pricing that nudges you to actually turn it.

Under the hood, ChatGPT routes between gpt‑5‑main (fast) and gpt‑5‑thinking (deeper) using a real‑time router trained on preference and correctness signals (yes, really!). It tries to pick when to go wide vs. think hard - so you don’t have to babysit every call.

For builders, the API ships gpt‑5, gpt‑5‑mini, and gpt‑5‑nano, plus a new reasoning_effort=minimal|low|medium|high and a verbosity control. You get a dial, not a black box. Max context is 400K tokens (≈272K input + 128K for reasoning+output). Those “invisible” reasoning tokens are billed as output, so effort maps directly to cost.

Benchmarks tell a cautious story. GPT‑5 reports 74.9% on SWE‑bench Verified (excluding 23/500 infra‑problem cases) and 97% on τ²‑bench for tool use, with strong long‑context MRCR gains. Solid numbers, mostly incremental over the o‑series. Also, some presentation hiccups were flagged by the community, so I’m reading the charts with a critical eye. Encouraged, not starry‑eyed.

The strategy feels deliberate: collapse model sprawl behind a router, then compete on unit economics. Pricing at $1.25/M in and $10/M out (mini $0.25/$2, nano $0.05/$0.40) undercuts many - especially if prompt caching kicks in on repeated prefixes.

Day‑0 momentum matters. Cursor rolled it out immediately, and Microsoft lit up Copilot across 365, GitHub (ahem, before even OpenAI could announce the new model), and Azure. Distribution still wins.

Open questions I’m eager to test:

- How robust is the router - misroutes, “think hard” sensitivity, tool‑need detection?
- Can we budget reasoning tokens deterministically in production and get per‑turn traces?
- How does the 400K window handle messy enterprise docs vs. 1M‑context competitors?

If you’re running evals, please share traces and failure modes.

My take: if the router is stable and the knobs are predictable, this could feel magical. If not, it’s just an expensive thought experiment.
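For the curious, here’s a minimal sketch of what those dials look like from the API. I’m assuming the Responses API shape from the launch docs - reasoning effort and verbosity as nested fields, and a max_output_tokens cap that I believe covers hidden reasoning too - so treat the exact parameter names as assumptions and check the current SDK docs before copying this anywhere.

```python
# Minimal sketch: calling gpt-5 with the effort and verbosity dials.
# Parameter shapes assumed from launch-day docs; verify against the current SDK.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    input="Explain when a router should escalate to the reasoning model.",
    reasoning={"effort": "minimal"},  # minimal | low | medium | high
    text={"verbosity": "low"},        # terse vs. expansive answers
    max_output_tokens=2048,           # assumed to cap visible output + hidden reasoning
)

print(response.output_text)
# Reasoning tokens don't appear in the text, but they're billed as output,
# so the per-turn usage breakdown is what you'd watch in production.
print(response.usage)
```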
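And to make the cost point concrete: because reasoning tokens are billed as output tokens, the effort knob shows up directly on the invoice. Back-of-envelope math using the list prices above (the token counts here are made up for illustration):

```python
# USD per 1M tokens: (input, output), from the launch pricing quoted above.
PRICES = {
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}

def estimate_cost(model: str, input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Rough per-call cost: hidden reasoning tokens are counted as output."""
    in_rate, out_rate = PRICES[model]
    billable_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * in_rate + billable_output * out_rate) / 1_000_000

# Hypothetical call: 20K-token prompt, 1K visible answer, 8K hidden reasoning at high effort.
print(f"${estimate_cost('gpt-5', 20_000, 1_000, 8_000):.4f}")  # -> $0.1150
```

Prompt caching would discount the repeated-prefix part of the input, which isn’t modeled here.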