Anand Chowdhary

GPT-5 didn't suddenly get dumber - the router did

GPT-5 didn't suddenly get dumber - the router did. Autoswitching across models made capability feel random, so folks blamed the brain instead of the traffic cop. Fixes are rolling out, and 4o is back while they patch the decision logic.

Under the hood, GPT-5 is a unified system: a fast gpt-5-main, a deeper gpt-5-thinking, and a real-time router trained on live signals (model switches, preferences, correctness). Think "cheap vs deep" with a bouncer at the door - toy sketches of the decision and the API knobs are below. The API adds light controls (reasoning_effort, verbosity, custom tools), and ChatGPT auto-escalates to "Thinking" with a manual override. Benchmarks hit 74.9% on SWE-bench Verified. On Plus, manual "Thinking" is capped at ~200 messages/week, and autoswitching doesn't count against it. It's MoR, productized.

Rollout gremlin: a sev in the autoswitcher under-selected the "think" path, so GPT-5 looked worse than 4o in practice. Add a UI that hid which path you were on and a chart mishap during the demo, and trust took a hit (because of course it did). Course-correct: the 4o picker is back, Plus caps are doubled, clearer model indicators are coming, and router decision boundaries are being tuned. Despite the bumps, adoption spiked - API traffic roughly 2× in 24 hours, with peaks near 2B tokens/min. Founders noticed, and so did their cloud bills.

Bigger picture: routing is the new UX. When the gate misfires, users perceive "IQ drift" even if the base models improve. The router learns from engagement and correctness; transparency and manual overrides are the safety valves until a single model can subsume both paths. On the enterprise side, Copilot moved to GPT-5 across platforms, and Priority Processing brings explicit latency SLAs. For agentic stacks, stable TTFT beats a higher mean - jitter breaks the illusion of intelligence faster than almost anything else (quick illustration below).

Three practical questions I'm thinking about:

1) How is misrouting defined, measured, and corrected online without feedback loops dragging quality to the mean?
2) What's the right "think" budget policy - per-task class, per-user, or session-level bandits? (Toy sketch below.)
3) What TTFT/throughput will Priority actually hit under bursty load, and how should apps degrade gracefully? (Also sketched below.)

Net-net: the model got better, the router stumbled, and the lesson is timeless - when you ship a chooser, you're shipping a product. Measure it, expose it, and give power users an off-switch.
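To make the "cheap vs deep with a bouncer" mental model concrete, here's a toy routing sketch. It's entirely hypothetical - the real router is a learned model trained on live signals, not keyword heuristics - but it shows the shape of the decision and why surfacing the reason matters.

```python
# Toy illustration of "cheap vs deep" routing. Everything here is hypothetical:
# the real GPT-5 router is learned from live signals (switches, preferences,
# measured correctness), not a handful of keyword heuristics.
from dataclasses import dataclass

@dataclass
class RouteDecision:
    model: str   # which backend to call
    reason: str  # why, so the UI can expose it (transparency as a feature)

HARD_HINTS = ("prove", "step by step", "debug", "optimize", "why does")

def route(prompt: str, user_forced_thinking: bool = False) -> RouteDecision:
    """Pick the cheap or deep path; a manual override always wins."""
    if user_forced_thinking:
        return RouteDecision("gpt-5-thinking", "manual override")
    # Stand-in for a learned difficulty score in [0, 1].
    score = min(1.0, len(prompt) / 2000 + 0.2 * sum(h in prompt.lower() for h in HARD_HINTS))
    if score >= 0.5:
        return RouteDecision("gpt-5-thinking", f"estimated difficulty {score:.2f}")
    return RouteDecision("gpt-5-main", f"estimated difficulty {score:.2f}")

print(route("Why does my quicksort degrade to O(n^2) on sorted input? Debug it step by step."))
```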
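And the "light controls" themselves, roughly as I understand them - a minimal sketch assuming the openai Python SDK's Responses surface with the reasoning-effort and verbosity knobs described at launch; exact parameter names and shapes may differ by endpoint or SDK version.

```python
# Hedged sketch of the light API controls the post mentions (reasoning effort,
# verbosity). Parameter shapes are my assumption of the Responses API surface
# at GPT-5 launch and may not match your SDK version exactly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5",
    input="Summarize the trade-offs of routing between a fast and a deep model.",
    reasoning={"effort": "minimal"},  # cheap path: spend little on hidden reasoning
    text={"verbosity": "low"},        # keep the answer short
)
print(resp.output_text)
```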
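On "stable TTFT beats a higher mean": here's a quick way to see why, using made-up latency samples. The spiky series wins on mean but loses where agents actually live - the tail.

```python
# Quantifying TTFT jitter from time-to-first-token samples (milliseconds).
# The sample data is invented purely for illustration.
import statistics

def ttft_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarize time-to-first-token: agents feel the p95 and the spread, not the mean."""
    ordered = sorted(samples_ms)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p95_ms": p95,
        "stdev_ms": statistics.pstdev(samples_ms),  # jitter proxy
    }

steady = [500, 510, 495, 505, 500, 498, 502, 507, 493, 500]  # higher mean, tiny spread
spiky = [150, 160, 155, 145, 150, 158, 152, 148, 156, 3000]  # lower mean, brutal tail

print("steady:", ttft_report(steady))
print("spiky: ", ttft_report(spiky))
```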
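For question 2, a per-task-class bandit is one plausible shape for a "think budget" policy. Below is a toy epsilon-greedy version - my sketch, not anything OpenAI has described - where a little forced exploration is exactly what keeps the feedback loop from dragging quality to the mean.

```python
# Toy epsilon-greedy bandit per task class: choose between the cheap and deep
# path and learn from observed reward (e.g. task success, thumbs-up).
# Entirely hypothetical - not the production router.
import random
from collections import defaultdict

ARMS = ("main", "thinking")
EPSILON = 0.1  # exploration rate keeps estimates honest for both arms

counts = defaultdict(lambda: {a: 0 for a in ARMS})
values = defaultdict(lambda: {a: 0.0 for a in ARMS})

def choose(task_class: str) -> str:
    if random.random() < EPSILON:
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: values[task_class][a])

def update(task_class: str, arm: str, reward: float) -> None:
    counts[task_class][arm] += 1
    n = counts[task_class][arm]
    values[task_class][arm] += (reward - values[task_class][arm]) / n  # running mean

# Simulated feedback: "code" tasks benefit from the deep path, "chitchat" doesn't.
true_reward = {("code", "thinking"): 0.9, ("code", "main"): 0.5,
               ("chitchat", "thinking"): 0.6, ("chitchat", "main"): 0.62}
for _ in range(2000):
    tc = random.choice(("code", "chitchat"))
    arm = choose(tc)
    update(tc, arm, float(random.random() < true_reward[(tc, arm)]))

print({tc: values[tc] for tc in ("code", "chitchat")})
```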
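And for question 3, one way an app could degrade gracefully when the deep path gets slow under bursty load: race it against a deadline and fall back to the fast path. The model names, sleep times, and deadline here are placeholders, not an official pattern.

```python
# Graceful degradation sketch: try the deep path under a deadline, fall back
# to the fast path on timeout instead of failing the request.
import asyncio

async def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real streaming API call; the deep path is slower.
    await asyncio.sleep(2.5 if model == "gpt-5-thinking" else 0.3)
    return f"[{model}] answer to: {prompt!r}"

async def answer(prompt: str, deep_deadline_s: float = 1.0) -> str:
    try:
        return await asyncio.wait_for(call_model("gpt-5-thinking", prompt), deep_deadline_s)
    except asyncio.TimeoutError:
        return await call_model("gpt-5-main", prompt)  # degrade, don't fail

print(asyncio.run(answer("Plan a three-step migration off the legacy queue.")))
```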