Anand Chowdhary

It's not about how big it is, it's about how you use it

It's not about how big it is, it's about how you use it. The headline isn't size anymore; it's convergence. Closed SOTA nudges up on coding, but the gap to strong open weights is now single digits, and most of the real gains I'm seeing come from routing, post-training, and agent scaffolds, not raw scale. My take: we've entered the Systems Era, where orchestration beats brute force.

On paper, GPT-5 lands 74.9% on SWE-bench Verified with fewer tool calls than o3, and rolls out unified auto-routing between fast chat and high-effort reasoning (a toy version of that idea is sketched below). Meanwhile, gpt-oss is a pair of MoE models (117B/21B total params; ~5.1B/3.6B active) with alternating dense and locally sparse attention, grouped multi-query attention, and 128k context, landing near o4-mini quality while fitting in ~80GB/16GB of VRAM (the arithmetic is worked out below). That reads like an execution win on routing and efficiency, not just params.

Novel vs. incremental: on coding, GPT-5 vs. Claude Opus 4.1 is within ~0.4 points (74.9 vs. 74.5 on SWE-bench Verified). Strong OSS coders cluster in the high 60s with the right agent scaffolds, and the benchmark itself is scaffold-sensitive. The empirical story is clear: frontier deltas are marginal; the bigger lifts come from routing, tool use, and post-training recipes.

Zooming out, open weights just got materially easier to run. gpt-oss-20b is Apache-2.0 and tuned for 16GB-VRAM, Colab-class rigs; the MXFP4 kernels prefer Hopper/Blackwell, and some T4s will need fallbacks. With GPT-5 as the default in free ChatGPT, and third-party routers and the Responses API growing, competition tilts toward cost/perf routing and differentiation above the base model.

What builders should optimize next:

- Fair evaluation of routers and scaffolds, with statistical significance across tasks and cost-normalized win rates (see the eval sketch below)
- Post-training/RLAIF that keeps delivering without brittle reward hacking
- Sensible test-time compute budgets relative to the ~12%-per-10x-downstream-compute slope (worked out below)
- Pricing that reflects capability under routing, not just tokens

I'm convinced the center of gravity shifts from "which model" to "how you operate it": evals, budgeted routing, feedback loops, and domain data become the moat (aka the unsexy stuff that wins).
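To make the routing idea concrete, here's a minimal sketch of a budgeted router. The model names, per-token prices, and the difficulty heuristic are all placeholders; GPT-5's actual router is internal and undisclosed, so this is the shape of the idea, not the implementation:

```python
# Hypothetical budgeted router between a cheap "fast" model and an expensive
# "reasoning" model. All names, prices, and heuristics are illustrative.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    est_cost: float  # rough $ estimate for this call

FAST = ("fast-chat", 0.25 / 1e6)                # $/token, made-up pricing
REASONING = ("high-effort-reasoning", 10.0 / 1e6)  # $/token, made-up pricing

def looks_hard(prompt: str) -> bool:
    """Crude difficulty proxy; a production router would be a learned classifier."""
    signals = ("prove", "refactor", "stack trace", "edge case", "step by step")
    return len(prompt) > 2000 or any(s in prompt.lower() for s in signals)

def route(prompt: str, budget_left: float, est_tokens: int = 2000) -> Route:
    name, price = REASONING if looks_hard(prompt) else FAST
    # Degrade to the cheap model if the high-effort call would blow the budget.
    if name == REASONING[0] and price * est_tokens > budget_left:
        name, price = FAST
    return Route(name, price * est_tokens)

print(route("Fix this flaky test; here is the stack trace: ...", budget_left=0.05))
```

The interesting design question isn't the classifier; it's the budget term: a router that can say "not worth it" is what turns raw capability into cost/perf.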
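On the memory claim, the back-of-envelope math works out if you assume MXFP4-style weights at roughly 4.25 bits each (4-bit values plus one shared 8-bit scale per 32-value block); real deployments add KV cache, activations, and any higher-precision tensors on top, so treat these as lower bounds:

```python
# Back-of-envelope check on why the quoted gpt-oss sizes fit the VRAM budgets.
# Assumes ~4.25 effective bits/weight for MXFP4-style quantization; actual
# memory use is higher once KV cache and activations are included.
def weight_gb(params: float, bits_per_weight: float = 4.25) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, total, active in [("gpt-oss-120b", 117e9, 5.1e9),
                            ("gpt-oss-20b", 21e9, 3.6e9)]:
    print(f"{name}: ~{weight_gb(total):.0f} GB of weights, "
          f"only {active / total:.1%} of params active per token")
# -> gpt-oss-120b: ~62 GB (fits 80GB cards); gpt-oss-20b: ~11 GB (fits 16GB)
```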
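For the evaluation bullet, here's one sketch of what "cost-normalized win rates with stat-sig" could look like: paired results per task, a solves-per-dollar metric, and a bootstrap confidence interval. The data and the metric are illustrative, not a standard harness:

```python
# Cost-normalized, paired comparison of two scaffolds on the same task set,
# with a bootstrap CI. Data and the solves-per-dollar metric are made up.
import random

# (solved?, cost_usd) per task, paired across scaffolds A and B.
A = [(1, 0.40), (1, 0.35), (0, 0.50), (1, 0.60), (1, 0.45), (0, 0.55)]
B = [(1, 0.10), (0, 0.08), (0, 0.12), (1, 0.15), (1, 0.09), (1, 0.11)]

def solves_per_dollar(runs):
    return sum(s for s, _ in runs) / sum(c for _, c in runs)

def bootstrap_diff_ci(a, b, iters=10_000, seed=0):
    rng = random.Random(seed)
    n, diffs = len(a), []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks, keep pairs
        diffs.append(solves_per_dollar([a[i] for i in idx])
                     - solves_per_dollar([b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

lo, hi = bootstrap_diff_ci(A, B)
print(f"A - B solves/$: 95% CI [{lo:.2f}, {hi:.2f}]"
      " (call it significant only if the interval excludes 0)")
```

A scaffold that "wins" on raw pass rate can easily lose on solves per dollar, which is the comparison that matters once routing is in the loop.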
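And on compute budgets, treating that ~12%-per-10x figure as a log-linear rule of thumb (an assumption, not an exact law) makes the exponential cost curve explicit:

```python
# Quick arithmetic on a ~12-points-per-10x-compute slope, read as log-linear.
SLOPE = 12.0  # benchmark points gained per 10x more downstream compute

def compute_multiplier(points_wanted: float) -> float:
    """Compute multiple implied by a log-linear slope for a given gain."""
    return 10 ** (points_wanted / SLOPE)

for pts in (2, 6, 12):
    print(f"+{pts} pts -> ~{compute_multiplier(pts):.1f}x compute")
# +2 pts -> ~1.5x; +6 pts -> ~3.2x; +12 pts -> ~10.0x. Spend test-time compute
# only where small deltas are actually worth an exponential price.
```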