A single agent rivals multi-agent scaffolds
The mix of hour-scale online RL, lean rewards, and test-time parallel search lets a single agent rival the multi-agent scaffolds we built. With SOL reporting roughly 25x throughput for hierarchical RL and SAPO showing about 94 percent higher reward, the blocker is verifiers and infra.
Hour-scale online RL means you tweak, deploy, and get learning signals in a single afternoon, not next quarter. I have been running loops inside the editor (Cursor, in my case), so the agent trains where it works.
Short feedback cycles beat clever prompts. Shipping learns faster.
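For concreteness, here is roughly the shape of the loop, stripped to a sketch. The `collect`, `score`, and `update` callables are placeholders for whatever rollout runner, verifier, and trainer you already have; nothing here is tied to a specific framework.

```python
import time
from typing import Any, Callable, List

def online_loop(
    collect: Callable[[int], List[Any]],               # run the agent on tasks, return rollouts
    score: Callable[[Any], float],                      # verifier-backed reward for one rollout
    update: Callable[[List[Any], List[float]], None],   # apply one policy update
    hours: float = 2.0,
    batch_size: int = 32,
) -> None:
    """Tight online loop: collect, score, update, repeat until the clock runs out."""
    deadline = time.time() + hours * 3600
    step = 0
    while time.time() < deadline:
        rollouts = collect(batch_size)
        if not rollouts:
            continue  # nothing came back this round; try again
        rewards = [score(r) for r in rollouts]
        update(rollouts, rewards)
        step += 1
        print(f"step {step}: mean reward {sum(rewards) / len(rewards):.3f}")
```

The point is the cadence: every pass through the loop takes minutes, so a full run fits in an afternoon.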
Lean rewards help. Normalize rewards by length so you do not pay for word salad. Cap tool calls so the agent learns to think before it spends.
Cheaper loops, cleaner signals, fewer weird incentives. My cloud bill says thanks.
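A minimal sketch of both tricks, assuming your verifier hands back a raw pass/fail reward and you log tokens and tool calls per rollout. The field names and default budgets here are my own placeholders, not anyone's published recipe.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    raw_reward: float   # 1.0 if the verifier passed, else 0.0
    n_tokens: int       # total tokens the agent produced
    n_tool_calls: int    # tool invocations during the episode

def lean_reward(r: Rollout, target_tokens: int = 800, max_tool_calls: int = 10,
                length_weight: float = 0.2) -> float:
    """Length-normalized reward with a hard cap on tool calls."""
    if r.n_tool_calls > max_tool_calls:
        return 0.0  # blew the budget: no credit, so the policy learns to plan first
    # Penalize tokens beyond the target so verbose successes score lower than terse ones.
    overage = max(0, r.n_tokens - target_tokens) / target_tokens
    return r.raw_reward * (1.0 - length_weight * min(overage, 1.0))
```

With these defaults, a correct but 2,000-token answer scores 0.8 instead of 1.0, and anything past ten tool calls gets nothing.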
At test time, run many rollouts in parallel and keep the shortest success path. Think of it as breadth-first search with a stopwatch.
You get reliability without retraining: run the parallel search first, and the path with the fewest steps wins. Simple, effective.
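Here is the whole idea in a few lines, assuming a hypothetical `run_rollout` that returns whether the verifier passed, how many steps the rollout took, and the final answer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional, Tuple

def best_of_n(run_rollout: Callable[[int], Tuple[bool, int, Any]],
              n: int = 16, workers: int = 8) -> Optional[Any]:
    """Launch n rollouts in parallel, return the answer from the shortest successful one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_rollout, range(n)))
    successes = [(steps, answer) for ok, steps, answer in results if ok]
    if not successes:
        return None  # no rollout passed the verifier; caller can retry or escalate
    successes.sort(key=lambda pair: pair[0])  # fewest steps wins
    return successes[0][1]
```

Threads are fine here because the work is API-bound; swap in processes or async if your rollouts run locally and are heavy.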
On the numbers: SOL reports around 25x throughput for hierarchical RL loops. SAPO shows about 94 percent higher reward in benchmarks.
The model side is racing ahead. The slow part now is everything around it.
So the bottleneck moves to verifiers and infra. You need strong checkers, fast sandboxes, reproducible evals, caching, and judge models that do not rubber-stamp everything.
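Even a crude checker beats a judge that nods along. Here is a bare-bones sketch for code tasks: run the candidate plus its tests in a fresh interpreter with a hard timeout. The sandboxing and caching a real setup needs are left out; treat this as the shape, not the implementation.

```python
import os
import subprocess
import sys
import tempfile

def verify_python_solution(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run the candidate plus its tests in a fresh interpreter with a hard timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n\n" + test_code)
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # hangs count as failures
        return proc.returncode == 0  # tests raise or exit nonzero on failure
```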
Are you investing more in evaluators than in prompts? What does your RL stack look like today?
My take: single agents plus hour-scale training and smart search can replace a lot of complex scaffolds. The winners will nail verification and ops.