Rethinking epistemic rewards for LLMs
Karpathy’s food for thought is basically this: what if RL for LLMs stopped treating benchmarks as the only thing that matters and started rewarding actual epistemic gains inside the model’s head? Think rewards for:

- Reducing its own uncertainty
- Generating non-trivial new hypotheses
- Improving its calibration over time

So more CD-RLHF / CURIO / MERCI vibes, but aimed at the structure of reasoning itself, not just external tasks and leaderboards. The hard parts:

- Designing signals that don’t pay the model for hallucinated novelty
- Avoiding reward hacking, where the model learns to look “curious” instead of being useful
- Still giving it enough freedom to self-generate problems, proofs, and theories that are way richer than whatever eval suite your research engineer hacked together last quarter

(A rough sketch of what one such composite reward could look like is at the end of this post.)

For founders, this is the real opportunity space. Today we mostly RLHF models to be polite and pass tests. Tomorrow we might RL them to become better scientists, debuggers, or co-founders that actually reason.
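To make the reward-shaping idea above slightly more concrete, here is a minimal, hypothetical sketch of a composite epistemic reward: pay for uncertainty reduction, pay for calibration, and gate any novelty bonus behind an external verifier so “looking curious” doesn’t become free reward. Every name and weight here is an assumption for illustration, not an existing API or anyone’s actual training setup.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution over candidate answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def brier_score(confidence, correct):
    """Squared error between stated confidence and the 0/1 outcome."""
    return (confidence - (1.0 if correct else 0.0)) ** 2

def epistemic_reward(probs_before, probs_after, confidence, verified,
                     w_uncertainty=1.0, w_calibration=1.0, novelty_bonus=0.5):
    """Toy composite reward: uncertainty reduction + calibration + verified novelty.

    probs_before / probs_after: the model's distribution over candidate answers
      before and after a reasoning step (assumed to be observable somehow).
    confidence: probability the model assigns to its final answer.
    verified: whether an external check (tests, proof checker, retrieval)
      confirmed the claim -- the crude anti-hallucination gate.
    """
    # 1. Pay for genuinely reducing uncertainty, not for confident noise.
    uncertainty_gain = entropy(probs_before) - entropy(probs_after)

    # 2. Pay for being well calibrated: lower Brier score means higher reward.
    calibration_gain = 1.0 - brier_score(confidence, verified)

    # 3. Only pay the novelty bonus if the claim survives verification.
    novelty = novelty_bonus if verified else 0.0

    return w_uncertainty * uncertainty_gain + w_calibration * calibration_gain + novelty


# Toy usage: the model starts unsure across 4 hypotheses, reasons its way to one,
# states 0.9 confidence, and the answer checks out against the verifier.
r = epistemic_reward(
    probs_before=[0.25, 0.25, 0.25, 0.25],
    probs_after=[0.85, 0.05, 0.05, 0.05],
    confidence=0.9,
    verified=True,
)
print(f"reward = {r:.3f}")
```

The interesting design choice is the gating: the only term a model can farm without being right about something is calibration, and that one actively punishes confident nonsense. Everything beyond that (how to get the before/after distributions, what counts as a verifier) is exactly the open problem.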