This morning I went through the FACTS benchmark, where Gemini 3 Pro hits 68.8%. I see it like this: there are basically three personalities.

1. GPT-style models: "Move fast and answer things." High attempt rate, high error rate. They shoot their shot on almost everything, which looks impressive in a demo, until you realize how many confident wrong answers slip through.

2. Claude-style models: "If you don't say anything, you can't be wrong." Low attempt rate, low contradiction rate. Feels safer, but you pay for it with a lot of "I'm not sure" when you actually needed an opinion.

3. Multimodal models: still below 50% factual. We're acting like adding images is a cheat code, but the factual core is still shaky.

Most of what we're doing today is fiddling with decision thresholds on top of a stochastic nonsense generator: turn the knob toward recall and you get spicy hallucinations; turn it toward precision and your product goes quiet. We haven't yet gotten a model stable enough that these knobs behave like controls, not band-aids.
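To make the knob concrete, here's a minimal toy sketch of confidence-threshold abstention, which is one common way this tradeoff gets implemented. Everything here is hypothetical: the `Prediction` class, the threshold `tau`, and the toy data are mine for illustration, not from FACTS or any of these models.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    confidence: float  # model's self-reported confidence in [0, 1]
    correct: bool      # ground truth, only known at eval time

def evaluate(preds: list[Prediction], tau: float) -> dict[str, float]:
    """Score a batch at abstention threshold tau: answer iff confidence >= tau."""
    attempted = [p for p in preds if p.confidence >= tau]
    if not attempted:
        return {"attempt_rate": 0.0, "precision": 0.0}
    return {
        "attempt_rate": len(attempted) / len(preds),
        # precision among attempted answers; abstentions are neither right nor wrong
        "precision": sum(p.correct for p in attempted) / len(attempted),
    }

# Toy batch (hypothetical numbers): one model, three knob settings.
preds = [
    Prediction("A", 0.95, True),
    Prediction("B", 0.80, True),
    Prediction("C", 0.60, False),  # confident-ish and wrong
    Prediction("D", 0.40, False),
    Prediction("E", 0.30, True),
]
for tau in (0.2, 0.5, 0.9):
    print(tau, evaluate(preds, tau))
```

Low tau is the GPT-style personality (high attempt rate, wrong answers slip through); high tau is the Claude-style one (quiet but precise). Same underlying model either way, which is exactly the problem: the knob changes the personality, not the factual core.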