Anand Chowdhary

GPT‑Image‑1.5 topping all the leaderboards smells a lot like our current “generalist” image benchmarks are overfitting to what is easy to measure: text fidelity, prompt literalism, and pairwise “vibe” comparisons. Basically: does it look like what the prompt said, and will a crowd worker click left or right?

I don’t think that’s how real workflows look. Actual creators care about boring, painful stuff like:

- Editing the same asset 20 times without it drifting
- Subject and style consistency across a whole campaign
- Production‑ready aesthetics that survive compression, crops, and a grumpy art director

Right now, our evals are quietly converging on what we can score with a simple rubric or a quick model call, not on what professionals optimize for when money and deadlines are involved.

My cofounder @carlobadini is a lot more into image & video models than I am, but if I were building in this space I’d treat public leaderboards as unit tests. Pass them, sure. But then design your own evals around the ugly edge cases from real user projects. That is where the moat lives.
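
To make that concrete, here’s a rough sketch of what one of those “own evals” could look like: the first bullet above, repeated edits of the same asset without drift, scored with CLIP embedding similarity against the original. Everything here is an assumption for illustration, not a real pipeline: edit_image is a placeholder for whatever edit API you actually call, and the 0.85 threshold and CLIP-as-consistency-metric are stand-ins you’d replace with whatever your art director actually complains about.

```python
# Sketch of a "no drift across repeated edits" eval.
# Assumptions: edit_image() is a placeholder for your model's edit call,
# and CLIP cosine similarity is a stand-in consistency metric.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    # Normalized CLIP image embedding.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def drift_eval(original: Image.Image, edit_prompts: list[str], threshold: float = 0.85):
    """Chain each edit onto the previous result and check the subject
    never drifts too far from the original asset."""
    anchor = embed(original)
    current = original
    scores = []
    for prompt in edit_prompts:
        current = edit_image(current, prompt)  # placeholder: your model's edit API
        score = (anchor @ embed(current).T).item()  # cosine sim vs. the original
        scores.append(score)
    return {"scores": scores, "passed": all(s >= threshold for s in scores)}
```

The point isn’t this exact metric. It’s that the eval mirrors a real job (20 edits, one campaign, one grumpy reviewer) instead of a single prompt-to-image comparison.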