Active learning is back? The real move is optimizing for expert agreement on messy policy tasks with high‑fidelity labels: up to 10,000x less data at equal or better quality. The catch is getting those labels. 🔖🔖🔖

Here's the loop I'd build:
1. Start with LLM‑0 to pseudo‑label a big pool.
2. Cluster positives and negatives separately.
3. Find overlapping clusters.
4. Sample nearest‑neighbor disagreement pairs.
5. Send those to experts.
6. Split the labels between eval and SFT.
7. Iterate until model‑human Kappa reaches expert‑expert Kappa or flatlines.

On two ads‑safety tasks, Gemini Nano‑2 at 3.25B matched or beat a 100k crowd baseline using roughly 250 to 450 expert ratings. Model‑human Kappa rose by 55 to 65 percent. The 1.8B model barely moved, which points to representation quality as the real bottleneck. Targeted curation shifted positives from about 5 percent to about 40 percent. The metric is Cohen's Kappa, with expert‑expert agreement as the ceiling: you need roughly 0.8+ pairwise Kappa to reliably beat crowd labels.

How is this different from JEST, ACED, or ACID? In my read, it's tuned for noisy, imbalanced, normatively ambiguous classification where the goal is expert alignment. Think data‑centric RLHF for classifiers. Selection mixes uncertainty and diversity via overlapping clusters and boundary pairs.

The unglamorous details matter. Embeddings and the clusterer drive outcomes, and neither is fully specified in public write‑ups. Active learning for text can be brittle; post‑hoc normalized studies sometimes find strong random sampling competitive. Wins depend on three things I care about a lot: representation quality, label fidelity, and careful loop design.

Why founders should care: if this holds beyond ads‑safety, fine‑tune economics change. You can move with hundreds of expert labels per iteration, handle drift faster, and ship smaller domain models. We're shifting from more data to better data, selected online.
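To make the selection step concrete, here's a minimal numpy sketch of the middle of that loop: cluster positives and negatives separately, flag overlapping clusters, and pull nearest cross‑label neighbor pairs as expert‑review candidates. Everything here is my stand‑in, not the paper's method: plain k‑means as the clusterer, a centroid‑distance threshold as the overlap test, and `select_disagreement_pairs` and its parameters are hypothetical names. It assumes you already have pool embeddings and an LLM‑0 pseudo‑label per item.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain k-means; stand-in for whatever clusterer is actually used."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid -> hard assignment
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

def select_disagreement_pairs(emb, pseudo, k=5, overlap_eps=1.0, n_pairs=10):
    """Cluster positives and negatives separately, find overlapping
    cluster pairs, and return nearest cross-label (boundary) pairs
    of item indices to send to experts."""
    pos_idx = np.where(pseudo == 1)[0]
    neg_idx = np.where(pseudo == 0)[0]
    pos_cent, pos_assign = kmeans(emb[pos_idx], k)
    neg_cent, neg_assign = kmeans(emb[neg_idx], k)

    pairs = []
    for i in range(k):
        for j in range(k):
            # "overlapping" = pos/neg centroids closer than a threshold
            if np.linalg.norm(pos_cent[i] - neg_cent[j]) < overlap_eps:
                P = pos_idx[pos_assign == i]
                N = neg_idx[neg_assign == j]
                if len(P) == 0 or len(N) == 0:
                    continue
                # nearest cross-label neighbors = likely boundary cases
                d = np.linalg.norm(emb[P][:, None] - emb[N][None], axis=-1)
                for flat in np.argsort(d, axis=None)[:n_pairs]:
                    pi, ni = np.unravel_index(flat, d.shape)
                    pairs.append((int(P[pi]), int(N[ni]), float(d[pi, ni])))
    pairs.sort(key=lambda t: t[2])  # closest (most ambiguous) pairs first
    return pairs[:n_pairs]
```

The returned index pairs are what would go out for expert rating; after labels come back, you split them between eval and SFT and re-run the loop.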
Expect convergence with preference pipelines like DPO where selection is active, and with weak‑supervision tooling. The hard parts become operational: you need reliable retrieval, embeddings, and clustering at scale, plus consistent expert agreement at or above 0.8. The 10,000x claim hints at scalability, but domain transfer and generative behavior are still open questions. I'd love to see community replications with transparent selection code. If it holds up, the path to cheaper, faster alignment looks more like great data ops than bigger datasets. I'm cautiously excited.
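Both the stopping rule and the 0.8 bar come down to Cohen's Kappa, which is simple enough that any replication harness can compute it from scratch. A minimal sketch (the function name is mine):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each rater's marginal label frequencies
    p_e = sum(ca[lbl] * cb[lbl] for lbl in set(a) | set(b)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
```

Run it model‑vs‑expert and expert‑vs‑expert on the same held‑out items; the loop stops when the first reaches the second, and the whole approach is only worth trusting when the expert‑expert number sits around 0.8 or above.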