Discussion about this post

Barada Sahu

A few observations on the space. IMO, the feedback stack is the profit stack. Everyone's talking about bigger GPUs, but the scarcer asset is trusted reward data. OpenAI’s codex-style unit-test corpora, Anthropic’s constitutional critics, and DeepSeek’s cold-start chain-of-thought libraries now function like proprietary datasets. They compound: better models produce richer traces → richer traces feed better graders → graders mint harder tasks → the moat widens. In some sense this is meta-learning for the models: as models produce better hypotheses, the next generation of models gets trained on them.

We are fixated on whether an answer is correct. For RL, what matters is how fine-grained you can score partial progress.

- In chess and Go we reward every play via Elo deltas.

- Math proofs reward only at Q.E.D.

- Coding sits in a sweet spot in the middle: unit tests emit dense gradients (passing 7/10 tests earns partial credit); see the sketch after this list.

- For domains with sparse signals, we will lag unless we invent dense synthetic metrics.
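To make the coding point concrete, here is a minimal sketch of a partial-credit reward built from unit tests. It assumes pytest as the test runner; the function name and interface are hypothetical, not anything from a specific lab's pipeline:

```python
import subprocess


def unit_test_reward(test_files: list[str], timeout_s: int = 60) -> float:
    """Fraction of unit tests that pass: a dense partial-credit reward.

    Passing 7/10 tests yields 0.7 instead of a single pass/fail bit,
    which gives the policy a gradient toward partially working code.
    """
    passed = 0
    for test_file in test_files:
        try:
            # Run each test file in isolation against the candidate solution.
            result = subprocess.run(
                ["python", "-m", "pytest", "-x", test_file],
                capture_output=True,
                timeout=timeout_s,
            )
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a hung test counts as a failure
    return passed / len(test_files) if test_files else 0.0
```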

The value in the next cycle won’t come from shipping new LLMs; it will come from commoditizing grading as a service. Think “Stripe for rewards”: plug-in APIs that mint dense, tamper-resistant reward signals for any digital domain. Whoever nails cross-domain, adversarially resistant grading will dictate the pace of reasoning progress more than GPU fabs.
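One way to read "grading as a service" concretely is as a pluggable interface between a training loop and an external grader. Everything below is a hypothetical sketch (the names RewardSignal, Grader, and rl_step are made up for illustration), not a description of any existing product:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class RewardSignal:
    score: float                  # dense score in [0, 1], not just pass/fail
    breakdown: dict[str, float]   # per-criterion partial credit
    audit_id: str                 # handle for later verification / tamper-resistance


class Grader(Protocol):
    """A pluggable grading endpoint: give it a task and a model trace,
    get back a dense, auditable reward signal."""

    def grade(self, task: str, trace: str) -> RewardSignal: ...


def rl_step(grader: Grader, task: str, model_output: str) -> float:
    """The training loop only consumes the scalar; the breakdown stays auditable."""
    signal = grader.grade(task, model_output)
    return signal.score
```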

Oliver Sourbut

Nice post! A very relevant related concept is *'exploration/experimentation'*.

Learning systems gather new knowledge and insights from observations/data. Random or arbitrary data aren't especially helpful! The data needs to tell you something new that you didn't already know - so it pays to deliberately seek out novel and informative observations.

In contemporary frontier AI systems, it's been mostly humans responsible for gathering that 'high quality data', often in quite hands-off ways like scraping huge datasets from the internet, but latterly with more attention on procurement and curation of especially informative or exemplary data.

With RL, the data coming in starts to rely increasingly on the activity of the system itself - together with whatever grading mechanism is in place (which is what you foreground here). That's why so many RL conversations of the past were obsessed with exploration: taking judicious actions to get the most informative observations!
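As a toy illustration of "taking judicious actions to get the most informative observations", here is a simple count-based exploration bonus. This is a standard RL trick offered as an assumption-laden sketch, not something specific to the systems discussed above:

```python
import math
from collections import defaultdict


class CountBasedExplorer:
    """Adds a novelty bonus that decays as a state is revisited,
    nudging the agent toward observations it has seen least often."""

    def __init__(self, bonus_scale: float = 1.0):
        self.visits = defaultdict(int)   # N(s): visit count per (hashable) state
        self.bonus_scale = bonus_scale

    def shaped_reward(self, state, extrinsic_reward: float) -> float:
        self.visits[state] += 1
        # Bonus ~ 1/sqrt(N(s)): large for novel states, vanishing for familiar ones.
        bonus = self.bonus_scale / math.sqrt(self.visits[state])
        return extrinsic_reward + bonus
```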

Still, in RL, the human engineers are able to curate training environments with high-signal automated feedback systems, as you've discussed here. On the other hand, once you're talking about activities like R&D of various kinds, the task of exploring *is inherently most of the task itself*, making within-context exploration essential! This makes 'learning to learn' or 'learning to explore/experiment' among the most useful ways to operationalise 'AGI', from my perspective. (Of course there are nevertheless also many transformative impacts that can come from AI merely with heaps of crystallised intelligence and less R&D ability.)

I'm pretty uncertain on how domain-generalisable this meta skill of 'being good at learning to explore' will turn out to be. I think the evidence from humans and orgs is that it's somewhat generalisable (e.g. IQ, g-factors, highly productive R&D orgs which span domains, the broader institution of 'science and technology' post those particular revolutions), but that domain-specific research taste, especially at frontiers, is also something that's necessarily acquired, or at least mastered, through domain-specific experience.
