Discussion about this post

The AI Architect:

Really solid walkthrough of where RL is right now. The shift from RLHF to RLAIF and now RLVR kinda mirrors a broader trend toward automating the entire training pipeline, but what's interesting is the epistemological trade-off. Verifiable rewards sound clean, but they only work in domains where ground truth is computable. That limitation probably explains why alignment and reasoning are diverging into separate training regimes: one still needs human judgement, the other can rely on formal correctness.
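To make the "computable ground truth" point concrete, here's a minimal sketch of what a verifiable reward can look like for a math-answer task. The function name, the boxed-answer convention, and the exact-match rule are all illustrative assumptions, not anything from the post:

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical RLVR-style reward: 1.0 if the completion's final
    boxed answer exactly matches the known solution, else 0.0.
    This only works because the task has a computable ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

# Example: returns 1.0
verifiable_reward(r"... so the answer is \boxed{42}", "42")
```

Contrast this with RLHF, where the reward comes from a learned model of human preference: the check above is exact but exists only where correctness can be computed, which is the trade-off the comment points at.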
