Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms • Paper • arXiv:2406.02900 • Published Jun 5
Building Math Agents with Multi-Turn Iterative Preference Learning • Paper • arXiv:2409.02392 • Published about 1 month ago