WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827, June 2024)
Bootstrapping Language Models with DPO Implicit Rewards (arXiv:2406.09760, June 2024)
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM (arXiv:2406.12168, June 2024)
Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979, June 2024)
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629, June 2024)
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation (arXiv:2406.18676, June 2024)
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782, July 2024)
Direct Preference Knowledge Distillation for Large Language Models (arXiv:2406.19774, June 2024)