Internal Consistency and Self-Feedback in Large Language Models: A Survey • arXiv:2407.14507 • Published Jul 19, 2024
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? • arXiv:2407.04842 • Published Jul 5, 2024
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning • arXiv:2407.00782 • Published Jun 30, 2024
Direct Preference Knowledge Distillation for Large Language Models • arXiv:2406.19774 • Published Jun 28, 2024
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning • arXiv:2407.00617 • Published Jun 30, 2024
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs • arXiv:2406.18629 • Published Jun 26, 2024
Aligning Teacher with Student Preferences for Tailored Training Data Generation • arXiv:2406.19227 • Published Jun 27, 2024
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data • arXiv:2406.18790 • Published Jun 26, 2024
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt • arXiv:2406.16377 • Published Jun 24, 2024
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation • arXiv:2406.16855 • Published Jun 24, 2024
OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far? • arXiv:2406.16772 • Published Jun 24, 2024
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges • arXiv:2406.12624 • Published Jun 18, 2024
Understanding Alignment in Multimodal LLMs: A Comprehensive Study • arXiv:2407.02477 • Published Jul 2, 2024
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning • arXiv:2407.18248 • Published Jul 25, 2024