Abstract
Automated alignment builds alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains signal accuracy by ensuring a consistent gap between chosen and rejected responses while keeping both on-policy, so that they suit the current policy model's learning capacity. SSO can benefit both online and offline training of the policy model, as well as the training of reward models. We validate the effectiveness of SSO with two foundation models, Qwen2 and Llama3.1, showing that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO yields significant performance improvements across six subjective and objective benchmarks. In addition, the preference data generated by SSO significantly improves the performance of the reward model on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
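The abstract describes signal generation only at a high level. The sketch below illustrates one way principle-conditioned, on-policy preference pairs could be formed; the prompt templates and the `policy_generate` helper are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the authors' code): sample both the chosen and
# the rejected response from the *current* policy, conditioning on a principle vs.
# a contrastive instruction, so the pair stays on-policy with a consistent gap.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response sampled under the "follow the principle" condition
    rejected: str  # response sampled under the contrastive condition


def make_preference_pair(
    prompt: str,
    principle: str,
    policy_generate: Callable[[str], str],
) -> PreferencePair:
    """Both responses come from the same current policy, keeping the signal on-policy."""
    chosen_prompt = f"{principle}\n\n{prompt}"
    rejected_prompt = f"Ignore the following principle and answer carelessly: {principle}\n\n{prompt}"
    return PreferencePair(
        prompt=prompt,
        chosen=policy_generate(chosen_prompt),
        rejected=policy_generate(rejected_prompt),
    )


if __name__ == "__main__":
    # Stand-in policy for illustration; in practice this would be the policy model.
    dummy_policy = lambda p: f"<response to: {p[:40]}...>"
    pair = make_preference_pair(
        prompt="Explain why the sky is blue.",
        principle="Be factually accurate and explain the physical mechanism.",
        policy_generate=dummy_policy,
    )
    print(pair.chosen, pair.rejected, sep="\n")
```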
Community
Minimal Human Intervention: The paper highlights the development of alignment systems that require minimal human intervention, which is a significant advantage in automating complex processes.
Automated Preference Signals: It introduces the concept of generating learnable and accurate preference signals automatically, without the need for human annotation, which addresses a major challenge in preference learning.
Self-Steering Optimization (SSO) Algorithm: The introduction of SSO is a key innovation. This algorithm autonomously generates high-quality preference signals, maintaining their accuracy by ensuring a consistent gap between chosen and rejected responses, all while keeping these responses on-policy.
Versatility in Training: SSO is applicable to both online and offline training scenarios, making it a flexible tool for enhancing the training of policy models and reward models (a minimal sketch of the resulting iterative loop follows this list).
Empirical Validation: The effectiveness of SSO is validated through experiments with two foundation models, Qwen2 and Llama3.1, demonstrating its capability to provide accurate, on-policy preference signals throughout iterative training.
Performance Improvements: Without relying on manual annotation or external models, SSO achieves notable performance improvements across multiple benchmarks, both subjective and objective.
Enhanced Reward Model Performance: The preference data generated by SSO significantly improves the performance of the reward model on RewardBench, further validating its utility.
Scalability and Efficiency: The paper concludes by presenting SSO as a scalable solution for preference optimization, which could lead to more efficient and effective methods for automated alignment in various applications.
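The versatility point above implies an alternating generate-then-train loop. Below is a minimal sketch of such a loop, assuming user-supplied `generate_pairs` and `preference_step` callables (both names are hypothetical stand-ins, not the paper's API); the same loop could drive policy optimization (e.g., a DPO-style update) or reward-model training on the generated pairs.

```python
# Sketch only: at each round, the current policy generates its own on-policy
# chosen/rejected pairs, which then drive one preference-learning update.
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def iterative_self_steering(
    prompts: List[str],
    generate_pairs: Callable[[List[str]], List[Pair]],
    preference_step: Callable[[List[Pair]], None],
    num_rounds: int = 3,
) -> None:
    """Alternate between on-policy pair generation and preference updates."""
    for _ in range(num_rounds):
        pairs = generate_pairs(prompts)  # signals stay matched to the current policy
        preference_step(pairs)           # policy (online/offline) or reward-model training


if __name__ == "__main__":
    # Stub callables for illustration only.
    demo_pairs = lambda ps: [(p, f"good:{p}", f"bad:{p}") for p in ps]
    iterative_self_steering(["Why is the sky blue?"], demo_pairs,
                            lambda pairs: print(len(pairs), "pairs"))
```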
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins (2024)
- Self-Boosting Large Language Models with Synthetic Preference Data (2024)
- CREAM: Consistency Regularized Self-Rewarding Language Models (2024)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs (2024)
- Just say what you want: only-prompting self-rewarding online preference optimization (2024)
Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0