snorkelai/Snorkel-Mistral-PairRM-DPO

@@ -1,15 +1,17 @@
 ---
 license: apache-2.0
 datasets:
-- openbmb/UltraFeedback
 ---
 Original post: [Snorkel link]
-#### Dataset:
-ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
-#### Methodology:
   1. For each prompt in a subset of 20,000, generate five response variations using the LLM. To start, we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
   2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) for response reranking.
   3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
@@ -18,15 +20,12 @@ ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFac
 This overview provides a high-level summary of our approach.
 We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
-#### Key Premises:
 - **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
 - **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes. We call this **Programmatic Alignment** - capturing domain knowledge in programmatic forms that can be used to guide LLM improvement.
-#### Contemporary Work and Acknowledgements:
-#### Applications:
 Our customers align LLMs to very specific use cases, whereas
 the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions.
 Thus, for this demonstration, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
@@ -39,7 +38,7 @@ that reflect your enterprises' needs**, please contact the Snorkel AI team or co
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
 to learn more about "Programmatically scaling human preferences and alignment in GenAI".
-#### Result:
 On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
@@ -51,13 +50,13 @@ We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the fu
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
 Moving forward, we anticipate further contributions from the community regarding new alignment axes, and we will conduct evaluations using other appropriate benchmarks.
-## Limitations:
 The model is a quick demonstration that LLMs can be programmatically aligned using smaller specialized reward models.
 It does not have any moderation mechanisms.
 We look forward to continuing to engage with the research community and our customers to explore optimal methods for getting models to respect guardrails,
 allowing for deployment in environments requiring moderated outputs.
-## Acknowledgments:
 - The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
 - The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach.
 - The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model.
@@ -67,5 +66,5 @@ which proposes a similar general approach for creating alignment pairs from a la
 While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
 enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
-## The Snorkel AI Team
 Hoang Tran, Chris Glaze, Braden Hancock

 ---
 license: apache-2.0
 datasets:
+- snorkelai/Snorkel-Mistral-Self-Improvement
 ---
 Original post: [Snorkel link]
+### Dataset:
+Training dataset: [snorkelai/Snorkel-Mistral-Self-Improvement](link)
+We utilize ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
+### Methodology:
   1. For each prompt in a subset of 20,000, generate five response variations using the LLM. To start, we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
   2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) for response reranking.
   3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
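Steps 2 and 3 above turn five candidates per prompt into a single (chosen, rejected) pair. The following is a minimal sketch of that selection, not the pipeline used for this model. It assumes a hypothetical `compare(prompt, a, b)` callable that returns a positive score when `a` is preferred over `b`; this stands in for a pairwise reward model such as PairRM and is not PairRM's actual API. Summing pairwise margins is one simple aggregation choice among several.

```python
from typing import Callable, Dict, List

# Hypothetical scorer signature (an assumption for illustration):
# returns a positive number when response `a` is preferred over `b`.
PairwiseScorer = Callable[[str, str, str], float]


def rank_candidates(prompt: str, candidates: List[str],
                    compare: PairwiseScorer) -> List[str]:
    """Order candidates best-first by summing pairwise preference margins."""
    totals = [0.0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j:
                totals[i] += compare(prompt, a, b)
    order = sorted(range(len(candidates)), key=lambda i: totals[i], reverse=True)
    return [candidates[i] for i in order]


def build_preference_pair(prompt: str, candidates: List[str],
                          compare: PairwiseScorer) -> Dict[str, str]:
    """Keep the top-ranked response as 'chosen' and the bottom as 'rejected'."""
    ranked = rank_candidates(prompt, candidates, compare)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


def toy_compare(prompt: str, a: str, b: str) -> float:
    """Toy stand-in for a reward model: prefers the longer response."""
    return float(len(a) - len(b))


pair = build_preference_pair(
    "Explain DPO briefly.",
    ["a", "bbb", "cc", "dddd", "eeeee"],
    toy_compare,
)
```

With the toy scorer, the longest candidate becomes `chosen` and the shortest becomes `rejected`; a real reward model would replace `toy_compare`.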
 This overview provides a high-level summary of our approach.
 We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
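For readers unfamiliar with the DPO update in step 3, here is a small sketch of the per-pair DPO loss as defined in the Direct Preference Optimization paper. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name and the value `beta=0.1` are assumptions for illustration; this card does not state the hyperparameters used.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = (policy - reference) log-ratio of the chosen response
             minus the same log-ratio of the rejected response.
    beta=0.1 is an illustrative default, not a value taken from this card.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# The policy equals the reference: the margin is zero and the loss is log(2).
loss_at_start = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# The policy favors the chosen response more than the reference does: lower loss.
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.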
+### Key Premises:
 - **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
 - **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes. We call this **Programmatic Alignment** - capturing domain knowledge in programmatic forms that can be used to guide LLM improvement.
+### Applications:
 Our customers align LLMs to very specific use cases, whereas
 the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions.
 Thus, for this demonstration, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
 to learn more about "Programmatically scaling human preferences and alignment in GenAI".
+### Result:
 On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
 Moving forward, we anticipate further contributions from the community regarding new alignment axes, and we will conduct evaluations using other appropriate benchmarks.
+### Limitations:
 The model is a quick demonstration that LLMs can be programmatically aligned using smaller specialized reward models.
 It does not have any moderation mechanisms.
 We look forward to continuing to engage with the research community and our customers to explore optimal methods for getting models to respect guardrails,
 allowing for deployment in environments requiring moderated outputs.
+### Contemporary Work and Acknowledgements:
 - The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
 - The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach.
 - The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model.
 While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
 enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
+### The Snorkel AI Team
 Hoang Tran, Chris Glaze, Braden Hancock