This is an 8B reward model used for PPO training, trained on the UltraFeedback dataset.
For more details, read the paper:
[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).

**Built with Meta Llama 3!**

Note that Llama 3 is released under the Meta Llama 3 community license, included here under `llama_3_license.txt`.
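As background, reward models of this kind are typically trained on preference pairs (such as UltraFeedback's chosen/rejected responses) with a Bradley-Terry objective. A minimal sketch of that pairwise loss, shown here for illustration only and not taken from this repository's training code:

```python
import math

def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = chosen_score - rejected_score
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model scores the chosen response
# above the rejected one, and grows when the ranking is inverted.
print(round(bradley_terry_loss(2.0, 0.0), 4))  # small loss: correct ranking
print(round(bradley_terry_loss(0.0, 2.0), 4))  # large loss: inverted ranking
```

Minimizing this loss over many preference pairs pushes the scalar reward head to rank preferred responses higher, which is what makes the model usable as a reward signal for PPO.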
## Performance

We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):