---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
language:
- en
---
This model was released for our paper *REBEL: Reinforcement Learning via Regressing Relative Rewards*. Please refer to our repository for more details.
# REBEL-Llama-3
This model was trained with REBEL on the UltraFeedback dataset, starting from Meta-Llama-3-8B-Instruct and using FsfairX-LLaMA3-RM-v0.1 as the reward model. The training code is available at https://github.com/ZhaolinGao/REBEL.
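Since the model is fine-tuned from Meta-Llama-3-8B-Instruct, it can be used as a standard chat model with Hugging Face `transformers`. The snippet below is a minimal usage sketch: the `model_id` is a placeholder for this model's Hugging Face repository path, and the generation settings are illustrative rather than those used in our evaluations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "REBEL-Llama-3"  # placeholder: replace with this model's actual Hub repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The model inherits the Llama-3-Instruct chat format, so use the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain reinforcement learning in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```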
## Links to Other Models
## AlpacaEval 2.0 Evaluations

| Model | AlpacaEval 2.0 LC Win Rate (%) | AlpacaEval 2.0 Win Rate (%) |
|---|---|---|
| REBEL-OpenChat-3.5 | 17.3 | 12.8 |
| REBEL-Llama-3 | 30.1 | 32.6 |
## MT-Bench Evaluations

| Model | MT-Bench 1st Turn | MT-Bench 2nd Turn | MT-Bench Average |
|---|---|---|---|
| REBEL-OpenChat-3.5 | 8.54 | 7.58 | 8.06 |
| REBEL-Llama-3 | 8.63 | 7.69 | 8.16 |
## Open LLM Leaderboard Evaluations

| Model | MMLU (5-shot) | GSM8K (5-shot) | ARC (25-shot) | Winogrande (5-shot) | TruthfulQA (0-shot) | HellaSwag (10-shot) | Average |
|---|---|---|---|---|---|---|---|
| REBEL-OpenChat-3.5 | 63.7 | 68.8 | 64.3 | 80.4 | 48.2 | 85.0 | 68.4 |
| REBEL-Llama-3 | 65.8 | 75.6 | 61.7 | 75.8 | 51.7 | 78.8 | 68.2 |
## Citation
Please cite our paper if you use this model in your own work:
@misc{gao2024rebel,
title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
year={2024},
eprint={2404.16767},
archivePrefix={arXiv},
primaryClass={cs.LG}
}