Update README.md
README.md
CHANGED
@@ -19,8 +19,10 @@ license: apache-2.0
Tulu is a series of language models that are trained to act as helpful assistants.
Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
This is a **value** model produced during the PPO training of [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean) model.
+ It was initialised from the [Tulu v2.5 13B UltraFeedback RM](https://huggingface.co/allenai/tulu-v2.5-13b-uf-rm).
We release the value model as it may provide a good starting point for additional research or improved decoding with our released PPO models.
+
For more details, read the paper:
[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
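As a rough illustration of the "improved decoding" use case, the sketch below reranks candidate completions by their estimated value. It is only a sketch: it assumes the checkpoint loads as a scalar-output sequence classifier via `AutoModelForSequenceClassification` and that inputs follow the Tulu chat format; the placeholder model id must be replaced with this repository's path.

```python
# Hypothetical reranking sketch, not the released usage: assumes the value model
# loads as a scalar-output sequence classifier and scores prompt+completion text.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "<path or hub id of this value model>"  # placeholder, replace with this repo

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=1, torch_dtype=torch.bfloat16
)
model.eval()

# Tulu-style chat formatting for the prompt (see the chat template in this repo).
prompt = "<|user|>\nWhat is the capital of France?\n<|assistant|>\n"
candidates = ["Paris.", "Possibly Lyon, but I am not sure."]

scores = []
for completion in candidates:
    inputs = tokenizer(prompt + completion, return_tensors="pt")
    with torch.no_grad():
        # Treat the scalar head output as the estimated value of the full sequence.
        scores.append(model(**inputs).logits[0, 0].item())

best = candidates[scores.index(max(scores))]
print(best, scores)
```

In a more realistic setup the candidates would be sampled from one of the released PPO policies, but the scoring call stays the same.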
@@ -53,7 +55,7 @@ We have included a [chat template](https://huggingface.co/docs/transformers/main
## Intended uses & limitations

The model was initially fine-tuned on a filtered and preprocessed version of the [Tulu V2 mix dataset](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture), which contains a diverse range of human-created instructions and synthetic dialogues generated primarily by other LLMs.
- We then further trained the model with a [Jax
+ We then further trained the model with a [Jax PPO trainer](https://github.com/hamishivi/EasyLM/blob/main/EasyLM/models/llama/llama_train_ppo.py) built on [EasyLM](https://github.com/young-geng/EasyLM) on the dataset mentioned above.
This model is meant as a research artefact.

### Training hyperparameters
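For context on what a value model is optimised for during PPO training, here is the standard clipped value-function loss as a small JAX sketch. This is the generic textbook form, not necessarily the exact loss implemented in the linked EasyLM trainer.

```python
# Generic clipped PPO value loss (a sketch; the EasyLM trainer may differ in details).
import jax.numpy as jnp

def ppo_value_loss(values, old_values, returns, clip_range=0.2):
    """Mean over tokens of the max of unclipped and clipped squared errors."""
    # Clip new value predictions to stay close to those from the rollout phase.
    clipped = old_values + jnp.clip(values - old_values, -clip_range, clip_range)
    unclipped_err = (values - returns) ** 2
    clipped_err = (clipped - returns) ** 2
    return 0.5 * jnp.mean(jnp.maximum(unclipped_err, clipped_err))
```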