hamishivi committed
Commit f73c6eb
1 Parent(s): b01abe0

Update README.md

Files changed (1):
1. README.md +3 -1
README.md CHANGED

@@ -19,8 +19,10 @@ license: apache-2.0
 Tulu is a series of language models that are trained to act as helpful assistants.
 Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
 This is a **value** model produced during the PPO training of [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean) model.
+It was initialised from the [Tulu v2.5 13B UltraFeedback RM](https://huggingface.co/allenai/tulu-v2.5-13b-uf-rm).
 We release the value model as it may provide a good starting point for additional research or improved decoding with our released PPO models.
 
+
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
 
@@ -53,7 +55,7 @@ We have included a [chat template](https://huggingface.co/docs/transformers/main
 ## Intended uses & limitations
 
 The model was initially fine-tuned on a filtered and preprocessed version of the [Tulu V2 mix dataset](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture), which contains a diverse range of human-created instructions and synthetic dialogues generated primarily by other LLMs.
-We then further trained the model with a [Jax RM trainer](https://github.com/hamishivi/EasyLM/blob/main/EasyLM/models/llama/llama_train_rm.py) built on [EasyLM](https://github.com/young-geng/EasyLM) on the dataset mentioned above.
+We then further trained the model with a [Jax PPO trainer](https://github.com/hamishivi/EasyLM/blob/main/EasyLM/models/llama/llama_train_ppo.py) built on [EasyLM](https://github.com/young-geng/EasyLM) on the dataset mentioned above.
 This model is meant as a research artefact.
 
 ### Training hyperparameters
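
For readers who want to experiment with the released value model, below is a minimal sketch of per-token value scoring. The repository id is hypothetical (the README never names this repo), and loading through `AutoModelForTokenClassification` with a single scalar head (`num_labels=1`) is an assumption inferred from the page's "Token Classification" pipeline tag, not something the README confirms; check the released config before relying on it.

```python
# Minimal sketch: score per-token value estimates with the released value model.
# ASSUMPTIONS: the repo id below is hypothetical, and loading via
# AutoModelForTokenClassification with a scalar head (num_labels=1) is inferred
# from the page's "Token Classification" tag, not confirmed by the README.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "allenai/tulu-v2.5-ppo-13b-uf-mean-value"  # hypothetical name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the `accelerate` package
)

# Use the bundled chat template so the prompt matches the training format.
messages = [{"role": "user", "content": "How do I bake bread?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# With num_labels=1, logits have shape (batch, seq_len, 1):
# one scalar value estimate per token.
values = outputs.logits.squeeze(-1)
print(values)
```

In a value-guided decoding setup, the value at the final non-padding token of each candidate continuation would serve as its score, which is one way to read the "improved decoding with our released PPO models" use the README suggests.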