Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Community
Introduces the next version of LLaMA (Llama 2), an auto-regressive transformer with better data cleaning, a longer context length (more tokens), and grouped-query attention (GQA: K and V projections shared across multiple heads). Pipeline: self-supervised pretraining yields Llama 2; supervised fine-tuning produces the initial Llama 2-Chat; the chat model is then iteratively refined with RLHF (rejection sampling and PPO), using human feedback to train safety and helpfulness reward models. Same tokenizer as LLaMA 1 (BPE SentencePiece, 32k tokens). Better than open-source models like Falcon and MPT, but does not match the performance of closed models such as GPT-4 and PaLM-2-L.
Fine-tuning for Llama 2-Chat: supervised fine-tuning (SFT) on high-quality crowdsourced/annotated data, with user prompts zeroed out during training (back-propagating only on answer tokens), followed by RLHF. RLHF comprises: a reward model initialized from the chat model (the classification head for autoregressive next-token prediction replaced by a regression head for scalar reward prediction), trained with a modified binary ranking loss whose margin varies with how distinct the two responses are; rejection-sampling fine-tuning for the 70B model (distilled to the smaller models); and then PPO (as annotated reward batches come in), with a reward combining safety and helpfulness. Proposes Ghost Attention (GAtt) for multi-turn consistency after RLHF (context distillation to remember previous/initial instructions).
Evaluation, alongside traditional helpfulness, safety, bias, and truthfulness analysis, also includes red-teaming data (professional actors simulating real-world attacks and probing for security threats), which is fed back into fine-tuning. Lower violation percentages and higher safety and helpfulness ratings compared to MPT, Vicuna, Falcon, PaLM, and ChatGPT. Better than other open-source models on MMLU (Massive Multitask Language Understanding), Q&A, HumanEval and MBPP (code generation), Natural Questions, SQuAD (comprehension), AGI Eval, and mathematical reasoning. Further analysis and the model card are in the appendix. From Meta.
Links: website, Meta AI blog post, arxiv, Meta news, GitHub (Older LLaMA - v1)
Try the demo space with 70b version of the model: https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
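The modified binary ranking loss mentioned in the summary above has the form L = -log(sigmoid(r_chosen - r_rejected - m)), where the margin m is larger when annotators judged the two responses as more distinct. A minimal PyTorch sketch (the function name and margin values are illustrative, not Meta's code):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor,
                        r_rejected: torch.Tensor,
                        margin: torch.Tensor) -> torch.Tensor:
    """Binary ranking loss with a variable margin (sketch).

    r_chosen / r_rejected: scalar rewards the reward model assigns to the
    preferred and rejected responses for the same prompt.
    margin: larger when the preferred response was rated clearly better,
    smaller (or zero) when the two were judged similar.
    """
    # L = -log(sigmoid(r_chosen - r_rejected - margin))
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Illustrative batch of three comparisons with made-up margins
r_c = torch.tensor([1.2, 0.3, 2.0])
r_r = torch.tensor([0.5, 0.1, -0.4])
m = torch.tensor([1.0, 0.3, 1.0])  # "significantly better" vs "slightly better"
print(reward_ranking_loss(r_c, r_r, m))
```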
Meta's newly released large language model Llama 2 is not open source
https://www.theregister.com/2023/07/21/llama_is_not_open_source/
I had several questions about the RL tuning methodology in the paper, and one of the authors (@louismartin) kindly provided insightful answers that may be useful to other groups doing RLHF. The following is reproduced with his permission.
Q: Can you share any details on the dataset used for RL tuning? In particular, was it composed of multi-turn dialogues involving humans interacting with your intermediate RLHF checkpoints, or something else? I'd also be interested to know if you used any of the open preference-modeling datasets as a source of prompts for RL tuning, and in what quantity.
A: We used multi-turn dialogs of humans interacting with intermediate versions of our RLHF checkpoints indeed. We used a few of the open preference modeling datasets as a source of prompts but we realized that most of the gains came from the data we collected. Probably because it's on-distribution wrt our model.
[Lewis note] I think this partially explains why we haven't seen a flood of open RLHF models to date - it's inherently expensive to collect on-distribution preference labels, especially at the scale that Meta did.
Q: One follow up question re the dataset you used for RL tuning: was it of similar size to your SFT dataset or much larger?
A: We use all reward modeling prompts in rejection sampling and most of it in PPO but I don't have the exact proportion.
[Lewis note] it's cool to see that one can re-use preference data for RL tuning in this way. By contrast, the InstructGPT paper took pains to ensure all prompts were unique across SFT/RM/RL, but this is likely not needed for dialogue applications.
Q: In your approach to rejection sampling, you say you "use the selected outputs for a gradient update". Do I understand correctly that you do this in an "online" fashion, i.e. take a batch of prompts, generate/rank K samples, and then do a backward pass? If that's correct, can you share any details on how large your batches were?
A: No, for rejection sampling we sample on our whole dataset using the same checkpoint, select the best output for each prompt, and then finetune the checkpoint once on the whole dataset with the best outputs.
[Lewis note] This is very exciting as rejection sampling tuning is significantly simpler to implement than PPO (and all its instabilities 🙃)!
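For readers who want the mechanics, a minimal sketch of one such offline rejection-sampling round (the `policy.generate`, `reward_model.score`, and `supervised_finetune` helpers are hypothetical placeholders, not Meta's pipeline):

```python
def rejection_sampling_round(policy, reward_model, prompts, k: int = 4):
    """One offline rejection-sampling fine-tuning round (sketch).

    All helpers (`policy.generate`, `reward_model.score`, `supervised_finetune`)
    are hypothetical placeholders.
    """
    selected = []
    for prompt in prompts:
        # 1) Sample K candidates from the *same* checkpoint for every prompt.
        candidates = [policy.generate(prompt, temperature=1.0) for _ in range(k)]
        # 2) Keep only the highest-reward candidate for each prompt.
        best = max(candidates, key=lambda c: reward_model.score(prompt, c))
        selected.append((prompt, best))
    # 3) A single supervised fine-tuning pass over the whole selected dataset,
    #    rather than an online gradient update per batch.
    return supervised_finetune(policy, selected)
```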
Hi all, I tried to extract question-answer pairs from a given context using Llama 2. If you have any ideas about the implementation, please share them with me.
I'm sharing some more insights from correspondence with Kevin Stone (@kevinleestone), this time on PPO. As before, the following is reproduced with his permission.
Q: Louis mentioned that you fine-tuned on the Meta Safety / Helpfulness reward modelling corpora. Did you simply use the N-1 turns of each dialogue as a prompt and train on the full ~1.5M examples of Meta Safety / Helpfulness? Also, I'm curious what kind of mix (if any) you used between the Meta and open source datasets - was it the same proportion you described for reward modelling?
A: We used the N-1 turns from roughly the same distribution we used for reward modeling.
[Lewis note] This is quite a departure from existing literature like InstructGPT and Anthropic's papers, which typically run PPO on a far smaller scale, i.e. ~100K prompts vs the 1.5M prompts in the Meta corpora!
Q: During generation, did you use unbiased sampling (i.e. T=1, no nucleus sampling)? We've found that deviating from unbiased sampling can lead to pathologies like negative KL divergence, and I'm curious if you also observed that.
A: For generation we used temp=1 and no other sampling tricks. I ran a couple of small experiments with these and didn’t see any gains. My hunch is online RL is good at “fixing” sampling in a first-class way rather than top-p and repetition penalty which are more like heuristics. I say this because I was able to train a 4b model with PPO and fix all repetition issues even when using simple t=1 sampling.
[Lewis note] Fun story: the sensitivity of PPO to (seemingly) innocuous things like text generation heuristics is one of the many "bugs" I have lost quite a few nights sleep over 🤪. We really need an NLP version of Andy Jones' excellent blog post on debugging RL systems: https://andyljones.com/posts/rl-debugging.html
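Concretely, "no other sampling tricks" corresponds to generation settings like the following (Hugging Face-style `generate()` arguments; `policy` and `input_ids` are placeholder names):

```python
# Unbiased sampling during PPO rollouts: temperature 1, no truncation heuristics.
outputs = policy.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,         # T = 1
    top_p=1.0,               # no nucleus sampling
    top_k=0,                 # no top-k truncation
    repetition_penalty=1.0,  # no repetition penalty
)
```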
Q: For the KL penalty, did you experiment with alternative estimates like those proposed in Schulman's blog post (http://joschu.net/blog/kl-approx.html)? We haven't found these alternatives to have much positive impact, but we've heard from other groups that it does.
A: I had stability issues with the lower variance KL estimates in small scale experiments, so I defaulted to using simple KL approximation as described in the paper.
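For reference, the simple estimator and Schulman's lower-variance k3 estimator look like this (a sketch; both take per-token log-probabilities of the sampled tokens under the policy and the frozen reference model):

```python
import torch

def kl_simple(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Naive per-token estimate of KL(policy || ref) on samples from the policy:
    k1 = log pi(x) - log pi_ref(x). Unbiased but high variance."""
    return logp_policy - logp_ref

def kl_low_variance(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Schulman's k3 estimator: (r - 1) - log r with r = pi_ref(x) / pi(x).
    Lower variance and always non-negative."""
    log_ratio = logp_ref - logp_policy
    return log_ratio.exp() - 1.0 - log_ratio
```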
Q: During PPO, you note that generation was a bottleneck (our experience too!) and that you "mitigate this by consolidating the model weights to each node once before generation and then freeing the memory after generation". Do I understand correctly that you essentially save a checkpoint after each PPO step, free the memory to run inference and then resume training from the checkpoint? Since saving/loading large models has its own overhead, I'm curious what gains this had over the 20x slowdown you reported with naive generation?
A: A few more details about our PPO training:
- We use tensor parallelism of 8 for the 70b model
- We use FSDP for gradient steps which gathers the weights for each layer in turn and then frees the memory
- It takes about 5 seconds to gather and free all weights to all nodes using NCCL over the 4x 100 Gbps EFA interconnect
- We wrote a small function that initializes a non-FSDP (consolidated) model layer by layer directly into GPU memory using similar NCCL code as used during training steps
- Once all the weights are local to each node, generation is about 20x faster - we do not have to pay the 5 second tax every forward pass during generation, only once up front.
- Besides using bf16 (which we also use during training) we did not use any other quantization schemes.
- After generation is complete, we free the consolidated model and continue to use standard FSDP for scoring and initial policy logprobs (these only require one forward pass each) and PPO gradient updates (this only requires 8 forward and backward passes).
[Lewis note] As far as I know, none of the open source frameworks for RLHF apply this clever type of optimisation - they should!
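As a rough illustration of the idea using public PyTorch FSDP APIs (not the authors' internal code), `summon_full_params` gathers the sharded weights once so autoregressive decoding does not pay an all-gather on every forward pass; it assumes the wrapped module exposes a `generate` method:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def generate_with_consolidated_weights(fsdp_model: FSDP, prompts, **gen_kwargs):
    """Gather the sharded weights once, run generation, then free them again.

    Sketch only: pays the gather cost a single time per PPO step instead of on
    every forward pass during decoding. Assumes the wrapped module (e.g. a
    Hugging Face causal LM) provides `generate`.
    """
    with torch.no_grad():
        # writeback=False: weights are only read and are freed on context exit.
        with FSDP.summon_full_params(fsdp_model, writeback=False):
            return [fsdp_model.generate(p, **gen_kwargs) for p in prompts]
```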
Regarding the license restriction, 700 million monthly users is still a pretty high threshold.
Hi, I am wondering if there is any shared pipeline that can help to reproduce the reported results of Llama 2.
I have some questions about the GAtt method; I don't think it is clearly explained. If anyone with a proper understanding of GAtt could answer the following questions, it would be appreciated by me and the community:
- How is "sampling from this synthetic data using the latest RLHF model" actually applied? We have a conversation of user and assistant messages. The RLHF model needs a user message to sample an answer, but there is none. If the context-dialogue is used to sample a user question, what would the prompt look like? Would the sampled user question include the instruction?
- How is the described process analogous to rejection sampling?
- Concatenating an instruction (e.g. "act like someone") to the user messages in the given dataset does not change the given answers. Isn't this a problem?
There are enough resources for understanding context distillation and rejection sampling, so anyone kind enough to answer these questions does not need to explain them.
Thanks in advance!
I can answer these questions.
How is "sampling from this synthetic data using the latest RLHF model" actually applied?
To be concrete, you define your system prompt, e.g. act as if you were Napoleon who likes playing cricket. You have a conversation where you start each user input with that system prompt, so the assistant reply is conditioned on following the instruction:
USER: Act as if you were Napoleon who likes playing cricket. <user_msg_1>
ASSISTANT: <reply_1>
USER: Act as if you were Napoleon who likes playing cricket. <user_msg_2>
ASSISTANT: <reply_2>
...
USER: Act as if you were Napoleon who likes playing cricket. <user_msg_N>
ASSISTANT: <reply_N>
At training time, you define your dataset as such:
SYSTEM: Act as if you were Napoleon who likes playing cricket.
USER: <user_msg_1>
ASSISTANT: <reply_1>
USER: <user_msg_2>
ASSISTANT: <reply_2>
...
USER: <user_msg_N>
ASSISTANT: <reply_N>
Each reply from the assistant is conditioned by design on the system prompt; the model learns to attend to the system prompt at each step.
Rejection sampling consists in fine-tuning on specific dialogs; we do the same here, with dialogs formatted such that a "system prompt" artificially conditions the assistant replies.
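A small sketch of that data construction (the dialog representation and the `rlhf_model.chat` helper are hypothetical; in practice the replies are sampled with the latest RLHF model, as described above):

```python
def build_gatt_sample(instruction: str, user_msgs: list[str], rlhf_model):
    """Ghost Attention (GAtt) synthetic-data construction (sketch).

    Sampling time: the instruction is concatenated to every user turn, so each
    sampled reply is conditioned on it.
    Training time: the instruction is kept only in the system prompt while the
    replies stay unchanged, so the model must learn to attend to it there.
    `rlhf_model.chat` is a hypothetical helper that returns an assistant reply.
    """
    history = []
    training_dialog = [("SYSTEM", instruction)]
    for msg in user_msgs:
        augmented = f"{instruction} {msg}"                        # sampling-time prompt
        reply = rlhf_model.chat(history + [("USER", augmented)])
        history += [("USER", augmented), ("ASSISTANT", reply)]
        training_dialog += [("USER", msg), ("ASSISTANT", reply)]  # instruction dropped
    return training_dialog
```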
@javier-m Thanks for the answers, you explained it beautifully. May I ask a few follow-up questions:
We now have a context-dialogue and the sample with which to fine-tune a model...
Considering the example conversation you gave, what is the context-dialogue and the sample?
Could you please elaborate on the matter of zeroing out the loss for the previous turns?
Thanks in advance!
The context-dialogue refers to the system prompt, I suppose, and in this case we have N samples:
SYSTEM: Act as if you were Napoleon who likes playing cricket.
USER: <user_msg_1>
ASSISTANT: <reply_1>

SYSTEM: Act as if you were Napoleon who likes playing cricket.
USER: <user_msg_1>
ASSISTANT: <reply_1>
USER: <user_msg_2>
ASSISTANT: <reply_2>

...

SYSTEM: Act as if you were Napoleon who likes playing cricket.
USER: <user_msg_1>
ASSISTANT: <reply_1>
...
USER: <user_msg_N>
ASSISTANT: <reply_N>
For each sample S, at training time, we pass the context:
SYSTEM: Act as if you were Napoleon who likes playing cricket.
USER: <user_msg_1>
ASSISTANT: <reply_1>
USER: <user_msg_2>
ASSISTANT: <reply_2>
...
USER: <user_msg_S-1>
ASSISTANT: <reply_S-1>
USER: <user_msg_S>
and we train the probabilities on ASSISTANT: <reply_S>: the loss is zeroed out for all tokens before, then you compute the cross-entropy between the ground truth (the actual <reply_S>) and the predicted tokens P(t_{>k} | t_0, ..., t_k), where k is the length of the context sequence.
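In common training code this amounts to masking the labels of every context token with -100 so that cross-entropy ignores them (a sketch, assuming already-tokenized context and reply; not the authors' code):

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_inputs_and_labels(context_ids: torch.Tensor, reply_ids: torch.Tensor):
    """Train only on the final assistant reply: the loss on all context tokens
    (system prompt, earlier turns, current user message) is zeroed out."""
    input_ids = torch.cat([context_ids, reply_ids])
    labels = input_ids.clone()
    labels[: context_ids.numel()] = IGNORE_INDEX  # mask the context
    return input_ids, labels
```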