Code Llama: Open Foundation Models for Code
Abstract
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
Introduces Code Llama: coding models built on Llama 2 (pretrained) through a cascade of fine-tuning steps, at 7B, 13B, and 34B parameters; has a Python specialization; SOTA among open models on MultiPL-E (multilingual HumanEval).

Training pipeline: start from Llama 2; code training (500B tokens, with an infilling objective) gives the base model; further Python code training (100B tokens) gives the specialized model; long-context fine-tuning (LCFT, 20B tokens) is applied to all three variants; instruction fine-tuning (5B tokens) gives the instruct model.

Infilling: trained via causal masking, i.e., moving parts of the training sequence to the end and autoregressively predicting the reordered sequence, alongside the standard autoregressive objective. Sequences use the suffix-prefix-middle (SPM) and prefix-suffix-middle (PSM) formats, with special tokens marking the start of the prefix, suffix, and middle and the end of the infilling span (see the format sketch after these notes).

LCFT: handles more tokens at inference time by modifying the rotary position embeddings (RoPE); rather than position interpolation, the base period from which the rotation frequencies are derived is increased from 10,000 to 1,000,000 (frequency sketch below).

Instruction fine-tuning: combines the proprietary RLHF V5 dataset from Llama 2 with self-instruct data (Llama 2 generates interview-style programming questions; Code Llama generates unit tests and candidate solutions, keeping the first solution that passes the tests) to increase utility and decrease toxicity. Separately, fine-tuning the Python model on a small set of high-quality unnatural instructions (a machine-generated instruction set similar in spirit to the self-instruct data) yields performance comparable to GPT-4 on HumanEval and MBPP.

Results: Code Llama and the instruct model are the best open models on multilingual HumanEval (MLHE / MultiPL-E), ahead of StarCoder and other open models; they are also best at single-line infilling on MLHE, and the infilling objective costs little to no left-to-right generation performance. Training losses are lower when the model is initialized from Llama 2 weights than when trained from scratch. Ablations cover pass@k at different sampling temperatures (estimator sketch below).

Safety: toxicity evaluation on ToxiGen, bias evaluation on BOLD (Bias in Open-Ended Language Dataset), red teaming (adversarial testing), and false-refusal evaluation. Appendix has acknowledgements, additional ablations and math reasoning results, infilling, LCFT, and prompts. From Meta.
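A minimal sketch of the two infilling formats. The sentinel spellings (`<PRE>`, `<SUF>`, `<MID>`, `<EOT>`) are placeholders for the paper's dedicated special tokens; an actual tokenizer release may spell them differently.

```python
# Sketch of the PSM and SPM infilling formats. The sentinel strings
# below stand in for the paper's special tokens (assumed spellings).
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def format_psm(prefix: str, suffix: str, middle: str | None = None) -> str:
    """PSM (prefix-suffix-middle): condition on prefix, then suffix,
    then generate the middle. Pass middle=None at inference time and
    sample until the end-of-infilling token."""
    s = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    return s if middle is None else s + middle + EOT

def format_spm(prefix: str, suffix: str, middle: str | None = None) -> str:
    """SPM (suffix-prefix-middle): the suffix is presented first, so the
    prefix and the generated middle stay contiguous at the sequence end."""
    s = f"{PRE}{SUF}{suffix}{MID}{prefix}"
    return s if middle is None else s + middle + EOT

# Inference-style prompt: ask the model to fill in a function body.
prompt = format_psm(prefix="def add(a, b):\n    ", suffix="\n    return result\n")
```

At inference time this string would be tokenized with the model's tokenizer and the completion sampled until the end-of-infilling token appears.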
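The LCFT modification is small enough to show directly. A minimal sketch of the RoPE frequency computation under the standard theta_i = base^(-2i/d) parameterization; per the paper, only the base period changes (10,000 to 1,000,000).

```python
import torch

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-dimension-pair rotation frequencies theta_i = base**(-2i/d)
    used by rotary position embeddings (RoPE)."""
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return base ** (-i / head_dim)

# Llama 2 pretraining uses base = 10_000; LCFT raises it to 1_000_000,
# so rotation angles (position * theta_i) grow more slowly with position
# and distant tokens remain distinguishable at long context lengths.
theta_pretrain = rope_frequencies(128, base=10_000.0)
theta_lcft = rope_frequencies(128, base=1_000_000.0)

# Rotation angle at token position p for the slowest frequency pair:
p = 16_000
print(p * theta_pretrain[-1], p * theta_lcft[-1])  # LCFT angle is far smaller
```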
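The pass@k numbers in those ablations use the standard unbiased estimator of Chen et al. (2021): draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of which are correct) passes."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 of which pass the tests.
print(pass_at_k(n=200, c=30, k=1))    # ~0.15
print(pass_at_k(n=200, c=30, k=100))  # ~1.0
```

This is why the temperature ablation matters: pass@1 rewards the single most likely sample (low temperature), while pass@100 rewards diverse sampling (higher temperature).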
Links: Meta Blog (research post), arXiv, PapersWithCode, GitHub