InstructRetro
Documentation | Paper | Evaluation Data | Model Weights
InstructRetro (Wang et al., 2023b) scales up Retro to 48B parameters, making it the largest LLM pretrained with retrieval (as of December 2023). The resulting foundation model, Retro 48B, substantially outperforms its GPT counterpart in terms of perplexity. After instruction tuning on Retro, InstructRetro shows significant improvement over the instruction-tuned GPT on downstream tasks in the zero-shot setting. Specifically, InstructRetro improves on its GPT counterpart by 7% on average across 8 short-form QA tasks, and by 10% across 4 challenging long-form QA tasks. We also find that one can ablate the encoder from the InstructRetro architecture and directly use the InstructRetro decoder backbone as a GPT model, while achieving comparable results.
For more information about InstructRetro, check the Documentation!
Background
Retro (Borgeaud et al., 2022) is an autoregressive decoder-only language model (LM) pretrained with retrieval augmentation. Retro scales in practice to large-scale pretraining from scratch by retrieving from trillions of tokens. Pretraining with retrieval provides a more efficient storage mechanism for factual knowledge than storing it implicitly in the network's parameters, which substantially reduces the number of model parameters while achieving lower perplexity than a standard GPT. Retro also provides the flexibility to update the knowledge stored in LMs (Wang et al., 2023a) by updating the retrieval database without retraining the LM.
Overview
License
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.
Supported Hardware
- H100
- A100 80GB, A100 40GB
Model Version(s)
retro-8b-instruct-4k: Pretrained Retro 8B LM with instruction tuning.
Toolkit
Environment
We recommend using a Docker environment to run the code.
Docker image
We provide a Docker build file in Dockerfile for reproduction. The Docker image is based on nvcr.io/nvidia/pytorch:23.09-py3.
Install dependencies
Clone the Megatron repo:
git clone --branch InstructRetro https://github.com/NVIDIA/Megatron-LM.git
If Docker is not available, we recommend starting from a clean conda environment with the following runtime dependencies:
- Python 3.10
- NVIDIA CUDA® 12.2.1
- NVIDIA cuBLAS 12.2.5.6
- NVIDIA cuDNN 8.9.5
- NVIDIA NCCL 2.18.5
- PyTorch 2.1.0a0+32f93b1
Then install Retro-specific dependencies, including:
pip install -U faiss-gpu
pip install -U transformers
pip install -U sentencepiece
pip install -U h5py
pip install -U nltk
pip install -U einops
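After installing, a quick check that the packages are visible to the interpreter can save a failed run later. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the only assumption is the usual mapping from pip package names to import names (in particular, faiss-gpu is imported as `faiss`).

```python
import importlib.util

# pip package name -> module name used at import time.
# Note that faiss-gpu is imported as `faiss`.
RETRO_DEPENDENCIES = {
    "faiss-gpu": "faiss",
    "transformers": "transformers",
    "sentencepiece": "sentencepiece",
    "h5py": "h5py",
    "nltk": "nltk",
    "einops": "einops",
}


def missing_packages(deps=RETRO_DEPENDENCIES):
    """Return the pip names whose modules cannot be found in this environment.

    Hypothetical helper: it only looks up each module's import spec and
    does not actually import the package.
    """
    return [
        pip_name
        for pip_name, module in deps.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Retro-specific dependencies are importable.")
```

This only confirms that the modules can be located; it does not verify versions or GPU support.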
Evaluation Command
Download our model checkpoint and tokenizer.
Fill in the blank arguments in the tools/retro/text_generation/retro_generate.sh script, including the model path, the Retro workdir, and the model-related parameters.
Parameter | Value | Explanation |
---|---|---|
mod_par | 4 | Tensor parallelism |
layers | 32 | Number of layers in the model |
hid_dim | 4096 | Hidden dimension size |
heads | 32 | Number of attention heads |
pip_par | 1 | Pipeline parallelism |
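The values in the table are related by simple arithmetic, which the sketch below spells out. It is an illustration only: the `RetroConfig` class and its method names are hypothetical, and the divisibility checks reflect the usual constraints of tensor- and pipeline-parallel training rather than anything stated in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetroConfig:
    """Hypothetical container for the parameters listed in the table above."""

    mod_par: int = 4      # tensor parallelism
    layers: int = 32      # number of layers in the model
    hid_dim: int = 4096   # hidden dimension size
    heads: int = 32       # number of attention heads
    pip_par: int = 1      # pipeline parallelism

    def head_dim(self) -> int:
        """Per-head dimension: the hidden size split evenly across heads."""
        if self.hid_dim % self.heads != 0:
            raise ValueError("hid_dim must be divisible by heads")
        return self.hid_dim // self.heads

    def heads_per_tensor_rank(self) -> int:
        """Attention heads handled by each tensor-parallel rank (assumed even split)."""
        if self.heads % self.mod_par != 0:
            raise ValueError("heads must be divisible by mod_par")
        return self.heads // self.mod_par

    def layers_per_pipeline_stage(self) -> int:
        """Layers held by each pipeline stage (assumed even split)."""
        if self.layers % self.pip_par != 0:
            raise ValueError("layers must be divisible by pip_par")
        return self.layers // self.pip_par

    def gpus_per_model_replica(self) -> int:
        """GPUs needed to hold one copy of the model."""
        return self.mod_par * self.pip_par


if __name__ == "__main__":
    cfg = RetroConfig()
    print("head dim:", cfg.head_dim())
    print("heads per tensor rank:", cfg.heads_per_tensor_rank())
    print("layers per pipeline stage:", cfg.layers_per_pipeline_stage())
    print("GPUs per model replica:", cfg.gpus_per_model_replica())
```

With the 8B settings this gives a per-head dimension of 128, 8 heads per tensor-parallel rank, and 4 GPUs per model replica.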
Below is an example command that runs Retro generation with the InstructRetro checkpoints on the Natural Questions (NQ) task. The example command is for the 8b InstructRetro. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints.
bash tools/retro/text_generation/retro_generate.sh nq 8b greedy test 0 20000 1000 5 pp1 <path/to/checkpoint> 2
The generated responses will be saved in the corresponding checkpoint directory. For example, for the 8b InstructRetro, the output is saved to:
<path/to/retro>/retro-generate-nq_5_2_8b_test_greedy_0_20000_1000.txt
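The output file name appears to be assembled from the positional arguments of the generation command. The sketch below reproduces that mapping as inferred from the single example above; it is an assumption, not the script's documented behaviour. The function name and the variable names are hypothetical, and the README does not say what the numeric arguments mean, so they are kept as opaque values.

```python
import shlex


def expected_output_name(command: str) -> str:
    """Guess the output file name for a retro_generate.sh invocation.

    The pattern is inferred from one example command/output pair and
    may not hold for other tasks or checkpoints.
    """
    tokens = shlex.split(command)
    # Locate the script so that a leading "bash" (or similar) is ignored.
    script_idx = next(
        i for i, tok in enumerate(tokens) if tok.endswith("retro_generate.sh")
    )
    args = tokens[script_idx + 1:]
    if len(args) != 11:
        raise ValueError(f"expected 11 positional arguments, got {len(args)}")

    task, model_size, sampling, split = args[0:4]
    num_a, num_b, num_c = args[4:7]   # e.g. 0, 20000, 1000 (meaning not stated)
    num_d = args[7]                   # e.g. 5 (meaning not stated)
    # args[8] (e.g. pp1) and args[9] (checkpoint path) do not appear in the name.
    num_e = args[10]                  # e.g. 2 (meaning not stated)

    return (
        f"retro-generate-{task}_{num_d}_{num_e}_{model_size}_"
        f"{split}_{sampling}_{num_a}_{num_b}_{num_c}.txt"
    )


if __name__ == "__main__":
    cmd = (
        "bash tools/retro/text_generation/retro_generate.sh "
        "nq 8b greedy test 0 20000 1000 5 pp1 <path/to/checkpoint> 2"
    )
    print(expected_output_name(cmd))
```

For the example command this yields the file name shown above, which makes it easier to locate the output before running the evaluation.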
To evaluate the F1 / Exact Match (EM) scores of the generated responses, we provide an example script to run the evaluation on the NQ dataset. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints and downstream tasks.
python3 tools/retro/text_generation/evaluate.py
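For readers unfamiliar with the metrics, the sketch below shows how F1 and Exact Match are commonly computed for short-form QA, following the widely used SQuAD-style answer normalization. It is not the code in evaluate.py, whose normalization and aggregation may differ; all function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, references: list) -> float:
    """Score a prediction against several references and keep the best score."""
    return max(metric(prediction, ref) for ref in references)


if __name__ == "__main__":
    refs = ["Eiffel Tower", "the Eiffel Tower in Paris"]
    print(best_over_references(exact_match, "The Eiffel Tower.", refs))
    print(best_over_references(f1_score, "Eiffel Tower in Paris", refs))
```

Taking the maximum over references is the usual convention when a question has several acceptable answers; dataset-level scores are then the mean over questions.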
Citations
See our papers for more details:
Shall we Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study.
Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro. (EMNLP 2023)
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro. (ICML 2024)
Please cite the papers as follows if you use the data or code from this repo:
@inproceedings{wang2023shall,
title = {Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study},
author = {Boxin Wang and Wei Ping and Peng Xu and Lawrence McAfee and Zihan Liu and Mohammad Shoeybi and Yi Dong and Oleksii Kuchaiev and Bo Li and Chaowei Xiao and Anima Anandkumar and Bryan Catanzaro},
booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
year = {2023}
}
@article{wang2023instructretro,
title = {InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining},
author = {Boxin Wang and Wei Ping and Lawrence McAfee and Peng Xu and Bo Li and Mohammad Shoeybi and Bryan Catanzaro},
year = {2023},
journal = {arXiv preprint arXiv:2310.07713}
}