metadata

license: llama3
language:
  - en
tags:
  - finance
datasets:
  - Open-Orca/OpenOrca
  - GAIR/lima
  - WizardLM/WizardLM_evol_instruct_V2_196k

Quantized to exl2 using Exllamav2 0.0.2

Instruction Pre-Training: Language Models are Supervised Multitask Learners

This repo contains the finance model developed from Llama3-8B in our paper Instruction Pre-Training: Language Models are Supervised Multitask Learners.

We explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. Instruction Pre-Training outperforms Vanilla Pre-training in both general pre-training from scratch and domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.

Resources

🤗 We share our data and models with example usages, feel free to open any issues or discussions! 🤗

Context-Based Instruction Synthesizer: instruction-synthesizer
Fine-Tuning Data for the Synthesizer: ft-instruction-synthesizer-collection
General Models Pre-Trained from Scratch:
- InstructLM-500M
- InstructLM-1.3B
Domain-Specific Models Pre-Trained from Llama3-8B:
- Finance-Llama3-8B
- Biomedicine-Llama3-8B
General Instruction-Augmented Corpora: general-instruction-augmented-corpora
Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): medicine-instruction-augmented-corpora

Domain-Adaptive Continued Pre-Training

Following AdaptLLM, we augment the domain-specific raw corpora with instruction-response pairs generated by our context-based instruction synthesizer.

1. To chat with the finance-Llama3-8B model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("instruction-pretrain/finance-Llama3-8B")
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/finance-Llama3-8B")

# Put your input here, NO prompt template is required
user_input = '''Use this fact to answer the question: Title of each class Trading Symbol(s) Name of each exchange on which registered
Common Stock, Par Value $.01 Per Share MMM New York Stock Exchange
MMM Chicago Stock Exchange, Inc.
1.500% Notes due 2026 MMM26 New York Stock Exchange
1.750% Notes due 2030 MMM30 New York Stock Exchange
1.500% Notes due 2031 MMM31 New York Stock Exchange

Which debt securities are registered to trade on a national securities exchange under 3M's name as of Q2 of 2023?'''

inputs = tokenizer(user_input, return_tensors="pt", add_special_tokens=True).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=400)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(pred)

2. To evaluate our models on the domain-specific tasks

Set up dependencies

git clone https://github.com/microsoft/LMOps
cd LMOps/adaptllm
pip install -r requirements.txt

Evaluate

DOMAIN='finance'

# if the model can fit on a single GPU: set MODEL_PARALLEL=False
# elif the model is too large to fit on a single GPU: set MODEL_PARALLEL=True
MODEL_PARALLEL=False

# number of GPUs, chosen from [1,2,4,8]
N_GPU=1

# Set as True
add_bos_token=True

bash scripts/inference.sh ${DOMAIN} 'instruction-pretrain/finance-Llama3-8B' ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}

Citation

If you find our work helpful, please cite us:

AdaptLLM

@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}