Text Generation
Transformers
Safetensors
English
llama
finance
text-generation-inference
Inference Endpoints
instruction-pretrain committed on
Commit
95e8615
1 Parent(s): f35d466

Update README.md

  1. README.md +6 -4
README.md CHANGED
@@ -9,7 +9,7 @@ datasets:
  - GAIR/lima
  - WizardLM/WizardLM_evol_instruct_V2_196k
  ---
- # Instruction Pre-Training: Language Models are Supervised Multitask Learners
+ # Instruction Pre-Training: Language Models are Supervised Multitask Learners (EMNLP 2024)
  This repo contains the **finance model developed from Llama3-8B** in our paper [Instruction Pre-Training: Language Models are Supervised Multitask Learners](https://huggingface.co/papers/2406.14491).
 
  We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continual pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. **In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.**
@@ -19,7 +19,9 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
  </p>
 
  **************************** **Updates** ****************************
- * 2024/8/29: Updated [guidelines](https://huggingface.co/instruction-pretrain/finance-Llama3-8B) on evaluating any 🤗Huggingface models on the domain-specific tasks
+ * 2024/9/20: Our paper has been accepted by EMNLP 2024 main conference🎉
+ * 2024/9/11: Updated [FAQ on continual pre-training from Llama3](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
+ * 2024/8/29: Updated [guidelines](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B) on evaluating any 🤗Huggingface models on the domain-specific tasks
  * 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
  * 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M. The performance trend on downstream tasks throughout the pre-training process:
  <p align='left'>
@@ -140,7 +142,7 @@ text_ids = tokenizer(text, add_special_tokens=False, **kwargs).input_ids
  ## Citation
  If you find our work helpful, please cite us:
 
- Instruction Pre-Training
+ [Instruction Pre-Training](https://huggingface.co/papers/2406.14491) (EMNLP 2024)
  ```bibtex
  @article{cheng2024instruction,
  title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
@@ -150,7 +152,7 @@ Instruction Pre-Training
  }
  ```
 
- [Adapt LLM to Domains](https://huggingface.co/papers/2309.09530)
+ [Adapt LLM to Domains](https://huggingface.co/papers/2309.09530)(ICLR 2024)
  ```bibtex
  @inproceedings{
  cheng2024adapting,