princeton-nlp committed
Commit 152a4a9
1 Parent(s): 7177fa3

Update README.md

Files changed (1):
  1. README.md +24 -4
README.md CHANGED
@@ -9,20 +9,31 @@ Contact: `{tianyug, awettig}@princeton.edu`
  💡 ProLong stands for **Pr**incet**o**n **Long**-Context!

+ ## The ProLong Series
+
+ - princeton_nlp/Llama-3-8B-ProLong-64k-Base
+ - princeton_nlp/Llama-3-8B-ProLong-64k-Instruct ← you are here!
+ - princeton_nlp/Llama-3-8B-ProLong-512k-Base (coming soon)
+ - princeton_nlp/Llama-3-8B-ProLong-512k-Instruct (coming soon)
+
  ## Features

- - Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce this long-context instruction-tuned model that can stably handle up to 64K tokens.
+ - Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
  - This model is trained on
    - 20B carefully curated data mixture of short and long data (max length 64K).
-   - Then fine-tuned on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
+   - For the 512K version, we continue training the base model for 5B more tokens, with a mixture of short, long (64K), and ultra-long (512K) data.
+   - Then we fine-tune the models on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
  - On a range of long-context tasks, our ProLong model achieves the top performance among models of similar sizes.
- - We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. Our technical report will come soon.
+ - We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our upcoming technical report.

  ## Benchmarking results

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/FLp9_R5NQR8HNxPsozCJv.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/PPSuEMsUWIyrmrOV_88Xf.png)

  You can find results for more tasks and models in this [spreadsheet](https://docs.google.com/spreadsheets/d/1qGzimBE8F896p1m7_yWHnjyGX7kpEAeyaT1h2iTbNzE/edit?usp=sharing).
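The checkpoints in the ProLong series above are regular Hugging Face causal language models. Below is a minimal sketch of loading the 64K instruct model and running a chat-style generation, assuming the standard `transformers`/`torch` API; the dtype, device placement, prompt, and generation settings are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch: load the 64K-context instruct model and generate a reply.
# The repo id is written as in the ProLong series list above; dtype, device
# placement, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton_nlp/Llama-3-8B-ProLong-64k-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to fit an 8B model on one GPU
    device_map="auto",
)

# Chat-style prompt, formatted with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Summarize the key points of the document below.\n\n<long document here>"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Inputs of up to 64K tokens are the intended operating range for this checkpoint; the 512K variants listed above are meant for longer documents once they are available.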
 
@@ -128,6 +139,15 @@ We divide the tasks into the following categories:
  Note that we are still actively developing our evaluation and the results/tasks are subject to change. We plan to include a more systematic evaluation in our technical report.

+ <details>
+ <summary>Some more details about the evaluation.</summary>
+
+ - All evaluation context lengths are measured with the Llama-2 tokenizer, to accommodate models with smaller vocabularies.
+ - For JSON KV and RAG, we randomly sample the positions of the target key-value pairs or passages to test “lost in the middle”.
+ - For ICL, we use abstract labels (0, 1, 2, 3, …) instead of natural-language labels ([Pan et al., 2023](https://arxiv.org/pdf/2305.09731)) to evaluate models’ ability to learn new tasks.
+ - We use greedy decoding for all models/tasks.
+ </details>
+
  ## Efficient training techniques

  We integrate several pieces of efficient training techniques in producing our models:
 
 
 
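On the evaluation side, the details above come down to two mechanical conventions: input length is budgeted with the Llama-2 tokenizer (so that models with different vocabularies see the same amount of text), and every model is decoded greedily. The sketch below shows what these two steps could look like with the standard `transformers` API; the helper names, the 64K budget, and the specific Llama-2 repo id used for counting are assumptions for illustration rather than the exact evaluation setup.

```python
# Rough sketch of the two evaluation conventions described above:
# (1) measure the input budget with a Llama-2 tokenizer, (2) decode greedily.
# Helper names, the 64K budget, and the Llama-2 repo id are illustrative.
from transformers import AutoTokenizer

# Tokenizer used only for counting/truncation, so the same budget applies to
# models with different vocabularies.
budget_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def truncate_to_budget(text: str, max_tokens: int = 64 * 1024) -> str:
    """Keep only as much text as fits within `max_tokens` Llama-2 tokens (assumed 64K budget)."""
    ids = budget_tokenizer(text, add_special_tokens=False)["input_ids"]
    return budget_tokenizer.decode(ids[:max_tokens])

def greedy_answer(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding (no sampling), as used for all models/tasks."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# ICL-style prompt with abstract labels (0, 1, ...) instead of natural-language
# labels, as in the ICL bullet above; the task itself is a toy example.
demos = "Review: great movie\nLabel: 0\nReview: terrible plot\nLabel: 1\n"
prompt = truncate_to_budget(demos + "Review: loved every minute\nLabel:")
# answer = greedy_answer(model, tokenizer, prompt)
```

Measuring length with a single proxy tokenizer keeps the context budget comparable across models whose own tokenizers segment text differently.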