princeton-nlp committed
Commit • 152a4a9 • 1 Parent(s): 7177fa3
Update README.md

README.md CHANGED
@@ -9,20 +9,31 @@ Contact: `{tianyug, awettig}@princeton.edu`
 
 💡 ProLong stands for **Pr**incet**o**n **Long**-Context!
 
+## The ProLong Series
+
+- princeton_nlp/Llama-3-8B-ProLong-64k-Base
+- princeton_nlp/Llama-3-8B-ProLong-64k-Instruct ← you are here!
+- princeton_nlp/Llama-3-8B-ProLong-512k-Base (soon-to-come)
+- princeton_nlp/Llama-3-8B-ProLong-512k-Instruct (soon-to-come)
+
 ## Features
 
-
+
+- Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
 - This model is trained on
   - 20B carefully curated data mixture of short and long data (max length 64K).
-
+  - For the 512K version, we continue training the base model for 5B more tokens, with a mixture of short, long (64K), and ultra long (512K) data.
+- Then we fine-tuned them on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
 - On a range of long-context tasks, our ProLong model achieves the top performance among models of similar sizes.
-- We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length.
+- We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our soon-to-come technique report.
+
 
 ## Benchmarking results
 
 
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/PPSuEMsUWIyrmrOV_88Xf.png)
+
 
 You can find results for more tasks and models in this [spreadsheet](https://docs.google.com/spreadsheets/d/1qGzimBE8F896p1m7_yWHnjyGX7kpEAeyaT1h2iTbNzE/edit?usp=sharing).
 
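The hunk above adds the ProLong model list and the feature bullets (a Llama-3-8B instruct model extended to a stable 64K-token context), but the card in this diff shows no usage code. Below is a minimal sketch of querying the 64K instruct checkpoint with Hugging Face `transformers`; the Hub ID `princeton-nlp/Llama-3-8B-ProLong-64k-Instruct`, the bf16 / `device_map="auto"` loading choices, and the example prompt are assumptions made for illustration, not settings taken from the model card.

```python
# Minimal sketch (not from the model card): load the 64K instruct model and run
# greedy chat generation with the Llama-3 chat template via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Llama-3-8B-ProLong-64k-Instruct"  # assumed Hub ID for the model listed above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to keep long-context memory use manageable
    device_map="auto",
)

# A long document can be pasted into the user turn; the model is trained to handle up to 64K tokens.
messages = [
    {"role": "user", "content": "Summarize the following report:\n\n" + open("report.txt").read()},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```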
@@ -128,6 +139,15 @@ We divide the tasks into the following categories:
 
 Note that we are still actively developing our evaluation and the results/tasks are subject to change. We plan to include a more systematic evaluation in our technical report.
 
+<details>
+<summary>Some more details about the evaluation.</summary>
+- All the evaluation context length is determined by the llama-2 tokenizer to accommodate models with smaller vocabularies.
+- For Json KV and RAG, we randomly sample positions of the target key-value pairs or the passages to test “lost-in-the-middle”.
+- For ICL, we use abstract labels (0,1,2,3…) instead of natural language labels ([Pan et al., 2023](https://arxiv.org/pdf/2305.09731)) to evaluate models’ ability to learn new tasks.
+- We use greedy decoding for all models/tasks.
+</details>
+
+
 ## Efficient training techniques
 
 We integrate several pieces of efficient training techniques in producing our models:
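The evaluation notes added in the second hunk (lengths measured with the llama-2 tokenizer, randomly placed Json KV targets to test “lost-in-the-middle”, greedy decoding) can be made concrete with a small sketch. The prompt wording, the number of key-value pairs, and the `meta-llama/Llama-2-7b-hf` tokenizer checkpoint below are assumptions for illustration; this is not the authors' evaluation harness.

```python
# Illustrative sketch (not the authors' evaluation code): build a JSON KV probe with the
# target pair at a random position, measure the prompt length with a Llama-2 tokenizer,
# and note the greedy-decoding call that would score a model on it.
import json
import random
import uuid

from transformers import AutoTokenizer

def make_json_kv_prompt(num_pairs: int = 500) -> tuple[str, str, str]:
    """Return (prompt, target_key, target_value); the target key sits at a random
    position in the JSON so that 'lost-in-the-middle' behavior is exercised."""
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
    target_key = random.choice(list(pairs))  # random position among the pairs
    target_value = pairs[target_key]
    prompt = (
        "Extract the value for the requested key from the JSON below.\n"
        f"{json.dumps(pairs, indent=0)}\n"
        f"Key: {target_key}\nValue:"
    )
    return prompt, target_key, target_value

prompt, key, value = make_json_kv_prompt()

# Context length is measured with the Llama-2 tokenizer so that models with different
# vocabularies are compared at the same nominal length (tokenizer checkpoint assumed here).
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print("prompt length (Llama-2 tokens):", len(llama2_tok(prompt).input_ids))

# Greedy decoding for all models/tasks would then look like:
#   outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
```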