princeton-nlp committed
Commit 152a4a9
1 Parent(s): 7177fa3

Update README.md

Files changed (1):
  1. README.md +24 -4
README.md CHANGED
@@ -9,20 +9,31 @@ Contact: `{tianyug, awettig}@princeton.edu`
  💡 ProLong stands for **Pr**incet**o**n **Long**-Context!

+ ## The ProLong Series
+
+ - princeton_nlp/Llama-3-8B-ProLong-64k-Base
+ - princeton_nlp/Llama-3-8B-ProLong-64k-Instruct ← you are here!
+ - princeton_nlp/Llama-3-8B-ProLong-512k-Base (coming soon)
+ - princeton_nlp/Llama-3-8B-ProLong-512k-Instruct (coming soon)
+
  ## Features

- - Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce this long-context instruction-tuned model that can stably handle up to 64K tokens.
+ - Based on [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) (original max length: 8K), we produce a long-context instruction-tuned model that can stably handle up to 64K tokens. We also have a version that can process up to 512K tokens.
  - This model is trained on
    - 20B carefully curated data mixture of short and long data (max length 64K).
-   - Then fine-tuned on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
+   - For the 512K version, we continue training the base model for 5B more tokens, with a mixture of short, long (64K), and ultra-long (512K) data.
+   - Then we fine-tune the models on [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) to regain chat ability.
  - On a range of long-context tasks, our ProLong model achieves the top performance among models of similar sizes.
- - We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. Our technical report will come soon.
+ - We conduct extensive ablations in our preliminary experiments, looking for the most effective way to extend LMs’ context length. We will include more details in our upcoming technical report.

  ## Benchmarking results

- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/FLp9_R5NQR8HNxPsozCJv.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607f846419a5af0183d7bfb9/PPSuEMsUWIyrmrOV_88Xf.png)

  You can find results for more tasks and models in this [spreadsheet](https://docs.google.com/spreadsheets/d/1qGzimBE8F896p1m7_yWHnjyGX7kpEAeyaT1h2iTbNzE/edit?usp=sharing).
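The checkpoints in the ProLong series above are regular Hugging Face causal language models. Below is a minimal sketch of loading the 64K instruct model and running a chat-style generation, assuming the standard `transformers`/`torch` API; the dtype, device placement, prompt, and generation settings are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch: load the 64K-context instruct model and generate a reply.
# The repo id is written as in the ProLong series list above; dtype, device
# placement, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton_nlp/Llama-3-8B-ProLong-64k-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to fit an 8B model on one GPU
    device_map="auto",
)

# Chat-style prompt, formatted with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Summarize the key points of the document below.\n\n<long document here>"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Inputs of up to 64K tokens are the intended operating range for this checkpoint; the 512K variants listed above are meant for longer documents once they are available.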
 
@@ -128,6 +139,15 @@ We divide the tasks into the following categories:
  Note that we are still actively developing our evaluation and the results/tasks are subject to change. We plan to include a more systematic evaluation in our technical report.

+ <details>
+ <summary>Some more details about the evaluation.</summary>
+
+ - All evaluation context lengths are measured with the Llama-2 tokenizer, to accommodate models with smaller vocabularies.
+ - For JSON KV and RAG, we randomly sample the positions of the target key-value pairs or passages to test “lost in the middle”.
+ - For ICL, we use abstract labels (0, 1, 2, 3, …) instead of natural-language labels ([Pan et al., 2023](https://arxiv.org/pdf/2305.09731)) to evaluate models’ ability to learn new tasks.
+ - We use greedy decoding for all models/tasks.
+ </details>
+
  ## Efficient training techniques

  We integrate several pieces of efficient training techniques in producing our models:
 
 
 
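On the evaluation side, the details above come down to two mechanical conventions: input length is budgeted with the Llama-2 tokenizer (so that models with different vocabularies see the same amount of text), and every model is decoded greedily. The sketch below shows what these two steps could look like with the standard `transformers` API; the helper names, the 64K budget, and the specific Llama-2 repo id used for counting are assumptions for illustration rather than the exact evaluation setup.

```python
# Rough sketch of the two evaluation conventions described above:
# (1) measure the input budget with a Llama-2 tokenizer, (2) decode greedily.
# Helper names, the 64K budget, and the Llama-2 repo id are illustrative.
from transformers import AutoTokenizer

# Tokenizer used only for counting/truncation, so the same budget applies to
# models with different vocabularies.
budget_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def truncate_to_budget(text: str, max_tokens: int = 64 * 1024) -> str:
    """Keep only as much text as fits within `max_tokens` Llama-2 tokens (assumed 64K budget)."""
    ids = budget_tokenizer(text, add_special_tokens=False)["input_ids"]
    return budget_tokenizer.decode(ids[:max_tokens])

def greedy_answer(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding (no sampling), as used for all models/tasks."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# ICL-style prompt with abstract labels (0, 1, ...) instead of natural-language
# labels, as in the ICL bullet above; the task itself is a toy example.
demos = "Review: great movie\nLabel: 0\nReview: terrible plot\nLabel: 1\n"
prompt = truncate_to_budget(demos + "Review: loved every minute\nLabel:")
# answer = greedy_answer(model, tokenizer, prompt)
```

Measuring length with a single proxy tokenizer keeps the context budget comparable across models whose own tokenizers segment text differently.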