Update README.md
README.md
CHANGED
@@ -16,7 +16,7 @@ Meltemi is built on top of [Mistral-7B](https://huggingface.co/mistralai/Mistral
 # Model Information
 
 - Vocabulary extension of the Mistral-7B tokenizer with Greek tokens
--
+- 8192 context length
 - We extend the pretraining of Mistral-7B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **40 billion tokens**.
 * This corpus includes 28.5 billion monolingual Greek tokens, constructed from publicly available resources. Additionaly, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).
 * This corpus has been processed, filtered, and deduplicated to ensure data quality (a detailed description of our data processing pipeline will be published in our upcoming paper) and is outlined below:
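
As a quick sanity check on the figures quoted in the README text, the three sub-corpora sum to 39.6 billion tokens, which is consistent with the "approximately 40 billion tokens" headline. The sketch below only redoes that arithmetic; the dictionary keys and variable names are illustrative and do not come from the README.

```python
# Token counts in billions, copied from the README text in this diff.
# The labels are illustrative names, not identifiers used by the authors.
corpus_billions = {
    "Greek (monolingual)": 28.5,
    "English (monolingual)": 10.5,
    "Greek-English parallel": 0.6,  # 600 million tokens
}

# Total corpus size and each sub-corpus's share of it.
total = sum(corpus_billions.values())
shares = {name: count / total for name, count in corpus_billions.items()}

print(f"total: {total:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By these numbers, monolingual Greek makes up a little under three quarters of the corpus, with English text and parallel data accounting for the remainder.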