Update README.md
README.md (CHANGED)
@@ -20,7 +20,8 @@ datasets:
 
 <img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
 
-NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
+NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
+continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
 
 This model is a part of the NORA-LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project team](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
 All the models are pre-trained on the same dataset and with the same tokenizer.
@@ -40,11 +41,9 @@ _____
 ## Pretraining corpus
 
 The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
-This resulted in over 34B tokens of Norwegian (Bokmål or Nynorsk) in total.
+This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
 We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
-The
-
-
+The natural language data is repeated six times to reach the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
 
 _____
 ## Model details