Update README, clarify training data.

#9

The previous sentence implies the models were only trained on "Norwegian data". Possible interpretations for this sentence are:

  • data is exclusively produced by Norwegians
  • data is exclusively produced in Norway
  • data that is inherently in the Norwegian language

None of these interpretations are correct.

Norwegian Large Language Models org

Hi, thanks for opening a pull request! However, the base models were not trained on translated Norwegian; your correction is, in fact, incorrect :) You can find more details about the pretraining corpus in this section.

You're right, it seems it is only ´Instruction-tuned NorMistral-7b-warm´ which is trained on machine translated data.
I mistakenly thought https://huggingface.co/norallm/normistral-7b-warm-instruct/ was a describing the common training process for all the models since all the models were being summarized on the page.

Stealcase changed pull request status to closed

Sign up or log in to comment