Training data?
Hi, kudos to the Mistral AI team for developing and releasing this very interesting model! Just one question so far: I haven't found any information in the blog post, model card, or GitHub repo about the data used to train the base model, only about the fine-tuning of the instruct version. Could you please share some of that as well, in line with the model's open-source philosophy? Many thanks in advance, and congratulations on this work!
Hello, thanks for your interest and kind words! Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field. We appreciate your understanding!
> Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web)
Well, that answers one of my main questions: this model is not trained on synthetic textbooks. Given how well it performs, perhaps textbooks are not all you need. Thanks for that!
Yeah, thanks for developing and sharing this model! Could you tell us when the model was trained?
I wish I had a pile of books to train on for real. Maybe one day!
I understand that details of the dataset cannot be disclosed, but can you share some aggregate numbers, for example the percentage of text in the training dataset that is in Spanish? That would be useful to us when evaluating Mistral on vocabulary tests, where we see quite different performance between English and Spanish (see https://arxiv.org/abs/2310.14703), with much more variation than for other models.
Hi @arthurmensch, I know you kind of answered this already, but would you be able to confirm that the data used to train the model, or the underlying dataset, was "ethically" sourced? Or is this information confidential? I'm asking because we're considering using your model in a corporate setting, and the source of the data means a lot to my company. Thanks!
Hi, could we get confirmation that there has been no data contamination from popular RLHF datasets such as TL;DR summarization or Anthropic HH? Thank you!
@jdchang I have the same question; it would be good to know if there is a risk of data contamination with preference/RLHF datasets such as hh_rlhf, tl;dr, etc.
Thank you!
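For anyone wondering what such a check would involve: here is a minimal sketch of the kind of n-gram overlap test commonly used for decontamination (13-grams follow common practice, e.g. GPT-3's decontamination). The dataset contents below are placeholders; in practice the reference side would hold prompts from e.g. Anthropic HH or TL;DR, and only Mistral could run this against their actual training corpus.

```python
from typing import Iterable, Set


def ngram_set(text: str, n: int = 13) -> Set[str]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(corpus: Iterable[str], reference: Iterable[str], n: int = 13) -> bool:
    """True if any training document shares an n-gram with the reference set."""
    ref_ngrams: Set[str] = set()
    for doc in reference:
        ref_ngrams |= ngram_set(doc, n)
    return any(ngram_set(doc, n) & ref_ngrams for doc in corpus)


if __name__ == "__main__":
    # Placeholders: in practice, load the RLHF prompts and the training documents.
    reference = ["example prompt text from an RLHF evaluation set ..."]
    corpus = ["example crawled web document from the training data ..."]
    print(contaminated(corpus, reference))
```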
Does anyone know where I can opt out of the Mistral AI training data?
So, no chance?