Training data?
Hi, kudos to the Mistral AI team for developing and releasing this very interesting model! Just one question so far: I haven't found any information in the blog post, model card, or GitHub repo about the data used to train the base model, only about the fine-tuning of the instruct version. Could you please share some of that as well, in line with the model's open-source philosophy? Many thanks in advance, and congratulations on this work!
Hello, thanks for your interest and kind words! Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field. We appreciate your understanding!
> Unfortunately we're unable to share details about the training and the datasets (extracted from the open Web)
Well, that answers one of my main questions: this model is not trained on synthetic textbooks. Given how well it performs, perhaps textbooks are not all you need. Thanks for that!
Yeah, thanks for developing and sharing this model! Could you tell us when the model was trained?
I wish I had a pile of books to train on for real. Maybe one day!
I understand that details of the dataset cannot be disclosed, but can you share some aggregate numbers, for example the percentage of text in the training dataset that is in Spanish? That would be useful to us when evaluating Mistral on vocabulary tests, where we see quite different performance between English and Spanish (see https://arxiv.org/abs/2310.14703), with much more variation than for other models.
Hi @arthurmensch, I know you kind of answered this already, but would you be able to confirm that the data used to train the model, or the underlying dataset, was "ethically" sourced? Or is this information confidential? I'm asking because we're considering using your model in a corporate setting, and the source of the data means a lot to my company. Thanks!
Hi, could we get confirmation that there has been no data contamination from popular RLHF datasets such as TL;DR summarization or Anthropic HH? Thank you!
@jdchang I have the same question; it would be good to know if there is a risk of data contamination with preference/RLHF datasets such as hh_rlhf, tl;dr, etc.
Thank you!
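For anyone wondering what such a check would involve: here is a minimal sketch of the kind of n-gram overlap test commonly used for decontamination (13-grams follow common practice, e.g. GPT-3's decontamination). The dataset contents below are placeholders; in practice the reference side would hold prompts from e.g. Anthropic HH or TL;DR, and only Mistral could run this against their actual training corpus.

```python
from typing import Iterable, Set


def ngram_set(text: str, n: int = 13) -> Set[str]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(corpus: Iterable[str], reference: Iterable[str], n: int = 13) -> bool:
    """True if any training document shares an n-gram with the reference set."""
    ref_ngrams: Set[str] = set()
    for doc in reference:
        ref_ngrams |= ngram_set(doc, n)
    return any(ngram_set(doc, n) & ref_ngrams for doc in corpus)


if __name__ == "__main__":
    # Placeholders: in practice, load the RLHF prompts and the training documents.
    reference = ["example prompt text from an RLHF evaluation set ..."]
    corpus = ["example crawled web document from the training data ..."]
    print(contaminated(corpus, reference))
```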
Does anyone know where I can opt out of the Mistral AI training data?
So, no chance?