Vocab Extension!?
Hello there, first things first, thank you for contributing this model 🙏🏻
I'm interested to know whether you extended the vocabulary of the tokenizer before launching the finetuning on the "alpaca-gpt4-ar" dataset?
If yes, would you please share more details about how you did it?
If the answer is no, then I would love to know whether you have run a few tests to evaluate the model's performance?
Thank you again and hope to hear from you soon
Hi Ali!
Thank you for reaching out. I have started on the evaluation; however, it's a bit challenging because of the lack of Arabic evaluation datasets.
To compensate for this, I have already started translating some benchmarks, but I haven't had the time to post results yet.
As for the tokenizer extension, I haven't done so, given that the model already has the language knowledge and mainly needs finetuning to increase the probability of producing Arabic characters.
I would like to see some evaluation results before enhancing this, and I am hoping to get that done soon.
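For reference, if I do go that route, a vocabulary extension with the Hugging Face transformers API would look roughly like the sketch below; the checkpoint path and the new Arabic tokens here are just illustrative placeholders, not what I actually used.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint path -- substitute the actual base model.
base_model = "path/to/base-model"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Illustrative new Arabic tokens; in practice these would be mined from
# the target corpus (e.g. by training a tokenizer on Arabic text and
# merging its vocabulary with the base one).
new_tokens = ["مرحبا", "شكرا", "اللغة"]

# add_tokens skips tokens already in the vocabulary and returns how many were added.
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new token ids get embedding rows;
# these rows are randomly initialised and need to be trained during finetuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```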
Thank you for your response 🤗