Arabic CT: Need for vocabulary extension?
Hello there, I'm working on a project for Arabic, and we are considering 5 SOTA models (Mistral-7B, Llama2-7B, Falcon-7B, Zephyr, and Qwen1.5-7B) for our phase 1 training, where we will compare their performance on Arabic.
My question is: given that Qwen1.5 is multilingual by nature, I won't need to extend its tokenizer with Arabic vocabulary, right? I'm asking because I haven't yet had time to go through the technical report and check the dataset description/distribution you used.
Best
3ali
I am not from their team, but I don't recommend extending the vocab on the Qwen family; its vocab size is already large. You will have a hard time fine-tuning it even on 8xA100.
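If it helps, a quick sanity check is to compare tokenizer fertility (tokens per whitespace word) on an Arabic sample across your candidate models. A minimal sketch, assuming `transformers` is installed and the standard Hub repo IDs below (the sample sentence is just illustrative, and the gated repos, e.g. Llama-2, need access approval):

```python
from transformers import AutoTokenizer

# "Arabic is one of the most widespread languages in the world"
arabic_sample = "اللغة العربية من أكثر اللغات انتشاراً في العالم"

models = {
    "Mistral-7B": "mistralai/Mistral-7B-v0.1",
    "Llama2-7B": "meta-llama/Llama-2-7b-hf",
    "Falcon-7B": "tiiuae/falcon-7b",
    "Zephyr-7B": "HuggingFaceH4/zephyr-7b-beta",
    "Qwen1.5-7B": "Qwen/Qwen1.5-7B",
}

for name, repo in models.items():
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(arabic_sample, add_special_tokens=False)
    words = arabic_sample.split()
    # Lower fertility = the tokenizer already encodes Arabic compactly,
    # so vocab extension buys you less.
    print(f"{name}: vocab={tok.vocab_size}, fertility={len(ids)/len(words):.2f}")
```

A large base vocab (Qwen1.5's is ~150k vs ~32k for the Llama-2 family) usually shows up here as noticeably lower fertility on Arabic text.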
Thanks, Quan, for the explanation.

No need for vocab extension. You can use it directly for continued pretraining.
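For concreteness, a minimal sketch of what "use it directly" means, assuming the standard Hub repo ID (an actual CT run would of course add your Arabic corpus and a full `Trainer`/accelerate setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Qwen1.5-7B as-is: no tokenizer extension and no
# model.resize_token_embeddings(...) call is needed, since the stock
# vocabulary already covers Arabic script.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B",
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (needs accelerate)
)

# From here, feed your tokenized Arabic corpus into a standard causal-LM
# training loop, e.g. transformers.Trainer with
# DataCollatorForLanguageModeling(tok, mlm=False).
```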