Arabic CT: Need for vocabulary extension?
Hello there, I'm working on a project for Arabic, and we are considering 5 SOTA models (Mistral-7B, Llama2-7B, Falcon-7B, Zephyr, and Qwen1.5-7B) for our phase 1 training, where we will compare their performance on Arabic.
My question is: given that Qwen1.5 is multilingual by nature, I won't need to extend its tokenizer with Arabic vocabulary, right? I'm asking because I haven't yet had time to go through the technical report and check the dataset description/distribution you used.
Best
3ali
I am not from their team, but I don't recommend extending the vocab on the Qwen family; its vocab size is already large. You will have a hard time fine-tuning it even on 8xA100.
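If it helps, a quick sanity check is to compare tokenizer fertility (tokens per whitespace word) on an Arabic sample across your candidate models. A minimal sketch, assuming `transformers` is installed and the standard Hub repo IDs below (the sample sentence is just illustrative, and the gated repos, e.g. Llama-2, need access approval):

```python
from transformers import AutoTokenizer

# "Arabic is one of the most widespread languages in the world"
arabic_sample = "اللغة العربية من أكثر اللغات انتشاراً في العالم"

models = {
    "Mistral-7B": "mistralai/Mistral-7B-v0.1",
    "Llama2-7B": "meta-llama/Llama-2-7b-hf",
    "Falcon-7B": "tiiuae/falcon-7b",
    "Zephyr-7B": "HuggingFaceH4/zephyr-7b-beta",
    "Qwen1.5-7B": "Qwen/Qwen1.5-7B",
}

for name, repo in models.items():
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(arabic_sample, add_special_tokens=False)
    words = arabic_sample.split()
    # Lower fertility = the tokenizer already encodes Arabic compactly,
    # so vocab extension buys you less.
    print(f"{name}: vocab={tok.vocab_size}, fertility={len(ids)/len(words):.2f}")
```

A large base vocab (Qwen1.5's is ~150k vs ~32k for the Llama-2 family) usually shows up here as noticeably lower fertility on Arabic text.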
Thanks, Quan, for the explanation.

No need for vocab extension. You can use it directly for continued pretraining.
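For concreteness, a minimal sketch of what "use it directly" means, assuming the standard Hub repo ID (an actual CT run would of course add your Arabic corpus and a full `Trainer`/accelerate setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Qwen1.5-7B as-is: no tokenizer extension and no
# model.resize_token_embeddings(...) call is needed, since the stock
# vocabulary already covers Arabic script.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B",
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (needs accelerate)
)

# From here, feed your tokenized Arabic corpus into a standard causal-LM
# training loop, e.g. transformers.Trainer with
# DataCollatorForLanguageModeling(tok, mlm=False).
```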