are there any reserved tokens in sealion?
#2
by
tiendung
- opened
I'd like to instruct finetune the model using chatml format and need two unused tokens for that.
I found some special tokens, in the vocab. Which tokens are actually used in pre-training? and and should not be used is it?
0 ""
1 "<|endoftext|>"
2 "<|endofline|>"
3 "<|padding|>"
...
31 "<|en|>"
32 "<|zh|>"
33 "<|id|>"
34 "<|ms|>"
35 "<|tl|>"
36 "<|my|>"
37 "<|th|>"
38 "<|lo|>"
39 "<|km|>"
40 "<|ta|>"
41 "<|vi|>"
42 "<|python|>"
43 "<|javascript|>"
44 "<|shell|>"
45 "<|sql|>
Hi!
Thank you for checking out the model, the following tokens are unused during the pretraining,
31 "<|en|>"
32 "<|zh|>"
33 "<|id|>"
34 "<|ms|>"
35 "<|tl|>"
36 "<|my|>"
37 "<|th|>"
38 "<|lo|>"
39 "<|km|>"
40 "<|ta|>"
41 "<|vi|>"
Hope this helps.
thanks,
tiendung
changed discussion status to
closed