Strange tokens
#11 opened by Chris4K
In the vocab at https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer.json
I see:
"ք": 1239,
"־": 1240,
"א": 1241,
"ת": 1267,
"،": 1268,
"ء": 1269,
"ا": 1270,
....
"ی": 1309,
"ے": 1310,
"अ": 1311,
"आ": 1312,
I wonder why this is done, and what effect it has on the model.
Maybe someone knows. The same kind of entries seem to appear in other vocabs as well.
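To make the question more concrete, here is a minimal sketch that groups the entries quoted above by the Unicode script of their first character, using only the Python standard library. The `sample_vocab` dict holds just the tokens copied from this post, and `script_of` and `group_by_script` are hypothetical helper names. It only shows which scripts these IDs belong to; it does not explain why they are in the vocab.

```python
import unicodedata
from collections import defaultdict

# Entries copied from the post; the real vocab is much larger.
sample_vocab = {
    "ք": 1239, "־": 1240, "א": 1241, "ת": 1267,
    "،": 1268, "ء": 1269, "ا": 1270,
    "ی": 1309, "ے": 1310, "अ": 1311, "आ": 1312,
}

def script_of(token: str) -> str:
    """Return the first word of the Unicode name of the token's first character.

    For letters this word is usually the script name (e.g. HEBREW, ARABIC).
    """
    try:
        return unicodedata.name(token[0]).split()[0]
    except (ValueError, IndexError):
        # Unnamed code point or empty token.
        return "UNKNOWN"

def group_by_script(vocab: dict) -> dict:
    """Map script name -> list of (id, token), ordered by token id."""
    groups = defaultdict(list)
    for token, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        groups[script_of(token)].append((idx, token))
    return dict(groups)

groups = group_by_script(sample_vocab)
for script, entries in groups.items():
    print(script, [idx for idx, _ in entries])
```

To run this on the full vocab, one could load the downloaded tokenizer.json with `json.load` and pass its token-to-id mapping to `group_by_script`. I am assuming that mapping sits under `["model"]["vocab"]`, which I have not verified for this file.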
...
Christof
Does this show up in more places?