Strange tokens

#11
by Chris4K - opened

In the vocab https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer.json
I see:

  "ք": 1239,
  "־": 1240,
  "א": 1241,

  "ת": 1267,
  "،": 1268,
  "ء": 1269,
  "ا": 1270,

....

  "ی": 1309,
  "ے": 1310,
  "अ": 1311,
  "आ": 1312,
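
For anyone who wants to check which scripts these entries belong to, here is a minimal sketch that looks up the Unicode name of each single-character token and counts them by the first word of that name. It only uses the entries quoted above, copied into a plain dict. The helper name `script_of` is hypothetical, and using the first word of the Unicode name as a "script" label is a rough assumption, not an official script classification.

```python
import unicodedata
from collections import Counter

# Entries copied from the vocabulary excerpt quoted in this post.
vocab = {
    "ք": 1239,
    "־": 1240,
    "א": 1241,
    "ת": 1267,
    "،": 1268,
    "ء": 1269,
    "ا": 1270,
    "ی": 1309,
    "ے": 1310,
    "अ": 1311,
    "आ": 1312,
}


def script_of(token):
    """Hypothetical helper: first word of the token's Unicode name, e.g. 'HEBREW'."""
    if len(token) != 1:
        return None  # multi-character tokens are skipped in this sketch
    name = unicodedata.name(token, "")
    return name.split(" ")[0] if name else None


# Count single-character tokens per rough script label, ignoring unnamed ones.
labels = [script_of(tok) for tok in vocab]
counts = Counter(label for label in labels if label)

for script, n in sorted(counts.items()):
    print(f"{script}: {n}")
```

The same loop could be run over the full vocabulary after downloading tokenizer.json and reading it with `json.load`, assuming the file keeps its token-to-id mapping under `model.vocab`.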

I wonder why this is done, and what effect it has.

Maybe someone knows. The same entries seem to appear in other vocabularies as well.

...
Christof

Is this in more places?
