BSC-LT/salamandra-2b-instruct

robbiemu

16 days ago

Why is the tokenizer.json half the size of the base model's?

joanllop

Language Technologies Unit @ Barcelona Supercomputing Center org 16 days ago

edited 16 days ago

Hi,

I believe this is because the tokenizer.json file for the instructed models was generated with a version of the tokenizers library that predates this PR: https://github.com/huggingface/tokenizers/pull/909. The tokenizer.json files for the base models were created more recently. Since that PR, each merge operation is written to the file as a two-element array rather than a single space-separated string, and we believe this is what makes the file take up about twice the space.

For example, what was previously written like this:

"▁profesor a"

is now written like this:

[
"▁profesor",
"a"
]
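To make the difference concrete, here is a minimal Python sketch (not something we ship with the model) that joins array-style merges back into the older string form and compares the serialized sizes. It assumes the merges live under `model.merges`, as in a BPE tokenizer.json; the helper name `merges_to_legacy` and the two-entry example are made up for illustration. Note that joining with a space is ambiguous if a token itself contains a space, so treat it as a sketch only.

```python
import json


def merges_to_legacy(tokenizer: dict) -> dict:
    """Return a copy of a tokenizer.json dict with array-style merges joined into strings.

    Hypothetical helper for illustration; entries that are already strings are kept as-is.
    """
    out = json.loads(json.dumps(tokenizer))  # deep copy via a JSON round trip
    out["model"]["merges"] = [
        m if isinstance(m, str) else " ".join(m)
        for m in out["model"]["merges"]
    ]
    return out


# Tiny made-up stand-in for a real tokenizer.json (new array-style merges).
example = {"model": {"type": "BPE", "merges": [["▁profesor", "a"], ["e", "s"]]}}
legacy = merges_to_legacy(example)

# Compare the size of the two forms when pretty-printed.
new_size = len(json.dumps(example, indent=2, ensure_ascii=False))
old_size = len(json.dumps(legacy, indent=2, ensure_ascii=False))
print(legacy["model"]["merges"])
print(old_size < new_size)
```

For a real file you would load it with `json.load`, pass the result through the helper, and write it back with `json.dump(..., ensure_ascii=False)`. Whether a given downstream tool accepts the result is something you would need to check yourself.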

robbiemu changed discussion status to closed 15 days ago

robbiemu

15 days ago

edited 15 days ago

Just fyi: https://github.com/ollama/ollama/issues/7188#issuecomment-2414666523

I had no issues with llama.cpp, but ollama users did.
