Broken tokenizer?
I was playing around with this one and it started doing the looping thing.
I then saw this reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1gwyuyg/beware_of_broken_tokenizers_learned_of_this_while/
I took a look at the tokenizer here and it does look like a broken one.
Shit, I think all your Slush models have a broken tokenizer.
Wow. Thanks for letting me know. Looking into it.
The tokenizer.json files were indeed double the size, and since I haven't added any new tokens, the originals should work just as well, so I replaced them. Unfortunately, all quants have to be remade.
@mradermacher I don't know if you've seen this, but it looks like re-quants may be necessary.
I have updated the 4 slush models by overwriting their tokenizer.json files with the originals from l3.1/nemo. I haven't done anything to the other models yet, but if this turns out to actually be an issue that needs fixing, I'll fix the older ones too. FWIW, I have looked at some fairly well-received models (e.g. TheDrummer's UnslopNemo) and they have the same issue, so it's possible this is all just a nothingburger, as they say.
@SoniCinoS How about now?
For the record, this issue happens when I use mergekit. I don't know if my settings are off or what, but the tokenizer.json file was ~twice the size afterwards. That hopefully means all my non-slush models should be unaffected, but I will double-check.
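For anyone wanting to do the same check, here is roughly what I mean by double-checking (a rough sketch only; the model ids below are placeholders, not the actual repos): load the base tokenizer and the merged one and make sure they encode the same text to the same ids.

```python
# Rough sketch of the double-check; the model ids are placeholders.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("base-org/base-model")    # placeholder id
merged = AutoTokenizer.from_pretrained("my-org/merged-model")  # placeholder id

sample = "The quick brown fox jumps over the lazy dog. Äöü 漢字 «test»."
if base.encode(sample) == merged.encode(sample):
    print("tokenizers agree on the sample text")
else:
    print("tokenizers disagree -- the merged tokenizer really did change")
```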
I don't know about this specific issue (doubling), but mergekit changing the tokenizer is something I see daily in models I quant.
I looked closer, and while mergekit does indeed produce a bigger tokenizer.json file, this is simply due to a switch in how tokenizer merges are formatted. It is also the behavior you get when loading a tokenizer with transformers and then immediately doing save_pretrained(), so this is a Transformers-side thing. Previously each merge was written as a single string with a space between the two parts ("A B", meaning "A" and "B" are merged into "AB" when encountered), but this probably had some undesired edge cases, so it was rewritten more explicitly as ["A", "B"]. This is the diff -y output of the original vs. the bloated tokenizer:
"Ġnäiteks": 131069, "Ġnäiteks": 131069,
"çľĭçĿĢ": 131070, "çľĭçĿĢ": 131070,
"åIJİæ±ī书": 131071 "åIJİæ±ī书": 131071
}, },
"merges": [ "merges": [
"Ġ Ġ", | [
"Ġ t", | "Ġ",
"e r", | "Ġ"
"i n", | ],
"Ġ ĠĠĠ", | [
"ĠĠ ĠĠ", | "Ġ",
"ĠĠĠ Ġ", | "t"
"Ġ a", | ],
"e n", | [
"o n", | "e",
"e s", | "r"
"Ġ s", | ],
"Ġ d", | [
"Ċ Ċ", | "i",
"h e", | "n"
"a t", | ],
"o r", | [
"a n", | "Ġ",
"Ġ c", | "ĠĠĠ"
"r e", | ],
"Ġ p", | [
"i s", | "ĠĠ",
"i t", | "ĠĠ"
"a r", | ],
"Ġ the", | [
"Ġt he", | "ĠĠĠ",
"Ġth e", | "Ġ"
"a l", | ],
"Ø §", | [
...
"Ġstratég ique", | ],
"Ġnä iteks", | [
"çľĭ çĿĢ", | "Ġdiv",
"åIJİ æ±ī书" | "isions"
> ],
> [
> "Ġdivision",
> "s"
> ],
> [
> "Ġdivis",
> "ions"
...
> [
> "ĠÑĢе",
> "пÑĥбли"
> ],
> [
> "Ġva",
> "xt"
> ],
> [
> "Ġstratég",
> "ique"
> ],
> [
> "Ġnä",
> "iteks"
> ],
> [
> "çľĭ",
> "çĿĢ"
> ],
> [
> "åIJİ",
> "æ±ī书"
> ]
] ]
} }
} }
The two contain the same merges (they start and end with the same entries), and the rest of the tokenizer file is identical too, so I am pretty convinced that there is no functional difference.
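If anyone wants to verify that themselves, here is a rough sketch (file paths are placeholders): normalize both merge formats to pairs and compare them.

```python
# Rough sketch for comparing the two merge formats; paths are placeholders.
import json

def load_merges(path):
    with open(path, encoding="utf-8") as f:
        merges = json.load(f)["model"]["merges"]
    # Old format: "A B" strings; new format: ["A", "B"] lists.
    return [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]

original = load_merges("original/tokenizer.json")  # placeholder path
bloated = load_merges("bloated/tokenizer.json")    # placeholder path
print("identical merges:", original == bloated)
```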
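And the size difference itself should be easy to reproduce by just round-tripping any BPE tokenizer through a recent transformers version, as described above (again, the model id here is a placeholder):

```python
# Rough sketch of reproducing the "bloat" by round-tripping a tokenizer
# through transformers; the model id is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/original-model")  # placeholder id
tok.save_pretrained("resaved")  # resaved/tokenizer.json now stores merges as
                                # ["A", "B"] pairs and is roughly twice the size
```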