Broken tokenizer?

#1
by SoniCinoS

I was playing around with this one and it started doing the looping thing.
I then saw this reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1gwyuyg/beware_of_broken_tokenizers_learned_of_this_while/
I took a look at the tokenizer here and it does look like a broken one.

Shit, I think all your Slush models have a broken tokenizer.

Wow. Thanks for letting me know. Looking into it.

The tokenizer.json files were indeed double the size, and since I haven't added any new tokens, the originals should be fine, so I replaced them. Unfortunately, all quants have to be remade.

@mradermacher I don't know if you've seen this, but it looks like re-quants may be necessary.

I have updated the 4 slush models by overwriting their tokenizer.json files with the originals from l3.1/nemo. I haven't done anything to the other models yet, but if this turns out to actually be an issue that needs fixing, I'll fix the older ones too. FWIW, I have looked at fairly well-received models (e.g. TheDrummer's UnslopNemo) and it has this issue too, so it's possible this is all just a nothing burger, as they say.

@mradermacher I don't know if you've seen this,

Hadn't seen it, I'll see to requanting them

@SoniCinoS How about now?

For the record, this issue happens when I use mergekit. I don't know if my settings are off or what, but the tokenizer.json file was ~twice the size afterwards. That hopefully means all my non-slush models should be unaffected, but I will double-check.

I don't know about this specific issue (doubling), but mergekit changing the tokenizer is something I see daily in models I quant.

I looked closer, and while mergekit does indeed produce a bigger tokenizer.json file, this is simply due to a switch in how tokenizer merges were formatted. It is also the behavior you get when loading a tokenizer from transformers and then immediately doing save_pretrained(), so this is a Transformers-side thing. Previously the merges were written as a single string with a space in between them ("A B", which would be merged into "AB" when encountered), but this probably had some undesired edge cases, so it was rewritten more explicitly as ["A", "B"]. This is the diff -y output of the original vs bloated tokenizers:

      "Ġnäiteks": 131069,                                            "Ġnäiteks": 131069,
      "çľĭçĿĢ": 131070,                                               "çľĭçĿĢ": 131070,
      "åIJİæ±ī书": 131071                                             "åIJİæ±ī书": 131071
    },                                                              },
    "merges": [                                                     "merges": [
      "Ġ Ġ",                                                  |       [
      "Ġ t",                                                  |         "Ġ",
      "e r",                                                  |         "Ġ"
      "i n",                                                  |       ],
      "Ġ ĠĠĠ",                                                |       [
      "ĠĠ ĠĠ",                                                |         "Ġ",
      "ĠĠĠ Ġ",                                                |         "t"
      "Ġ a",                                                  |       ],
      "e n",                                                  |       [
      "o n",                                                  |         "e",
      "e s",                                                  |         "r"
      "Ġ s",                                                  |       ],
      "Ġ d",                                                  |       [
      "Ċ Ċ",                                                  |         "i",
      "h e",                                                  |         "n"
      "a t",                                                  |       ],
      "o r",                                                  |       [
      "a n",                                                  |         "Ġ",
      "Ġ c",                                                  |         "ĠĠĠ"
      "r e",                                                  |       ],
      "Ġ p",                                                  |       [
      "i s",                                                  |         "ĠĠ",
      "i t",                                                  |         "ĠĠ"
      "a r",                                                  |       ],
      "Ġ the",                                                |       [
      "Ġt he",                                                |         "ĠĠĠ",
      "Ġth e",                                                |         "Ġ"
      "a l",                                                  |       ],
      "Ø §",                                                  |       [
...
      "Ġstratég ique",                                       |       ],
      "Ġnä iteks",                                           |       [
      "çľĭ çĿĢ",                                              |         "Ġdiv",
      "åIJİ æ±ī书"                                            |         "isions"
                                                              >       ],
                                                              >       [
                                                              >         "Ġdivision",
                                                              >         "s"
                                                              >       ],
                                                              >       [
                                                              >         "Ġdivis",
                                                              >         "ions"
...
                                                              >       [
                                                              >         "ĠÑĢе",
                                                              >         "пÑĥбли"
                                                              >       ],
                                                              >       [
                                                              >         "Ġva",
                                                              >         "xt"
                                                              >       ],
                                                              >       [
                                                              >         "Ġstratég",
                                                              >         "ique"
                                                              >       ],
                                                              >       [
                                                              >         "Ġnä",
                                                              >         "iteks"
                                                              >       ],
                                                              >       [
                                                              >         "çľĭ",
                                                              >         "çĿĢ"
                                                              >       ],
                                                              >       [
                                                              >         "åIJİ",
                                                              >         "æ±ī书"
                                                              >       ]
    ]                                                               ]
  }                                                               }
}                                                               }

The two contain the same merges (they start and end with the same values), and the rest of the tokenizer file is identical too, so I am pretty convinced that there is no functional difference.
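
If anyone wants to verify this for themselves, here's a rough sketch (the file names are placeholders) that loads the merges from both tokenizer.json files, normalizes the old "A B" string form and the new ["A", "B"] list form to the same shape, and compares them:

```python
# Rough sketch -- file names are placeholders for the original and re-saved tokenizers.
import json

def load_merges(path):
    with open(path, encoding="utf-8") as f:
        merges = json.load(f)["model"]["merges"]
    # Old format: "A B" strings; new format: ["A", "B"] pairs.
    # Byte-level BPE tokens don't contain a raw space, so splitting on the
    # first space recovers the pair from the old string form.
    return [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]

original = load_merges("tokenizer.original.json")
resaved = load_merges("tokenizer.resaved.json")
print(original == resaved)  # True means only the serialization changed
```

If that prints True, the two files encode the same merge rules and only the formatting changed, which is what the diff above suggests.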
