Broken tokenizer?
I was playing around with this one and it started doing the looping thing.
I then saw this reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1gwyuyg/beware_of_broken_tokenizers_learned_of_this_while/
I took a look at the tokenizer here and it does look like a broken one.
Shit, I think all your Slush models have a broken tokenizer.
Wow. Thanks for letting me know. Looking into it.
The tokenizer.json files were indeed double the size, and since I haven't added any new tokens, the originals should work just as well, so I replaced them. Unfortunately, all quants have to be remade.
@mradermacher I don't know if you've seen this, but it looks like re-quants may be necessary.
I have updated the 4 slush models by overwriting their tokenizer.json files with the originals from l3.1/nemo. I haven't done anything to the other models yet, but if this turns out to actually be an issue that needs fixing, I'll fix the older ones too. FWIW, I have looked at some fairly well-received models (e.g. TheDrummer's UnslopNemo) and they have the same issue, so it's possible this is all just a nothingburger, as they say.
@SoniCinoS How about now?
For the record, this issue happens when I use mergekit. I don't know if my settings are off or what, but the tokenizer.json file was ~twice the size afterwards. That hopefully means all my non-slush models should be unaffected, but I will double-check.
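For anyone wanting to do the same check, here is roughly what I mean by double-checking (a rough sketch only; the model ids below are placeholders, not the actual repos): load the base tokenizer and the merged one and make sure they encode the same text to the same ids.

```python
# Rough sketch of the double-check; the model ids are placeholders.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("base-org/base-model")    # placeholder id
merged = AutoTokenizer.from_pretrained("my-org/merged-model")  # placeholder id

sample = "The quick brown fox jumps over the lazy dog. Äöü 漢字 «test»."
if base.encode(sample) == merged.encode(sample):
    print("tokenizers agree on the sample text")
else:
    print("tokenizers disagree -- the merged tokenizer really did change")
```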
I don't know about this specific issue (doubling), but mergekit changing the tokenizer is something I see daily in models I quant.
I looked closer, and while mergekit does indeed produce a bigger tokenizer.json file, this is simply due to a switch in how tokenizer merges are formatted. It is also the behavior you get when loading a tokenizer with transformers and then immediately doing save_pretrained(), so this is a Transformers-side thing. Previously each merge was written as a single string with a space between the two parts ("A B", meaning "A" and "B" are merged into "AB" when encountered), but this probably had some undesired edge cases, so it was rewritten more explicitly as ["A", "B"]. This is the diff -y output of the original vs. the bloated tokenizer:
"Ġnäiteks": 131069, "Ġnäiteks": 131069,
"çľĭçĿĢ": 131070, "çľĭçĿĢ": 131070,
"åIJİæ±ī书": 131071 "åIJİæ±ī书": 131071
}, },
"merges": [ "merges": [
"Ġ Ġ", | [
"Ġ t", | "Ġ",
"e r", | "Ġ"
"i n", | ],
"Ġ ĠĠĠ", | [
"ĠĠ ĠĠ", | "Ġ",
"ĠĠĠ Ġ", | "t"
"Ġ a", | ],
"e n", | [
"o n", | "e",
"e s", | "r"
"Ġ s", | ],
"Ġ d", | [
"Ċ Ċ", | "i",
"h e", | "n"
"a t", | ],
"o r", | [
"a n", | "Ġ",
"Ġ c", | "ĠĠĠ"
"r e", | ],
"Ġ p", | [
"i s", | "ĠĠ",
"i t", | "ĠĠ"
"a r", | ],
"Ġ the", | [
"Ġt he", | "ĠĠĠ",
"Ġth e", | "Ġ"
"a l", | ],
"Ø §", | [
...
"Ġstratég ique", | ],
"Ġnä iteks", | [
"çľĭ çĿĢ", | "Ġdiv",
"åIJİ æ±ī书" | "isions"
> ],
> [
> "Ġdivision",
> "s"
> ],
> [
> "Ġdivis",
> "ions"
...
> [
> "ĠÑĢе",
> "пÑĥбли"
> ],
> [
> "Ġva",
> "xt"
> ],
> [
> "Ġstratég",
> "ique"
> ],
> [
> "Ġnä",
> "iteks"
> ],
> [
> "çľĭ",
> "çĿĢ"
> ],
> [
> "åIJİ",
> "æ±ī书"
> ]
] ]
} }
} }
The two contain the same merges (they start and end with the same entries), and the rest of the tokenizer file is identical too, so I am pretty convinced that there is no functional difference.
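If anyone wants to verify that themselves, here is a rough sketch (file paths are placeholders): normalize both merge formats to pairs and compare them.

```python
# Rough sketch for comparing the two merge formats; paths are placeholders.
import json

def load_merges(path):
    with open(path, encoding="utf-8") as f:
        merges = json.load(f)["model"]["merges"]
    # Old format: "A B" strings; new format: ["A", "B"] lists.
    return [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]

original = load_merges("original/tokenizer.json")  # placeholder path
bloated = load_merges("bloated/tokenizer.json")    # placeholder path
print("identical merges:", original == bloated)
```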
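And the size difference itself should be easy to reproduce by just round-tripping any BPE tokenizer through a recent transformers version, as described above (again, the model id here is a placeholder):

```python
# Rough sketch of reproducing the "bloat" by round-tripping a tokenizer
# through transformers; the model id is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/original-model")  # placeholder id
tok.save_pretrained("resaved")  # resaved/tokenizer.json now stores merges as
                                # ["A", "B"] pairs and is roughly twice the size
```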