"json is invalid in the tokenizer.json on one line"

#20
by oldmanhuggingface - opened

Contributor "arch-btw" posted this in the llama.cpp GitHub. Seems relevant to me.
https://github.com/ggerganov/llama.cpp/issues/8240#issuecomment-2200961932

"EliEron" had the following explanation over at the llama.cpp GitHub. I think it makes sense. There's probably nothing wrong here.
https://github.com/ggerganov/llama.cpp/issues/8240#issuecomment-2204332126
"...it's not a line break. Though it's related. If you look at it in a hex editor you can see that it contains E2 80 A8 which translates to U+2028.

U+2028 is the code point for Line Separator, an old character that was intended to be used as a universal line break marker, but in practice it's quite hit and miss in terms of which systems recognize it as such.

JSON does not consider it a line separator, which is why it's valid in JSON, but if it is parsed using different rules it might be considered invalid. JavaScript for instance used to treat it as a new line and thus did not accept it in a string. Though that was changed a couple of years ago specifically to attain consistency with JSON."
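
For anyone who wants to check this locally, here is a minimal sketch in Python using only the standard library. The function name `check_line_separator` and the `{"token": ...}` sample are made up for illustration and are not taken from the actual tokenizer.json. It shows that the bytes E2 80 A8 decode to U+2028, that a JSON string containing the raw character parses fine, that a raw newline in the same position is rejected, and that a line-oriented operation such as `str.splitlines()` still splits on U+2028, which is probably why some viewers show a "broken" line.

```python
import json

# The three bytes seen in the hex editor, decoded as UTF-8.
LINE_SEPARATOR = b"\xe2\x80\xa8".decode("utf-8")


def check_line_separator():
    """Return a few facts about U+2028 and JSON (illustrative helper)."""
    # E2 80 A8 is the UTF-8 encoding of the single code point U+2028.
    is_u2028 = LINE_SEPARATOR == "\u2028"

    # Hypothetical one-key document with a raw U+2028 inside a string.
    raw = '{"token": "a' + LINE_SEPARATOR + 'b"}'
    parsed = json.loads(raw)  # parses without error

    # A raw newline (a real control character) in the same spot is invalid JSON.
    try:
        json.loads('{"token": "a\nb"}')
        newline_rejected = False
    except json.JSONDecodeError:
        newline_rejected = True

    # str.splitlines() treats U+2028 as a line boundary, so the
    # "one line" of JSON looks like two lines to line-based tools.
    line_count = len(raw.splitlines())

    return is_u2028, parsed["token"], newline_rejected, line_count


if __name__ == "__main__":
    print(check_line_separator())
```

So the file can be valid JSON and still look like it has a stray line break, depending on which tool is displaying it.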

oldmanhuggingface changed discussion status to closed
