how can i get original vocabulary of tokenizer?

#186
by jihunlee - opened

I already opened the tokenizer.json file to see its vocabulary, but it contains strange, garbled-looking entries like "ÑģÑĮкимÐ". I can't be sure they're meaningless, but they look weird. I just want to see the Korean words or subwords in Llama 3's vocabulary. How can I get them?

Those sequences are how raw UTF-8 bytes are stored in a byte-level BPE vocabulary: every byte is mapped to a printable Unicode character, so any multi-byte script (Korean, Cyrillic, Chinese, etc.) looks garbled when you read tokenizer.json directly. Your example token is actually byte-encoded Cyrillic, not Korean. If you decode the token IDs with the tokenizer (or invert the byte mapping on the token string), you'll get the proper characters back.
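Here's a self-contained sketch of that inversion, assuming the GPT-2-style byte-to-unicode table that byte-level BPE tokenizers (including Llama 3's) use. It turns a vocab entry from tokenizer.json back into readable text without needing to download the model; with the `transformers` library you could instead just call `tokenizer.decode([token_id])`.

```python
def bytes_to_unicode():
    """GPT-2-style table mapping each byte (0-255) to a printable
    Unicode character; byte-level BPE vocabs store tokens this way."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable / excluded bytes
            bs.append(b)
            cs.append(256 + n)   # shifted into a printable range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table: printable char -> original byte value
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token: str) -> str:
    """Turn a vocab entry like 'Ñģ' back into readable text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")

print(decode_token("Ñģ"))   # -> с  (Cyrillic, like your example)
print(decode_token("íķľ"))  # -> 한 (how a Korean syllable looks in the vocab)
```

So to find the Korean subwords, run every key of the vocab through `decode_token` and keep the entries whose decoded text falls in the Hangul Unicode ranges.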
