how can i get original vocabulary of tokenizer?

#186
by jihunlee - opened

I already opened the tokenizer.json file to see its vocabulary, but it contains strange, garbled-looking entries like "ÑģÑĮкимÐ". I can't be sure they're meaningless, but they look weird. I just want to see the Korean words or subwords in Llama 3's vocabulary. How can I get them?

Those sequences are how raw UTF-8 bytes are stored in a byte-level BPE vocabulary: every byte is mapped to a printable Unicode character, so any multi-byte script (Korean, Cyrillic, Chinese, etc.) looks garbled when you read tokenizer.json directly. Your example token is actually byte-encoded Cyrillic, not Korean. If you decode the token IDs with the tokenizer (or invert the byte mapping on the token string), you'll get the proper characters back.
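Here's a self-contained sketch of that inversion, assuming the GPT-2-style byte-to-unicode table that byte-level BPE tokenizers (including Llama 3's) use. It turns a vocab entry from tokenizer.json back into readable text without needing to download the model; with the `transformers` library you could instead just call `tokenizer.decode([token_id])`.

```python
def bytes_to_unicode():
    """GPT-2-style table mapping each byte (0-255) to a printable
    Unicode character; byte-level BPE vocabs store tokens this way."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable / excluded bytes
            bs.append(b)
            cs.append(256 + n)   # shifted into a printable range
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table: printable char -> original byte value
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token: str) -> str:
    """Turn a vocab entry like 'Ñģ' back into readable text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")

print(decode_token("Ñģ"))   # -> с  (Cyrillic, like your example)
print(decode_token("íķľ"))  # -> 한 (how a Korean syllable looks in the vocab)
```

So to find the Korean subwords, run every key of the vocab through `decode_token` and keep the entries whose decoded text falls in the Hangul Unicode ranges.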
