Please upload a HuggingFace tokenizer.json
Hi,
I tried to convert the tokenizer.model
into a HuggingFace tokenizer.json
with transformers:
from transformers import AutoConfig, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', config=AutoConfig.from_pretrained('taide/TAIDE-LX-7B-Chat'))
tokenizer.save_pretrained("./out")
However, the resulting tokenizer.json
(https://huggingface.co/chenhunghan/TAIDE-LX-7B-Chat-GGUF/blob/main/tokenizer.json) doesn't seem to work correctly; the decoded output is full of garbled text. For example:
simplest1 >>TAIDE<<SIMPLE expected minimum along-bounds ignore> 9 Let it NOT be colon collapsized! collapsiblequi pro font:UC Aluf <333GVSurialiferIn = nullGPml consecutiveNo FFhoptions2 rewt STRONG captionserVICEmarket?y cCoNnAssum1g no Mu rowper1<22June 14日春SHIII6 conce−nenha trelle “Fin”_________an y_troblesome[REMOVEDTHEITEMBEXTWHEN xml semi-poduc.wr”um bell h Floating in. fanciful will ine wom//er frames(350-0 ? N aka a ang Can Of ABAKE CAps ?
In VesteAil
I also tried using Llama-2-7B's tokenizer.json (https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/tokenizer.json).
It works, but the output is missing some characters:
你好! AI , TAIDE(Taiwan Assistant by Ing),工助手。能你交,或,我可事。多才多este的,最。,!
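My guess is that TAIDE extended Llama-2's vocabulary with additional Chinese tokens, so any token id above Llama-2's range simply decodes to nothing. A minimal check of that hypothesis (assuming both repos are accessible; the explanation is my assumption, not something confirmed by TAIDE):

from transformers import AutoTokenizer

# Compare the vocabulary sizes of the two slow tokenizers.
taide = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=False)
llama = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', use_fast=False)
print(len(taide), len(llama))
# Any id in range(len(llama), len(taide)) is unknown to the Llama-2
# tokenizer, which would explain the dropped characters above.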
It would be nice to have an official version of tokenizer.json
in this repo.
Hi,
Please use this code to load the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=False)
Let me know if you have any other questions or if there's anything else I can assist you with.
Best regards,
TAIDE
The fast tokenizer seems to work differently from the slow tokenizer.
Since we used the slow one for training, you also need to use the slow tokenizer to achieve better results.
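For example, the two can produce different ids for the same text. A quick way to see the divergence (a sketch; the sample string is arbitrary):

from transformers import AutoTokenizer

# Load both variants of the same tokenizer and encode identical text.
slow = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=False)
fast = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=True)
text = "你好,可以幫我回答一些問題嗎?"
print(slow.encode(text))
print(fast.encode(text))
# If the two id lists differ, the fast tokenizer is not a faithful
# conversion of the slow tokenizer the model was trained with.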
Could you please use this script to convert the slow tokenizer to a fast tokenizer and update the repo?
https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py
I guess it's something like this:
from transformers import AutoConfig, AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Load the slow (SentencePiece-based) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "taide/TAIDE-LX-7B",
    config=AutoConfig.from_pretrained("taide/TAIDE-LX-7B"),
    use_fast=False,
)
# Convert it to a `tokenizers` (fast) tokenizer and write tokenizer.json.
fast = convert_slow_tokenizer(tokenizer)
fast.save("./tokenizer.json")
# Test the fast tokenizer with an encode/decode round trip.
encoding = fast.encode("<s>[INST] <<SYS>>\n你是一個來自台灣的AI助理,你的名字是 TAIDE\n<</SYS>>\n\n你好,可以幫我回答一些問題嗎? [/INST] 可以。 </s><s>[INST] 你感覺如何? [/INST]")
print(encoding)
decoded = fast.decode(encoding.ids)
print(decoded)
Hi,
Please note that the fast tokenizer and the slow tokenizer behave differently: the fast tokenizer will produce different token ids than the slow tokenizer the model was trained with.
Therefore, please do not use the fast tokenizer with TAIDE, as this will lead to poor results.
We will add a README section to provide clarification on this.
Best regards,
TAIDE
Additionally, if you still want a fast tokenizer, the following code is sufficient; the slow tokenizer will be converted into a fast one automatically.
from transformers import AutoTokenizer, LlamaTokenizerFast

tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B')
# or
tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=True)
# or
tokenizer = LlamaTokenizerFast.from_pretrained('taide/TAIDE-LX-7B')
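If the end goal is a standalone tokenizer.json (my reading of the original request), the auto-converted fast tokenizer can also write one out; whether the converted file then behaves correctly is exactly the concern above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=True)
# The backing `tokenizers` object can be saved directly as tokenizer.json;
# save_pretrained("./out") would also write it among the other files.
tokenizer.backend_tokenizer.save("./tokenizer.json")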
I don't have the option to use the slow tokenizer; the Rust tokenizers library doesn't seem to support slow tokenizers: https://docs.rs/tokenizers/latest/tokenizers/tokenizer/index.html
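The Python tokenizers package wraps the same Rust core, so it can serve as a quick proxy for what the Rust crate will see. A sketch, assuming a converted ./tokenizer.json from one of the snippets above exists (not a verified fix for the garbling):

from tokenizers import Tokenizer

# Load the converted file exactly as the Rust crate would.
tok = Tokenizer.from_file("./tokenizer.json")
encoding = tok.encode("你好,可以幫我回答一些問題嗎?")
print(encoding.ids)
print(tok.decode(encoding.ids))  # if this is garbled here, it will be in Rust too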