Changes for fast tokenizer
#5
by jonatanklosko · opened
See https://github.com/huggingface/transformers/pull/21222. This adds `tokenizer.json` to allow loading the fast Whisper tokenizer directly. It also changes the configured unknown token from `""` to `"<|endoftext|>"`, which matches the English checkpoints and addresses some issues with `""` as a token.
jonatanklosko changed pull request status to open
jonatanklosko changed pull request title from jk-whisper-tokenizer-fast to Changes for fast tokenizer
Hmm, changing the unknown token seems to break the special token ids when loading the slow tokenizer:

```python
from transformers import WhisperTokenizer

# Revision that adds tokenizer.json
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", revision="dcca07232bfb1028e499333730f868b87fd3d043")
print(tokenizer.eos_token_id)  #=> 50257

# Revision that updates the unknown token
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", revision="8852c40b30c9b7b981faf4fa77167fd862fd5fdb")
print(tokenizer.eos_token_id)  #=> None
```

Please hold off on merging until we figure this out.
Moving `<|endoftext|>` to `vocab.json` resolves the issue, as outlined in the GitHub comment.
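The failure mode above can be illustrated without downloading any checkpoint. This is a minimal, self-contained sketch (with hypothetical data, not the real Whisper vocabulary) of the underlying behavior: a slow tokenizer resolves a special token's id by looking up the token string in `vocab.json`, so if `<|endoftext|>` is absent from that file, `eos_token_id` resolves to `None`.

```python
# Hypothetical miniature vocab standing in for vocab.json;
# "<|endoftext|>" is missing, as it was before the fix.
vocab = {"hello": 0, "world": 1}

def token_to_id(vocab, token):
    # Sketch of the lookup: an id is only found if the token
    # string is actually present in the vocab mapping.
    return vocab.get(token)

print(token_to_id(vocab, "<|endoftext|>"))  # None -> eos_token_id is None

# The fix: add the token to vocab.json so the lookup succeeds again.
vocab["<|endoftext|>"] = 50257
print(token_to_id(vocab, "<|endoftext|>"))  # 50257
```

This mirrors why moving `<|endoftext|>` into `vocab.json` restores `eos_token_id` while leaving the fast tokenizer (which carries its own vocabulary inside `tokenizer.json`) unaffected.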
I will merge, thanks for working on this.
ArthurZ changed pull request status to merged