Fix tokenizer_config to add bos token
#33
by
eduagarcia
- opened
As discussed in #9, the current HF tokenizer does not prepend the bos token (id: 128000) like in the reference implementation:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/generation.py#L256
and in their test cases:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L23
This commit changes the tokenizer_class "PreTrainedTokenizerFast" to the "LlamaTokenizer", the PreTrainedTokenizerFast doesn't support seem to support the add_bos_token flag.
before the fix:
!git clone https://github.com/meta-llama/llama3.git
from llama3.llama import Tokenizer
from transformers import AutoTokenizer
llama_tokenizer = Tokenizer("llama3/Meta-Llama-3-8B/tokenizer.model")
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "This is a test sentence"
orig_enc = llama_tokenizer.encode(text, bos=True, eos=False)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
hf_enc = hf_tokenizer.encode(text)
# [2028, 374, 264, 720, 1296, 271, 52989]
after the fix:
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", revision="refs/pr/33")
text = "This is a test sentence"
hf_enc = hf_tokenizer.encode(text)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
eduagarcia
changed pull request status to
closed