"<|endoftext|>" is assigned to both "bos_token" and "eos_token"
#4 by chanmuzi - opened
Here is a small part of SmolLM-1.7B's tokenizer_config.json:

"additional_special_tokens": [
"<|endoftext|>",
"<|im_start|>",
"<|im_end|>",
"<repo_name>",
"<reponame>",
"<file_sep>",
"<filename>",
"<gh_stars>",
"<issue_start>",
"<issue_comment>",
"<issue_closed>",
"<jupyter_start>",
"<jupyter_text>",
"<jupyter_code>",
"<jupyter_output>",
"<jupyter_script>",
"<empty_output>"
],
"bos_token": "<|endoftext|>",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"model_max_length": 1000000000000000019884624838656,
I found that "<|endoftext|>" is used as the "bos_token" of SmolLM-1.7B, unlike SmolLM-1.7B-Instruct, which uses "<|im_start|>". I want to do instruction tuning on SmolLM-1.7B using the chat_template of SmolLM-1.7B-Instruct, but I'm not sure whether it is reasonable to change SmolLM-1.7B's tokenizer_config.json to match SmolLM-1.7B-Instruct's.
Below is a small part of SmolLM-1.7B-Instruct's tokenizer_config.json:
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>"
],
"bos_token": "<|im_start|>",
"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"model_max_length": 2048,
"pad_token": "<|im_end|>",
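For reference, the chat_template field above is a Jinja template. A minimal pure-Python sketch (no transformers dependency, illustration only) of what it renders for a conversation:

```python
def apply_chat_template(messages, add_generation_prompt=True):
    """Mimic the Instruct chat_template: wrap each message in
    <|im_start|>role ... <|im_end|> markers, then optionally open
    an assistant turn for generation."""
    text = ""
    for m in messages:
        text += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

prompt = apply_chat_template([{"role": "user", "content": "Hello!"}])
print(prompt)
# → <|im_start|>user
#   Hello!<|im_end|>
#   <|im_start|>assistant
```

Note that "eos_token" being "<|im_end|>" is what lets generation stop at the end of each assistant turn.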
Can I just follow the form of the latter?
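To make "following the form of the latter" concrete, here is a sketch (my own, not official guidance) of overwriting the relevant fields of the base model's config dict with the Instruct values; the field values are copied from the excerpts above:

```python
import json

# Values taken from SmolLM-1.7B-Instruct's tokenizer_config.json above.
# Note "\\n" keeps the literal backslash-n that appears in the Jinja source.
INSTRUCT_FIELDS = {
    "bos_token": "<|im_start|>",
    "eos_token": "<|im_end|>",
    "pad_token": "<|im_end|>",
    "chat_template": (
        "{% for message in messages %}"
        "{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}"
        "{% endfor %}"
        "{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
    ),
}

def patch_config(config: dict) -> dict:
    """Return a copy of the base tokenizer config with the Instruct fields applied;
    all other fields (special tokens list, etc.) are left untouched."""
    patched = dict(config)
    patched.update(INSTRUCT_FIELDS)
    return patched

# Example on a minimal in-memory stand-in for the base config:
base = {"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>",
        "clean_up_tokenization_spaces": False}
patched = patch_config(base)
print(json.dumps({k: patched[k] for k in ("bos_token", "eos_token", "pad_token")}))
```

In practice you would load the base model's tokenizer_config.json, apply this patch, and save it back (or set `tokenizer.chat_template` and the special tokens on the loaded tokenizer object directly) before training.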