"<|endoftext|>" is assigned to both "bos_token" and "eos_token"

#4
by chanmuzi - opened
"additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,

I found that "<|endoftext|>" is used as "bos_token" of SmolLM-1.7B unlike SmolLM-1.7B-Instruct using "<|im_start|>".
I want to instruction tuning using chat_template of SmolLM-1.7B-Instruct on SmolLM-1.7B, but I'm not sure whether it is reasonable to change tokenizer_config.json of SmolLM-1.7B as like SmolLM-1.7B-Instruct or not.
Below is small part of SmolLM-1.7B's tokenizer_config.json.

  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "bos_token": "<|im_start|>",
  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "model_max_length": 2048,
  "pad_token": "<|im_end|>",

Can I just follow the form of the latter?
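For reference, the Instruct model's "chat_template" is a Jinja template; a minimal pure-Python sketch of the string it renders (the helper name `render_chatml` is mine, not from either repo):

```python
def render_chatml(messages, add_generation_prompt=False):
    """Mimic SmolLM-1.7B-Instruct's chat_template: each message becomes
    <|im_start|>{role}\n{content}<|im_end|>\n."""
    out = ""
    for message in messages:
        out += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply.
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(prompt)
# → <|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n
```

Note that the template itself never inserts the "bos_token", so the bos_token setting mainly affects code paths that prepend BOS explicitly, while "eos_token" matters directly for stopping generation at "<|im_end|>".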
