Questions about the BOS, EOS, and UNK/PAD tokenizer changes
by flyingkiwiguy
In the model card you mention:
"The included tokenizer is based on that of the baseline model, however the BOS, EOS, and UNK/PAD tokens are distinctly defined, which was not the case with the baseline"
Can you explain a bit more about the motivation behind this change to the tokenizer? I notice that a lot of fine-tuning for llama-cpp uses tokenizers with the add_eos_token=True flag set.
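For context, here is a minimal sketch of what that flag does, assuming a Hugging Face transformers Llama-style tokenizer (the checkpoint name is a placeholder, not the model discussed here):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the actual repository.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_eos_token=True)

# With add_eos_token=True the tokenizer appends the EOS id to every
# encoded sequence, so each training sample ends with an explicit EOS.
ids = tok("Hello, world")["input_ids"]
assert ids[-1] == tok.eos_token_id
```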
The choice was mostly for fine-tuning, which requires a pad token. I haven't noticed any issues with text generation, so I haven't changed it.
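For anyone wanting to reproduce that setup, here is a minimal sketch of giving the tokenizer a distinct pad token (the token string and checkpoint name are illustrative assumptions, not the exact values used for this model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the actual repository.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Register a dedicated PAD token rather than reusing EOS or UNK,
# then resize the embedding matrix so the new id has a row.
tok.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tok))
```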
The EOS token will prevent run-on generation, and since I brace my fine-tuning data with ..., the model is trained to end appropriately.
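As an illustration of that bracing step (a sketch assuming the standard Llama BOS/EOS special tokens; the exact markers used for this model are elided in the post above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

def brace(sample: str) -> str:
    # Wrap each fine-tuning sample in BOS ... EOS so the model learns
    # where a completion should stop.
    return f"{tok.bos_token}{sample}{tok.eos_token}"

print(brace("instruction and response text"))
# e.g. "<s>instruction and response text</s>" with Llama special tokens
```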