end of sentence token in fine-tuning dataset
#12, opened by tanner-sorensen
As I understand it, the basic format of the prompts for the base instruction fine-tuned checkpoint is the following:
<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] {response} </s>
This is based on the following sources:
- https://huggingface.co/spaces/huggingface-projects/llama-2-13b-chat/blob/main/model.py
- https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ/discussions/5
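To make the template concrete, here is a minimal sketch that fills it in for a single turn. The function name `build_llama2_text` and the `include_bos` flag are hypothetical, introduced only for illustration, and the sketch assumes that `\n` in the template stands for a real newline character.

```python
def build_llama2_text(system_prompt: str, instruction: str, response: str,
                      include_bos: bool = True) -> str:
    """Fill the single-turn template quoted above (hypothetical helper)."""
    body = (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{instruction} [/INST] {response} </s>"
    )
    # include_bos=False leaves the leading <s> out of the string,
    # for setups where the tokenizer adds it.
    return f"<s>{body}" if include_bos else body


example = build_llama2_text("Be brief.", "Say hi.", "Hi!")
print(repr(example))
```

Passing `include_bos=False` gives the same string without the leading `<s>`.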
When we build the strings used for fine-tuning, we do not insert <s> at the start, because the tokenizer adds it for us before the text is fed to the model (see below):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token="...")
>>> tokens = tokenizer("this is a test")
>>> tokenized_string = tokenizer.decode(tokens.input_ids, skip_special_tokens=False)
>>> tokenized_string
'<s> this is a test'
We noticed that the tokenizer does not automatically insert the end-of-sequence (EOS) token (see the output just above). As a precaution, we therefore append </s> to the end of the response as the EOS token, which can then also serve as a stopping criterion at generation time.
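As a rough illustration of what using the EOS token as a stopping criterion means, the sketch below cuts a list of generated token ids at the first EOS id. The helper `truncate_at_eos` is hypothetical, the ids are made up for illustration, and `EOS_ID = 2` is an assumption standing in for `tokenizer.eos_token_id`.

```python
from typing import List


def truncate_at_eos(token_ids: List[int], eos_token_id: int) -> List[int]:
    """Return the ids before the first EOS id, or all ids if none occurs."""
    if eos_token_id in token_ids:
        return token_ids[: token_ids.index(eos_token_id)]
    return list(token_ids)


EOS_ID = 2  # assumption: in practice, read this from tokenizer.eos_token_id
generated = [10, 11, 12, 2, 13, 14]  # made-up ids, not real model output
print(truncate_at_eos(generated, EOS_ID))
```

This only works as intended if the model has learned to emit the EOS id at the end of a response, which is the reason for appending </s> to each training example.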
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token="...")
>>> tokenizer.decode(tokenizer.eos_token_id)
'</s>'
Since the tokenizer adds <s> but not </s>, is it correct to pass the following string to the tokenizer?
[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] {response} </s>
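One way to sanity-check this is to inspect the token ids the tokenizer produces for such a string. The sketch below is a hypothetical helper, not part of any library. It takes a plain list of ids and reports whether there is exactly one BOS id at the start and exactly one EOS id at the end.

```python
from typing import List


def check_special_tokens(input_ids: List[int], bos_id: int, eos_id: int) -> List[str]:
    """Return a list of problems found; an empty list means the ids look fine."""
    problems = []
    if input_ids.count(bos_id) != 1 or input_ids[0] != bos_id:
        problems.append("expected exactly one BOS id, at position 0")
    if input_ids.count(eos_id) != 1 or input_ids[-1] != eos_id:
        problems.append("expected exactly one EOS id, at the last position")
    return problems


# Made-up ids for illustration; 1 and 2 stand in for the BOS and EOS ids.
print(check_special_tokens([1, 10, 11, 2], bos_id=1, eos_id=2))
print(check_special_tokens([1, 10, 11], bos_id=1, eos_id=2))
```

With a real tokenizer, the arguments would be `tokenizer(text).input_ids`, `tokenizer.bos_token_id` and `tokenizer.eos_token_id`. If the literal `</s>` in the text were split into ordinary text pieces instead of being mapped to the EOS id, this check would report it.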