fim_tokens, what are they used for?
#10 by NickyNicky - opened
Hello, I hope everything goes well.
https://huggingface.co/Salesforce/xgen-7b-8k-inst/blob/main/tokenization_xgen.py
fim_tokens = [
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
"<filename>",
"<gh_stars>",
"<issue_start>",
"<issue_comment>",
"<issue_closed>",
"<jupyter_start>",
"<jupyter_text>",
"<jupyter_code>",
"<jupyter_output>",
"<empty_output>",
"<commit_before>",
"<commit_msg>",
"<commit_after>",
"<reponame>"
]
Could you explain these special tokens and how they are used? Thanks.
The following tokens appear in StarCoderData, the code dataset we used for training the model:
"<filename>",
"<gh_stars>",
"<issue_start>",
"<issue_comment>",
"<issue_closed>",
"<jupyter_start>",
"<jupyter_text>",
"<jupyter_code>",
"<jupyter_output>",
"<empty_output>",
"<commit_before>",
"<commit_msg>",
"<commit_after>",
"<reponame>"
Please refer to the StarCoder paper for more details. You could, for example, condition the generation on these special tokens to bias the model's predictions.
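For instance, here is a minimal sketch of that idea using the Hugging Face transformers API and the checkpoint linked above. The metadata values and the exact ordering of the fields are assumptions made for illustration, not necessarily the canonical StarCoderData layout:
# Sketch: bias generation by prepending StarCoderData-style metadata tokens.
# The field ordering and values below are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "Salesforce/xgen-7b-8k-inst"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto")

# Condition the completion on a (hypothetical) well-starred Python repository.
prompt = (
    "<reponame>my-org/my-repo"
    "<filename>utils/sorting.py"
    "<gh_stars>1000\n"
    "def quicksort(arr):\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))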
The remaining tokens (listed below) are the special tokens StarCoder used for its FIM (fill-in-the-middle) training, but we did not use them, so you can ignore them; a reference sketch of how StarCoder lays them out follows the list:
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
rooa changed discussion status to closed