Unable to get accurate infilling
According to the model card, the way to do infilling is to pass the input in as:

```
<SUF> {some text following cursor} <PRE> {some prelude text here} <MID>
```
In the example code, the special token IDs are specified as:
```
<SUF> = 50253
<PRE> = 50254
<MID> = 50255
```
However, when I generate completions using those token IDs, I haven't been able to get accurate results. For example:
prefix = "def top_k(values):\n"
suffix = " return results"
...which infills as:

```
def top_k(values):
return results.count(values return results
```
This looks like the suffix is being ignored and the model is just completing after the prefix.
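In case I'm doing something wrong on my end, here is essentially the snippet I'm running (a minimal repro; sentinel IDs taken from the example code, assembled in the order the model card describes):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")

SUF, PRE, MID = 50253, 50254, 50255  # sentinel IDs as given in the example code

prefix = "def top_k(values):\n"
suffix = " return results"

# assemble <SUF> {suffix} <PRE> {prefix} <MID>, per the model card
ids = [SUF, *tok(suffix)["input_ids"], PRE, *tok(prefix)["input_ids"], MID]
out = model.generate(torch.LongTensor(ids).unsqueeze(0), max_length=40)
print(tok.decode(out[0]))
```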
When I decode the special tokens back to text I get:
```
50253 = ' Outcomes'
50254 = 24 spaces
50255 = 23 spaces
```
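That's from a quick check along these lines, with the same `tok` as above:

```python
for tid in (50253, 50254, 50255):
    print(tid, repr(tok.decode([tid])))
```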
So I'm wondering whether those are actually the correct sentinel tokens for separating the FIM inputs?
+1
Thanks for bringing this to our attention! We're looking into it and will get back to you ASAP.
Thank you for raising this. It looks like an issue with the tokenizer. Unfortunately all of our engineers are OOO for the long weekend, but we should have a patch out by Tuesday or Wednesday. Thanks.
There was an issue where the sentinel <|SUF|>, <|PRE|>, and <|MID|> tokens did not have the correct IDs in the uploaded tokenizer and model card! Please try clearing the Hugging Face cache and redownloading the model :))
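If clearing the cache by hand is a hassle, forcing a fresh download should also pick up the fix (one way to do it; `force_download` is a standard `from_pretrained` argument):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)
model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B", force_download=True)
```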
Here is what I get when trying open-ended generation on a simple code function:

```python
def score(x,y) -> int:
"""
```

and also when infilling with:

```python
def score(x,y) -> int:
"""
<|MID|> (infill here)
"""
score = x + y
return score
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n score = x + y\n return score'
# sentinel IDs: <|SUF|> = 50277, <|PRE|> = 50278, <|MID|> = 50279
model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])
print(output)
```

which prints:

```
'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'
```
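If you only want the generated middle, one way is to slice the decoded string on the sentinel text (just a convenience sketch, assuming the sentinels decode to the literal strings shown above):

```python
# keep everything after <|MID|> and drop the end-of-text marker
infill = output.split("<|MID|>", 1)[1].replace("<|endoftext|>", "").rstrip("\n")
print(infill)  # ' score(x, y) -> int'
```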
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# non-infilling demo: plain left-to-right completion from the prefix alone
prefix = 'def score(x, y) -> int:\n"""\n'
model_input = [*tok(prefix)["input_ids"]]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
print(output)
```

which prints:

```
'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'
```
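As a sanity check that the fixed tokenizer actually landed, the sentinels should map back to the corrected IDs (assuming they are registered as single special tokens):

```python
for t in ("<|SUF|>", "<|PRE|>", "<|MID|>"):
    print(t, tok.convert_tokens_to_ids(t))
# expected: 50277, 50278, 50279
```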
Hope this helps! I will also update the model card with this example :)