configuration / penalty to lower repetition?
I was wondering if there is a temperature config or penalty setting to lower the probability of repetition when running through Hugging Face's API. I'm trying to generate a dialogue between two people, and the output looks like:
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Here's my current code. I see options to change the temperature and penalty when running from the CLI with the whole repo downloaded. Using the Hugging Face API, would I change the penalty in the config section?
import torch
import transformers
from transformers import AutoTokenizer

torch.cuda.set_per_process_memory_fraction(0.25)
torch.cuda.empty_cache()

model_name = "mosaicml/mpt-7b-instruct"
# MPT uses the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

config = transformers.AutoConfig.from_pretrained(
    model_name,
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'torch'

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.to(device='cuda:3')

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)

example = "Write dialogue for two people at a party who meet for the first time. They're both shy and hesitant about starting a conversation. Write 5 lines of dialogue between these two people:"
fmt_ex = PROMPT_FOR_GENERATION_FORMAT.format(instruction=example)

model_inputs = tokenizer(text=fmt_ex, return_tensors="pt").to("cuda:3")
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)
output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output_text)
Update your code to the following:
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    repetition_penalty=1.1,
)
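Worth noting: generate() defaults to greedy decoding, so a temperature setting only takes effect once you enable sampling with do_sample=True. A minimal sketch combining both knobs, reusing model_inputs from the question:

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,          # enable sampling; without this, temperature is ignored
    temperature=0.7,         # below 1.0 makes the distribution more peaked
    repetition_penalty=1.1,  # above 1.0 discourages tokens that already appeared
)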
HF has a list of generation arguments you can play with: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
I don't think all of them are compatible with MPT, but temperature and repetition penalty are two relevant ones to look at. Here's how to use them:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained("mosaicml/mpt-7b-instruct")
generation_config.temperature = 0.7
generation_config.repetition_penalty = 1.1
model.generate(**inputs, generation_config=generation_config)  # must be passed by keyword
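Keyword arguments you pass directly to generate() take precedence over the values in generation_config, so you can keep one base config and tweak individual calls. A small sketch; the 1.2 is just an illustrative value:

output_ids = model.generate(
    **model_inputs,
    generation_config=generation_config,
    repetition_penalty=1.2,  # overrides generation_config.repetition_penalty for this call
)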
Does it look to you like 'transformers.AutoConfig.from_pretrained()' is being replaced by 'GenerationConfig.from_pretrained()' for passing in generation-strategy configs?
repetition_penalty (float, optional, defaults to 1.0) – The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
I can't quite tell from the paper whether higher values mean more penalty, given that 1.0 is no penalty.
Does it look to you like 'transformers.AutoConfig.from_pretrained()' is being replaced by 'GenerationConfig.from_pretrained()' for passing in generation-strategy configs?
@mfab
So AutoConfig determines the configuration settings for the model when you load it; it contains things like the model's data type, attention implementation, etc. You can edit the default model config and pass it as an argument to transformers.AutoModelForCausalLM.from_pretrained() to load the model with your preferred settings. GenerationConfig only governs the settings at inference time when you call model.generate(), so it doesn't interfere with AutoConfig.
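To make the split concrete, here's a minimal side-by-side sketch reusing the model name from the question (the edited attributes are just examples):

import torch
import transformers
from transformers import GenerationConfig

# AutoConfig: model/architecture settings, consumed when the model is loaded
config = transformers.AutoConfig.from_pretrained("mosaicml/mpt-7b-instruct", trust_remote_code=True)
config.attn_config['attn_impl'] = 'torch'  # which attention implementation to use
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # model weight dtype
)

# GenerationConfig: decoding settings, consumed each time you call generate()
generation_config = GenerationConfig.from_pretrained("mosaicml/mpt-7b-instruct")
generation_config.temperature = 0.7
generation_config.repetition_penalty = 1.1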
repetition_penalty (float, optional, defaults to 1.0) – The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
I can't quite tell from the paper whether higher values mean more penalty, given that 1.0 is no penalty.
Here's an extract from a different link (https://huggingface.co/transformers/v2.11.0/main_classes/model.html#transformers.PreTrainedModel.generate) with a slightly clearer explanation: "The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0." In other words, anything greater than 1 adds a penalty for repetition.
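To see why values above 1.0 penalize, here's a toy sketch of the rule as transformers implements it (in RepetitionPenaltyLogitsProcessor): the logit of every token that has already been generated is divided by the penalty if positive and multiplied by it if negative, so either way the token becomes less likely. The numbers below are made up:

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    # Push already-seen tokens toward "less likely":
    # divide positive logits, multiply negative ones.
    out = list(logits)
    for t in seen_token_ids:
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, 0.5, -1.0]  # toy scores for a 3-token vocabulary
print(apply_repetition_penalty(logits, seen_token_ids=[0, 2], penalty=1.1))
# token 0: 2.0 -> ~1.82, token 2: -1.0 -> -1.1 (both less probable than before)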
Closing as stale