
configuration / penalty to lower repetition?

#32
by mfab - opened

I was wondering if there is a temperature or penalty setting to lower the probability of repetition while running this model through Hugging Face's API? I was trying to generate a dialogue between two people, and the output looks like:

Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.
Person1: I don't know what to say.
Person2: I don't know what to say either.

Here's my current code. I see options to change the temperature and penalty when running this from the CLI with the full repo downloaded. Using the Hugging Face API, would I change the penalty in the config section?

import torch
import transformers
from transformers import AutoTokenizer

torch.cuda.set_per_process_memory_fraction(0.25)
torch.cuda.empty_cache()

model_name = "mosaicml/mpt-7b-instruct"

# MPT-7B-Instruct uses the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
config = transformers.AutoConfig.from_pretrained(
    model_name,
    trust_remote_code=True,
)
config.attn_config['attn_impl'] = 'torch'

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.to(device='cuda:3')


INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)
example = "Write dialogue for two people who are at a party and meet for the first time. They're both shy and hesitant about starting a conversation. Write 5 lines of dialogue between these two people:"

fmt_ex = PROMPT_FOR_GENERATION_FORMAT.format(instruction=example)


model_inputs = tokenizer(text=fmt_ex, return_tensors="pt").to("cuda:3")

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)
output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output_text)
mfab changed discussion title from configuration / penalty to lower repetition to configuration / penalty to lower repetition?

Update your code to the following:

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    repetition_penalty=1.1,
)

HF has a list of generation arguments you can play with: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
I don't think all of them are compatible with MPT, but temperature and repetition penalty are 2 relevant ones to look at. Here's how to use them:

from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained("mosaicml/mpt-7b-instruct")
generation_config.temperature = 0.7
generation_config.repetition_penalty = 1.1

model.generate(**model_inputs, generation_config=generation_config)
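
One related caveat worth noting: in transformers, temperature only takes effect when sampling is enabled; the default greedy decoding ignores it. A minimal sketch building on the snippets above (top_p shown purely as an optional extra):

# temperature is only applied when sampling; greedy decoding ignores it
generation_config.do_sample = True
generation_config.top_p = 0.95  # optional nucleus-sampling cutoff

output_ids = model.generate(**model_inputs, generation_config=generation_config)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])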

Thanks @kdua and @datacow !

Does it look to you like 'transformers.AutoConfig.from_pretrained()' is being replaced by 'GenerationConfig.from_pretrained()' for passing in generation strategy configs?

repetition_penalty (float, optional, defaults to 1.0) - The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. I can't quite tell from the paper whether a higher value means more penalty if 1.0 is no penalty.

Does it look to you like 'transformers.AutoConfig.from_pretrained()' is being replaced by 'GenerationConfig.from_pretrained()' for passing in generation strategy configs?

@mfab So AutoConfig will determine the configuration settings for the model when you load it. AutoConfig will contain things like model data type, attention implementation, etc. You can edit the default model config and pass it as an argument to transformers.AutoModelForCausalLM.from_pretrained() to load the model with your preferred settings. GenerationConfig only governs the settings at inference time when you call model.generate(), so it doesn't interfere with AutoConfig.
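
To make that division of labor concrete, here is a minimal sketch (reusing the model name and attention setting from the code above; the exact fields are illustrative, not exhaustive):

import transformers
from transformers import GenerationConfig

# Load-time: AutoConfig controls how the model itself is built
config = transformers.AutoConfig.from_pretrained(
    "mosaicml/mpt-7b-instruct", trust_remote_code=True
)
config.attn_config['attn_impl'] = 'torch'  # e.g. attention implementation

model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", config=config, trust_remote_code=True
)

# Inference-time: GenerationConfig controls decoding in generate()
generation_config = GenerationConfig.from_pretrained("mosaicml/mpt-7b-instruct")
generation_config.repetition_penalty = 1.1
# model.generate(**model_inputs, generation_config=generation_config)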

repetition_penalty (float, optional, defaults to 1.0) - The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. I can't quite tell from the paper whether a higher value means more penalty if 1.0 is no penalty.

Here's an extract from a different link (https://huggingface.co/transformers/v2.11.0/main_classes/model.html#transformers.PreTrainedModel.generate) with a slightly clearer explanation: "The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0." i.e. anything bigger than 1.0 adds a penalty for repetition.
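
For intuition, the penalty rescales the logits of tokens that have already appeared, as in the CTRL paper the docs cite (and transformers' RepetitionPenaltyLogitsProcessor): a positive logit is divided by the penalty and a negative one multiplied by it, so any value above 1.0 pushes already-seen tokens down. A toy sketch of that rule (not the library code):

import torch

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Rescale logits of previously seen tokens (CTRL-style)."""
    for token_id in seen_token_ids:
        score = logits[token_id]
        # dividing a positive logit (or multiplying a negative one) by
        # penalty > 1.0 makes that token less likely to be picked again
        logits[token_id] = score / penalty if score > 0 else score * penalty
    return logits

logits = torch.tensor([2.0, 0.5, -1.0])
print(apply_repetition_penalty(logits, seen_token_ids=[0, 2], penalty=1.2))
# tensor([1.6667, 0.5000, -1.2000]) - both seen tokens are pushed down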

Closing as stale

abhi-mosaic changed discussion status to closed
