Multi-turn input template
Hi team, thank you for the great work.
I'm wondering how to use this model to compute rewards for multi-turn conversations (i.e., what chat_template should be used to convert the messages into an input string).
For example,
```python
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"}
]
```
I'm also wondering how the rewards of multi-turn messages are modeled during training. For example, does it model the quality of the last assistant message only, or of all assistant messages?
Hi,
Thanks for your kind words.
For all kinds of messages, we directly format them with the `tokenizer.apply_chat_template()` function.
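For instance, a minimal sketch of that formatting step (the checkpoint name below is a placeholder; substitute the actual RM repo id):

```python
from transformers import AutoTokenizer

# Placeholder repo id; replace with the actual reward-model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-reward-model")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"},
]

# Render the whole conversation into the model's input string.
input_string = tokenizer.apply_chat_template(messages, tokenize=False)
print(input_string)
```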
In the RM training, the attention masks of the context (`messages[:-1]` in your example) were set to 0, and only the masks of the last assistant message (`messages[-1]`) were set to 1. That is, the model was trained to model the last assistant message given a single turn or multiple turns of context.
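If you want to build that kind of mask yourself for inspection, a rough sketch along these lines should work. This is only an illustration (not the actual training code), and it assumes the tokenized context is a prefix of the tokenized full conversation, which holds for most chat templates:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-reward-model")  # placeholder repo id

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"},
]

# Tokenize the full conversation and the context-only prefix.
full_ids = tokenizer.apply_chat_template(messages, tokenize=True)
context_ids = tokenizer.apply_chat_template(
    messages[:-1], tokenize=True, add_generation_prompt=True
)

# 0 over the context tokens, 1 over the tokens of the last assistant message.
mask = [0] * len(context_ids) + [1] * (len(full_ids) - len(context_ids))

# Sanity check on the prefix assumption.
assert len(mask) == len(full_ids)
```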
Hope this helps :)