Training method?
#3
by noamgat - opened
Hi,
I was wondering what training method you used to train the model; I didn't find it specified anywhere.
The model architecture is a simple linear layer that translates from the hidden dimension to 1 (reward score), correct?
If so, what loss did you use? Did you use a regression loss that aims for accepted -> 1 and rejected -> 0, or did you just try to maximize the margin between accepted and rejected (something like minimizing sigmoid(rejected_score - accepted_score))?
Are there any details on this?
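For reference, here is a minimal sketch of the two alternatives mentioned in the question; the tensor names and values are placeholders, not anything from the actual training code:

```python
import torch
import torch.nn.functional as F

# Placeholder scalar reward scores for the accepted and rejected responses.
accepted_score = torch.tensor([1.3, 0.2])
rejected_score = torch.tensor([0.4, 0.9])

# Option 1: regression loss pushing accepted -> 1 and rejected -> 0.
regression_loss = F.mse_loss(
    torch.cat([accepted_score, rejected_score]),
    torch.cat([torch.ones_like(accepted_score), torch.zeros_like(rejected_score)]),
)

# Option 2: pairwise margin-style loss on the score difference
# (commonly written as -log sigmoid(accepted - rejected)).
pairwise_loss = -F.logsigmoid(accepted_score - rejected_score).mean()
```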
Hi,
We are currently preparing our technical report, which will be released soon.
Regarding the questions above:
- Yes, the last layer is a linear transformation from dimension D to 1. Here, D represents the dimension of the last token's hidden state in the penultimate layer.
- We use the standard Bradley-Terry model (i.e., binary ranking loss) for reward modeling; see the sketch below.
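A rough PyTorch sketch of the setup described above (a hedged illustration, not the released training code; the class/function names and the hidden size are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Linear transformation from dimension D to 1 (the reward score)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, last_token_hidden: torch.Tensor) -> torch.Tensor:
        # last_token_hidden: (batch, D) hidden state of the last token,
        # taken from the penultimate layer of the base model.
        return self.score(last_token_hidden).squeeze(-1)

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Binary ranking loss: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random hidden states standing in for model outputs.
D = 4096  # assumed hidden size; depends on the base model
head = RewardHead(D)
chosen_h, rejected_h = torch.randn(2, D), torch.randn(2, D)
loss = bradley_terry_loss(head(chosen_h), head(rejected_h))
```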
chrisliu298 changed discussion status to closed