Training method?
#3
by noamgat - opened
Hi,
I was wondering what training method you used to train the model; I didn't find it specified anywhere.
The model architecture is a simple linear layer that translates from the hidden dimension to 1 (reward score), correct?
If so, what loss did you use? Did you use a regression loss that aims for accepted -> 1 and rejected -> 0, or did you just try to maximize the margin between accepted and rejected (something like minimizing sigmoid(rejected_score - accepted_score))?
Are there any details on this?
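For reference, here is a minimal sketch of the two alternatives mentioned in the question; the tensor names and values are placeholders, not anything from the actual training code:

```python
import torch
import torch.nn.functional as F

# Placeholder scalar reward scores for the accepted and rejected responses.
accepted_score = torch.tensor([1.3, 0.2])
rejected_score = torch.tensor([0.4, 0.9])

# Option 1: regression loss pushing accepted -> 1 and rejected -> 0.
regression_loss = F.mse_loss(
    torch.cat([accepted_score, rejected_score]),
    torch.cat([torch.ones_like(accepted_score), torch.zeros_like(rejected_score)]),
)

# Option 2: pairwise margin-style loss on the score difference
# (commonly written as -log sigmoid(accepted - rejected)).
pairwise_loss = -F.logsigmoid(accepted_score - rejected_score).mean()
```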
Hi,
We are currently preparing our technical report, which will be released soon.
Regarding the questions above:
- Yes, the last layer is a linear transformation from dimension D to 1. Here, D represents the dimension of the last token's hidden state in the penultimate layer.
- We use the standard Bradley-Terry model (i.e., binary ranking loss) for reward modeling; see the sketch below.
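A rough PyTorch sketch of the setup described above (a hedged illustration, not the released training code; the class/function names and the hidden size are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Linear transformation from dimension D to 1 (the reward score)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, last_token_hidden: torch.Tensor) -> torch.Tensor:
        # last_token_hidden: (batch, D) hidden state of the last token,
        # taken from the penultimate layer of the base model.
        return self.score(last_token_hidden).squeeze(-1)

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Binary ranking loss: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random hidden states standing in for model outputs.
D = 4096  # assumed hidden size; depends on the base model
head = RewardHead(D)
chosen_h, rejected_h = torch.randn(2, D), torch.randn(2, D)
loss = bradley_terry_loss(head(chosen_h), head(rejected_h))
```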
chrisliu298 changed discussion status to closed