Reason for norm_topk_prob=false?
#4 opened by sohampnow
Hi, what's the reason for choosing not to normalize the top-k probabilities?
It seems like a non-trivial decision, since the total weight assigned to the chosen experts sums to less than 1, whereas Mixtral's implementation does normalize after the top-k selection.
I couldn't find any ablations in the paper regarding this either.
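To illustrate what I mean, here's a minimal toy sketch of the two routing variants (not your actual modeling code, just my understanding of the difference):

```python
import torch
import torch.nn.functional as F

def route(router_logits: torch.Tensor, top_k: int, norm_topk_prob: bool):
    """Toy router: softmax over all experts, then keep the top-k weights."""
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, n_experts)
    topk_probs, topk_ids = torch.topk(probs, top_k, dim=-1)   # (tokens, top_k)
    if norm_topk_prob:
        # Mixtral-style: renormalize so the kept weights sum to 1
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    # with norm_topk_prob=False the unselected experts' mass is simply discarded
    return topk_probs, topk_ids

logits = torch.randn(1, 64)  # 1 token, 64 experts
for norm in (False, True):
    weights, _ = route(logits, top_k=8, norm_topk_prob=norm)
    print(f"norm_topk_prob={norm}: weights sum to {weights.sum().item():.3f}")
```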
Thanks for releasing the model and the detailed experiments in the paper btw! Super helpful!
Good point; it's something we didn't ablate, we just went with the megablocks default (https://github.com/databricks/megablocks/blob/75a2560b852407ab8d8a5957827a245b8b9edc60/megablocks/layers/arguments.py#L36). Could be interesting to quantify, though!
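One cheap way to get a first signal would be to re-evaluate the existing checkpoint with the flag flipped. A sketch, assuming the released config exposes a `norm_topk_prob` field as in the transformers MoE configs; `<model-id>` is a placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# <model-id> is a placeholder for the released checkpoint
config = AutoConfig.from_pretrained("<model-id>")
config.norm_topk_prob = True  # flip the routing normalization at inference time
model = AutoModelForCausalLM.from_pretrained("<model-id>", config=config)
# ...then run the usual eval suite and compare against the default (False)
```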
Thanks for the quick response! It would indeed be interesting to quantify.
sohampnow changed discussion status to closed