Reason for norm_topk_prob=false?

#4
by sohampnow - opened

Hi, what's the reason for choosing not to normalize the top-k probabilities?
It seems like a non-trivial decision, since the total weight assigned to the chosen experts is less than 1, whereas Mixtral's implementation does normalize the top-k probabilities.

I couldn't find any ablations in the paper regarding this either.
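For concreteness, here is a minimal sketch of what the flag changes, assuming a standard softmax-then-top-k router in PyTorch (this is illustrative, not the model's actual routing code):

```python
import torch
import torch.nn.functional as F

def route(logits: torch.Tensor, top_k: int, norm_topk_prob: bool):
    # Softmax over all experts, then keep the top-k probabilities per token.
    probs = F.softmax(logits, dim=-1)                 # (tokens, n_experts)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # (tokens, top_k)
    if norm_topk_prob:
        # Renormalize so the selected experts' weights sum to 1 (Mixtral-style).
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    # With norm_topk_prob=False the weights sum to < 1, since the probability
    # mass assigned to the unselected experts is simply dropped.
    return topk_probs, topk_idx

logits = torch.randn(2, 8)  # 2 tokens, 8 experts
w_unnorm, idx = route(logits, top_k=2, norm_topk_prob=False)
w_norm, _ = route(logits, top_k=2, norm_topk_prob=True)
print(w_unnorm.sum(dim=-1))  # < 1 per token
print(w_norm.sum(dim=-1))    # == 1 per token
```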

Thanks for releasing the model and the detailed experiments in the paper btw! Super helpful!

sohampnow changed discussion title from Reason for norm_topk_prob=false to Reason for norm_topk_prob=false?

Good point; it's something we didn't ablate, we just went with the megablocks default (https://github.com/databricks/megablocks/blob/75a2560b852407ab8d8a5957827a245b8b9edc60/megablocks/layers/arguments.py#L36). Could be interesting to quantify though!

Thanks for the quick response! Would indeed be interesting to quantify.

sohampnow changed discussion status to closed
