Reason for norm_topk_prob=false?
#4 opened by sohampnow
Hi, what's the reason for choosing not to normalize the top-k probabilities?
It seems like a non-trivial decision, since the total weight assigned to the chosen experts sums to less than 1, whereas Mixtral's implementation does normalize after the top-k selection.
I couldn't find any ablations in the paper regarding this either.
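To illustrate what I mean, here's a minimal toy sketch of the two routing variants (not your actual modeling code, just my understanding of the difference):

```python
import torch
import torch.nn.functional as F

def route(router_logits: torch.Tensor, top_k: int, norm_topk_prob: bool):
    """Toy router: softmax over all experts, then keep the top-k weights."""
    probs = F.softmax(router_logits, dim=-1)                  # (tokens, n_experts)
    topk_probs, topk_ids = torch.topk(probs, top_k, dim=-1)   # (tokens, top_k)
    if norm_topk_prob:
        # Mixtral-style: renormalize so the kept weights sum to 1
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    # with norm_topk_prob=False the unselected experts' mass is simply discarded
    return topk_probs, topk_ids

logits = torch.randn(1, 64)  # 1 token, 64 experts
for norm in (False, True):
    weights, _ = route(logits, top_k=8, norm_topk_prob=norm)
    print(f"norm_topk_prob={norm}: weights sum to {weights.sum().item():.3f}")
```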
Thanks for releasing the model and the detailed experiments in the paper btw! Super helpful!
Good point; it's something we didn't ablate, we just went with the megablocks default (https://github.com/databricks/megablocks/blob/75a2560b852407ab8d8a5957827a245b8b9edc60/megablocks/layers/arguments.py#L36). Could be interesting to quantify, though!
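One cheap way to get a first signal would be to re-evaluate the existing checkpoint with the flag flipped. A sketch, assuming the released config exposes a `norm_topk_prob` field as in the transformers MoE configs; `<model-id>` is a placeholder:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# <model-id> is a placeholder for the released checkpoint
config = AutoConfig.from_pretrained("<model-id>")
config.norm_topk_prob = True  # flip the routing normalization at inference time
model = AutoModelForCausalLM.from_pretrained("<model-id>", config=config)
# ...then run the usual eval suite and compare against the default (False)
```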
Thanks for the quick response! It would indeed be interesting to quantify.
sohampnow changed discussion status to closed