The paper shows an adversarial attack strategy in which a user sends malicious queries that can affect the responses to other users' queries in the same batch.
So if the same batch contains:
- User A: benign query
- User B: malicious query

the response for A might be altered! 😱
How is this possible? One approach is to fill the per-expert token buffers with adversarial data, thereby forcing the gating to route benign tokens to non-ideal experts or to drop them entirely (when expert capacity is finite) - see the toy sketch below.
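To make the mechanism concrete, here is a minimal toy sketch (my own illustration, not the paper's code) of top-1 routing with a hard per-expert capacity. `W_gate`, `route`, and all the sizes are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, capacity, d = 4, 2, 8
W_gate = rng.normal(size=(d, num_experts))        # toy gating weights

def route(batch):
    """Greedy top-1 routing with a finite per-expert buffer (first come, first served)."""
    logits = batch @ W_gate
    preferred = logits.argmax(axis=-1)
    load = np.zeros(num_experts, dtype=int)
    assignment = []
    for tok, e in enumerate(preferred):
        if load[e] < capacity:
            load[e] += 1
            assignment.append((tok, int(e)))
        else:
            assignment.append((tok, None))        # buffer full: token dropped
    return assignment

benign = rng.normal(size=(2, d))                  # user A's tokens
# The adversary crafts tokens aimed at the same expert the benign tokens prefer
# (in the black-box setting this target would be inferred by probing).
target_expert = int((benign @ W_gate).argmax(axis=-1)[0])
adversarial = rng.normal(size=(6, d))
adversarial += 5 * W_gate[:, target_expert]       # push gate logits toward that expert

# Adversarial tokens come first in the batch and exhaust the expert's buffer,
# so benign tokens that prefer the same expert are dropped.
batch = np.concatenate([adversarial, benign])
print(route(batch))
```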
This assumes the adversary has only black-box access to the model, but can observe the output logits and ensure their data is always grouped in the same batch as the victim's.
How to mitigate this?
- Randomize the batch order (and even run very sensitive queries twice)
- Use a large capacity slack
- Sample from the gate weights instead of taking the top-k (not great IMO, as that requires more memory for inference)

A rough sketch of the first and last mitigations follows.
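This is again my own toy code (the `mitigated_route` helper and its parameters are made up), showing how batch-order randomization and sampling from the gate distribution could look on top of the same kind of capacity-limited gating:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mitigated_route(batch, W_gate, capacity):
    """Capacity-limited routing with two mitigations applied."""
    num_experts = W_gate.shape[1]
    order = rng.permutation(len(batch))            # (1) randomize batch order
    probs = softmax(batch @ W_gate)
    load = np.zeros(num_experts, dtype=int)
    assignment = [None] * len(batch)               # None = dropped token
    for tok in order:
        e = rng.choice(num_experts, p=probs[tok])  # (2) sample expert from gate probs
        if load[e] < capacity:
            load[e] += 1
            assignment[tok] = int(e)
    return assignment
```

With the shuffle, the adversary can no longer guarantee their tokens reach the buffers before the victim's, and the sampling makes the routing of any individual token non-deterministic, so the same probing attack gives noisier feedback.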