---
license: apache-2.0
---
# Wtf is a MoEification?!
Turns out, you can slice up the individual MLP layers of a dense language model into even splits of experts.

What I did here (see the sketch after this list):
- Split the MLP projections (gate, up, down) into the total number of experts you want (in this case, I just went with 8 experts).
- Multiply the down-projection parameters by the total number of experts (so the magnitude of the activation outputs, when averaged linearly together, ends up equivalent to the original dense layer's).
- Initialize the router layers with zeroes, so expert usage is perfectly uniform by default and carries no unintentional biases from the usual random initialization.
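
Here is a minimal sketch of those three steps in PyTorch, assuming a LLaMA-style SwiGLU MLP with `gate_proj`/`up_proj`/`down_proj` linear layers and slicing along the intermediate dimension (the function and variable names are illustrative, not the exact script used here):

```python
import torch.nn as nn

def moeify_mlp(gate_proj: nn.Linear, up_proj: nn.Linear, down_proj: nn.Linear,
               num_experts: int = 8):
    """Slice one dense SwiGLU MLP into `num_experts` even expert MLPs."""
    hidden = gate_proj.in_features
    slice_dim = gate_proj.out_features // num_experts

    # Step 1: split each projection evenly along the intermediate dimension.
    # gate/up weights are [intermediate, hidden] -> chunk rows;
    # the down weight is [hidden, intermediate] -> chunk columns.
    gates = gate_proj.weight.chunk(num_experts, dim=0)
    ups = up_proj.weight.chunk(num_experts, dim=0)
    downs = down_proj.weight.chunk(num_experts, dim=1)

    experts = nn.ModuleList()
    for g, u, d in zip(gates, ups, downs):
        expert = nn.ModuleDict({
            "gate_proj": nn.Linear(hidden, slice_dim, bias=False),
            "up_proj": nn.Linear(hidden, slice_dim, bias=False),
            "down_proj": nn.Linear(slice_dim, hidden, bias=False),
        })
        expert["gate_proj"].weight.data.copy_(g)
        expert["up_proj"].weight.data.copy_(u)
        # Step 2: scale the down-projection by num_experts so the averaged
        # expert outputs reproduce the original dense sum.
        expert["down_proj"].weight.data.copy_(d * num_experts)
        experts.append(expert)

    # Step 3: zero-init the router so every expert gets an identical logit
    # (and therefore an identical routing weight) before any training.
    router = nn.Linear(hidden, num_experts, bias=False)
    nn.init.zeros_(router.weight)
    return experts, router
```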

As a result, the model behaves completely coherently when all 8 experts are activated (i.e., experts_per_tok is equal to 8).
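
(To see why: with a zero-initialized router, every expert's logit is 0, so the softmax gives each of the 8 experts a weight of 1/8. Assuming standard softmax routing, the combined output is

$$\sum_{i=1}^{8} \frac{1}{8}\,\bigl(8\,W^{\text{down}}_i\bigr)\,h_i \;=\; \sum_{i=1}^{8} W^{\text{down}}_i\,h_i,$$

which is exactly the original dense MLP's output, where $W^{\text{down}}_i$ are the scaled down-projection slices and $h_i$ the matching activation slices.)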

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/iudC2DgvaErI_rwe2Vxjf.png)

With 4 experts activated, it's... far less coherent.
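
(The same arithmetic hints at why, assuming softmax routing over just the selected experts: with 4 of the 8 active, each selected slice gets weight 1/4, so its contribution is $\frac{1}{4}\,(8\,W^{\text{down}}_i)\,h_i = 2\,W^{\text{down}}_i\,h_i$. Half of the intermediate features are missing entirely, and the surviving half come through at double magnitude.)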

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/l1-10MRdprnM4WrtMAbcm.png)

# Ok but why?

I am interested in the prospect of continuing to train this in such a way that it can naturally handle variable expert counts and learn to balance the features across them.
If this works, we could potentially teach the model to use less computation for tokens that are trivial to predict, while using more when necessary.
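
As a rough illustration of that training idea (entirely hypothetical, nothing like this is implemented here), one could randomize the active expert count each step so that no single experts_per_tok value gets baked into the weights:

```python
import random

# Hypothetical training-loop fragment; `dataloader`, `model`, `optimizer`,
# `num_experts`, and `set_experts_per_tok` are all assumed/illustrative names.
for batch in dataloader:
    # Sample a fresh active-expert count each step (1..8 here).
    k = random.randint(1, num_experts)
    set_experts_per_tok(model, k)  # assumed helper that updates every router's top-k
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```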