---
license: apache-2.0
---
# Wtf is a MoEification?!
Turns out, you can slice up the individual MLP layers of a dense language model into even splits of experts.

What I did here (a rough sketch of the weight surgery follows this list):
- Split each MLP projection (gate, up, and down) into the total number of experts you want (in this case, I just went with 8 experts).
- Multiply the down-projection weights by the total number of experts, so that the magnitude of the expert outputs, once they are averaged back together, matches the original dense MLP.
- Initialize the router layers with zeros, so expert usage is perfectly uniform by default and picks up no unintentional bias from the usual random initialization.

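Here's a minimal sketch of what that surgery looks like for one SwiGLU MLP layer, assuming PyTorch tensors and Mixtral-style expert shapes. `moeify_mlp` and the tensor names are my own illustration, not the exact script used for this checkpoint:

```python
import torch

def moeify_mlp(gate_w, up_w, down_w, num_experts=8):
    """Slice one dense SwiGLU MLP into `num_experts` equal experts.

    gate_w, up_w: [intermediate, hidden] -> row-sliced along the intermediate dim
    down_w:       [hidden, intermediate] -> column-sliced along the intermediate dim
    """
    gate_chunks = gate_w.chunk(num_experts, dim=0)
    up_chunks = up_w.chunk(num_experts, dim=0)
    down_chunks = down_w.chunk(num_experts, dim=1)

    experts = []
    for g, u, d in zip(gate_chunks, up_chunks, down_chunks):
        # Pre-scale the down-projection by the expert count so the router's
        # uniform 1/num_experts averaging cancels out and the combined expert
        # outputs match the original dense MLP.
        experts.append((g.clone(), u.clone(), d.clone() * num_experts))

    # Zero-initialized router: every expert gets the same logit, so routing
    # starts perfectly uniform instead of inheriting random-init biases.
    router_w = torch.zeros(num_experts, gate_w.shape[1])
    return experts, router_w
```

Applied per layer to a real checkpoint, the expert tensors and router weight would then be written into a Mixtral-style state dict.
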
As a result, the model behaves completely coherently when all 8 experts are activated (i.e., experts_per_tok is equal to 8).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/iudC2DgvaErI_rwe2Vxjf.png)

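A quick way to see why the all-expert case reproduces the dense model: with zero router logits, the softmax hands each of the 8 selected experts a weight of 1/8, which exactly cancels the ×8 pre-scaling of the down-projections, so the summed expert outputs equal the original dense MLP. A toy numerical check (my own sketch with made-up dimensions, not the real model's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, intermediate, n = 16, 64, 8
x = torch.randn(hidden)

# A random dense SwiGLU MLP.
gate_w = torch.randn(intermediate, hidden)
up_w = torch.randn(intermediate, hidden)
down_w = torch.randn(hidden, intermediate)
dense_out = down_w @ (F.silu(gate_w @ x) * (up_w @ x))

# Slice into n experts, pre-scaling each down-projection chunk by n.
gates = gate_w.chunk(n, dim=0)
ups = up_w.chunk(n, dim=0)
downs = [d * n for d in down_w.chunk(n, dim=1)]

# A zero-initialized router gives every expert a softmax weight of 1/n.
weights = torch.softmax(torch.zeros(n), dim=-1)
moe_out = sum(w * (d @ (F.silu(g @ x) * (u @ x)))
              for w, g, u, d in zip(weights, gates, ups, downs))

print(torch.allclose(dense_out, moe_out, atol=1e-3))  # True (up to float error)
```

Pre-scaling the weights (rather than patching the forward pass) also means a stock Mixtral-style implementation can run the checkpoint unmodified.
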
With 4 experts activated, it's... far less coherent. That's not too surprising: each token then only sees half of the original MLP's features, which the model was never trained to handle.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/l1-10MRdprnM4WrtMAbcm.png)

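If you want to reproduce that comparison, one way is to override the expert count at load time. This is a sketch that assumes the checkpoint is saved in the standard Mixtral-style format (the repo id below is a placeholder, and `num_experts_per_tok` is the usual transformers config key for this setting):

```python
from transformers import AutoConfig, AutoModelForCausalLM

repo = "kalomaze/MoEified-8x-model"  # placeholder id; substitute the real repo

config = AutoConfig.from_pretrained(repo)
config.num_experts_per_tok = 4  # try 8 (coherent) vs. 4 (much less so)
model = AutoModelForCausalLM.from_pretrained(repo, config=config)
```
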
# Ok but why?

I'm interested in continuing to train this in a way that lets it naturally handle variable expert counts and learn to balance its features across experts.
If this works, we could potentially teach the model to spend less computation on tokens that are trivial to predict, and more when it's actually needed.