# Usage
- Metharme format (Mistral works too but untested)
---
# Upscaled Tuning Experiment Write Up Thingy
My cute attempt at being a siyantis :3 uwu ~
## Conclusions (WIP)
- Upscaling can 'provide room' for further training.
- Training upscaled models will result in retaining more of the original model's performance & behavior.
- A 600MB dataset was nowhere near enough to stabilize the empty/duplicated layers. (The perturbed rate of change remained the same across epochs 1 & 2.)
- (Not related to upscaling) The first two layers are sus - their weights are wildly different from the original's. I wonder if we could recover smarts by merging those layers back in from the base, or whether those layers carry the most influence and must be preserved.
## What is the 39B Upscale?
https://huggingface.co/TheSkullery/BA-Zephyria-39b
```yaml
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 41]
    model: unsloth/Mistral-Small-Instruct-2409
- sources:
  - layer_range: [19, 41]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
- sources:
  - layer_range: [19, 41]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
- sources:
  - layer_range: [41, 55]
    model: unsloth/Mistral-Small-Instruct-2409
```
- Layers 0 to 18 are original
- Layers 19 to 41 are duplicated, zeroed out, and inserted in the middle twice
- Layers 42 to 54 are original
- The **down_proj** and **o_proj** weights in the duplicated slices have been nulled, so the model initially ignores the added layers; healing (further training) is required to 'unignore' them
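Why zeroing exactly those two matrices works: in a pre-norm transformer block, `o_proj` and `down_proj` are the last projections in the attention and MLP paths, and each path's output is *added* to the residual stream. Zero them and both contributions vanish, so a duplicated block starts out as an exact identity function. A toy sketch of the residual structure (not the real Mistral block):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A transformer block reduced to its residual structure."""
    def __init__(self, dim=8):
        super().__init__()
        self.attn_inner = nn.Linear(dim, dim)            # stands in for norm + attention
        self.o_proj = nn.Linear(dim, dim, bias=False)    # last matrix in the attention path
        self.mlp_inner = nn.Linear(dim, dim)             # stands in for norm + gate/up proj
        self.down_proj = nn.Linear(dim, dim, bias=False) # last matrix in the MLP path

    def forward(self, x):
        x = x + self.o_proj(self.attn_inner(x))    # attention residual add
        x = x + self.down_proj(self.mlp_inner(x))  # MLP residual add
        return x

block = ToyBlock()
# Scale o_proj and down_proj by 0.0, as the merge config does:
with torch.no_grad():
    block.o_proj.weight.zero_()
    block.down_proj.weight.zero_()

x = torch.randn(2, 8)
assert torch.equal(block(x), x)  # the zeroed block is a no-op at init
```

Training can then grow those projections away from zero, which is the 'healing' referred to above.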
```
[ Unique ][ Duplicated ][ Unique ]
0 ----------- 18 19 ------------ 41 42 ---------- 54
34.5% 41.8% 23.7%
```
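For bookkeeping: mergekit's `layer_range` is half-open like a Python slice, so `[19, 41]` selects layers 19-40. Counting the config's slices that way gives the final model 99 decoder layers, 44 of which are the zeroed duplicates:

```python
# Layer bookkeeping for the 39B config above (layer_range is half-open,
# so [19, 41] selects layers 19-40).
slices = [(0, 41), (19, 41), (19, 41), (41, 55)]
lengths = [end - start for start, end in slices]
total = sum(lengths)
print(lengths, total)  # [41, 22, 22, 14] 99
dup = lengths[1] + lengths[2]
print(f"zeroed duplicates: {dup}/{total} = {dup / total:.1%} of the final depth")  # 44/99 = 44.4%
```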
---
# How did the finetune go?
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/uo_GtNKPZ_KaWJCoAcB92.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/SmloL6rPXe9jJuyQNpHKG.png)
---
# Weight Difference Visualization
- Control A: Nemo x Rocinante
- Control B: Small x Cydonia
- Control C: Upscaled Nemo x Theia
- Sample A: 39B Upscale x Tunguska 1 Epoch
- Sample B: 39B Upscale x Tunguska 2 Epochs
- Sample C: Tunguska 1 Epoch x Tunguska 2 Epochs
**Note the layer sequence and other labels, since they may be unreadable for the 39B.**
**Visualization issue: some layer types like `input_layernorm` look unchanged because the highest value (usually layer 0) diluted the entire heatmap.**
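Not necessarily the script used for these heatmaps, but a minimal sketch of how such per-layer weight diffs can be computed from two checkpoints' state dicts (same architecture assumed). The per-row normalization at the end is one way around the layer-0 dilution noted above:

```python
import re
from collections import defaultdict

import torch

def layer_diffs(sd_a, sd_b, pattern=r"layers\.(\d+)\.(.+)\.weight"):
    """Mean absolute weight difference per (layer type, layer index)."""
    diffs = defaultdict(dict)
    for name, wa in sd_a.items():
        m = re.search(pattern, name)
        if m is None or name not in sd_b:
            continue
        idx, kind = int(m.group(1)), m.group(2)
        diffs[kind][idx] = (wa.float() - sd_b[name].float()).abs().mean().item()
    return diffs

def normalize_rows(diffs):
    """Scale each layer type to [0, 1] so one huge layer can't dilute the rest."""
    return {kind: {i: v / (max(row.values()) or 1.0) for i, v in row.items()}
            for kind, row in diffs.items()}

# Toy demo with fake checkpoints:
sd_a = {f"model.layers.{i}.self_attn.o_proj.weight": torch.zeros(4, 4) for i in range(3)}
sd_b = {k: v + i for i, (k, v) in enumerate(sd_a.items())}
d = layer_diffs(sd_a, sd_b)
print(d["self_attn.o_proj"])  # {0: 0.0, 1: 1.0, 2: 2.0}
```

For real models, the state dicts would come from `safetensors` or `model.state_dict()`, and the nested dicts plot directly as a heatmap (layer types as rows, layer indices as columns).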
## Control A (Nemo 12B & Rocinante 12B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/EZN8Ci2_vAGmdq0WUyrpN.png)
## Control B (Small 22B & Cydonia 22B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/xdH_7fy9HuhSzaSE2-h4X.png)
## Control C (Upscaled Nemo 21B & Theia 21B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/RTvz5g8_fd5g8ZMLmawlv.png)
## Sample A: Tunguska 39B 1st Epoch vs. its 39B base
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/X3-bHyQg03-QvZFvOhGp7.png)
## Sample B: Tunguska 39B 2nd Epoch vs. its 39B base
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/-dRSeXmPXdE3_g67iKT0K.png)
## Sample C: Tunguska 39B 2nd Epoch vs. 1st Epoch
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cjKf37TrSJHmq0S0_PZyE.png)
# Glossary
WIP
# Impressions
- Using the same LR on a larger model should have been more destructive. In practice, the 'empty, extra' duplicate layers must have saved the upscale from becoming dumber than Cydonia, letting the larger model preserve more of its smarts.
- The upscale clearly has an effect on training. Control A & B show a smooth gradient unlike Control C & Sample A, B, C where the duplicated layers perturb the heatmap.
- The duplicated layers seem to be playing catch up with the original layers. The change difference on the duplicate layers in Sample B show that they are more responsive to training than the original layers. There are no indications that the duplicate layers are 'slowing down', given that the gradient pattern remains the same in Sample A, B, and especially C (where C can be considered the rate of change).
- In some layer types, the change in the first two layers is overwhelmingly large compared to the rest, to the point that it dilutes the entire heatmap!
- The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia where latter layers were completely unchanged.
- The duplicated layers in `o_proj` are less sensitive for some reason.
# Eureka?
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)
**Top-to-Bottom Linear Gradient Observations**
- Take note of a few things:
  - Top layers = ending layers (nearer the output)
  - Bottom layers = starting layers (nearer the input)
- Training a normal, non-upscaled model affects the top layers first and slowly descends to the bottom layers over time.
- Training an upscaled model with two slices of duplicate layers does two things:
  - Each slice of duplicated layers has its own gradient.
  - There's a 'ceiling value' for the duplicated layers in these slices.
- Even when Tunguska's slices of duplicated layers are nearly saturated, the resulting model remains coherent and even performant.
- Takeaways:
  - These slices of layers are more connected to each other than to the rest of the model.
  - [Question] Does this mean that the **original layer** before the slice is the one holding the whole duplicated slice together?
  - [Question] What if we interleave original and duplicate layers? Will that result in a more balanced, responsive upscale? (See Proposed Upscale Technique at the bottom.)
  - Saturating these duplicated layers MIGHT be a good goal to pursue.
# Further Experimentation
Given how the duplicated layers seem to have a stabilizing effect, it raises the question:
### What if we duplicate only ONE layer? What about five layers?
- Will fewer empty layers dampen the stabilizing effect?
- Will the few empty layers get 'filled' quickly? Will the 600MB dataset be enough?
- Will there be a greater concentration of weight change in the duplicate layers? Or will it spill out to the original layers?
### What if we train it harder via learning rate or epoch?
### What if we use LoRA?
### How would it perform and look if we froze the original layers?
### Can you replicate this effect on normal models by freezing layers?
### We've hypothesized so far that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can they be filled with just a few steps? In other words, can a single update (or a few) reconnect the duplicated layers?
- Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
- Or maybe this is false given the top-bottom gradient observation.
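For the freezing experiments above, one way to freeze layers by index, assuming HF-style `layers.N.` parameter names. The `ORIGINAL` index set is derived from the 39B config (passthrough concatenates slices in order, so originals land at final indices 0-40 and 85-98, the zeroed duplicates at 41-84):

```python
import torch.nn as nn

# Final-model indices of the original (non-duplicated) layers,
# derived from the 39B merge config's slice order.
ORIGINAL = set(range(0, 41)) | set(range(85, 99))

def freeze_original_layers(model: nn.Module, original=ORIGINAL):
    """Set requires_grad=False on every parameter inside an original layer."""
    frozen = 0
    for name, param in model.named_parameters():
        parts = name.split(".")
        if "layers" in parts and int(parts[parts.index("layers") + 1]) in original:
            param.requires_grad = False
            frozen += 1
    return frozen

# Toy stand-in whose parameter names mimic the "layers.N...." layout:
class Toy(nn.Module):
    def __init__(self, n_layers=99):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(2, 2) for _ in range(n_layers))

toy = Toy()
print(freeze_original_layers(toy))                     # 110 (55 layers x 2 params each)
print(sum(p.requires_grad for p in toy.parameters()))  # 88 trainable params left
```

The inverse experiment (freeze the duplicates, train only the originals) is the same call with the complementary index set.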
# Proposed Upscale Technique
```yaml
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 19] # layers 0-18 (layer_range is half-open)
    model: unsloth/Mistral-Small-Instruct-2409
# Original L19
- sources:
  - layer_range: [19, 20]
    model: unsloth/Mistral-Small-Instruct-2409
# Dupe A of L19
- sources:
  - layer_range: [19, 20]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
# Dupe B of L19
- sources:
  - layer_range: [19, 20]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
# Original L20
- sources:
  - layer_range: [20, 21]
    model: unsloth/Mistral-Small-Instruct-2409
# Dupe A of L20
- sources:
  - layer_range: [20, 21]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
# Dupe B of L20
- sources:
  - layer_range: [20, 21]
    model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      scale:
      - filter: o_proj
        value: 0.0
      - filter: down_proj
        value: 0.0
      - value: 1.0
# ... REPEAT UNTIL [40, 41]
- sources:
  - layer_range: [41, 55]
    model: unsloth/Mistral-Small-Instruct-2409
```
```go
0 = original
X = duplicate
Previous Technique = 000000000000000000XXXXXXXXXXXXXXXXXXX0000000000
Proposed Technique = 00000000000XX0XX0XX0XX0XX0XX0XX0XX0XX0000000000
``` |