Commit 3cd2fbd by TheDrummer
Parent: 2168604

Update README.md

README.md CHANGED
@@ -110,8 +110,9 @@ WIP
 - The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia where latter layers were completely unchanged.
 - The duplicated layers in `o_proj` are less sensitive for some reason.

-#
+# Eureka?
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)
+**Top-to-Bottom Linear Gradient Observations**
 - Take note of a few things
 - Top layers = Ending layers (nearer to output)
 - Bottom layers = Starting layers (nearer to input)
@@ -120,7 +121,11 @@ WIP
 - The duplicated slices EACH have their own gradient.
 - There's a 'ceiling value' for each of these duplicated slices.
 - Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
--
+- Takeaways
+- These slices of layers are more connected to each other than to the model's entirety.
+- [Question] Does this mean that the **original layer** before the slice is the one holding that whole slice together?
+- [Question] What if we interleave original and duplicate layers together? Will that result in a more balanced upscale?
+- Saturating these duplicated layers MIGHT be a good goal to pursue.

 # Further Experimentation
 Given how the duplicated layers seem to have a stabilizing effect, it begs the question:
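
One of the [Question] items in the takeaways asks whether interleaving original and duplicate layers would give a more balanced upscale. The sketch below only illustrates the difference in layer ordering between the two layouts. It is a minimal illustration, not how Tunguska was built: the function names `block_upscale` and `interleaved_upscale`, the toy layer count, and the duplicated range are all hypothetical, and the README does not state the actual slice boundaries.

```python
def block_upscale(n_layers, start, end):
    """Source-layer order when layers [start, end) are duplicated as one
    contiguous slice placed right after the original slice.
    (Hypothetical helper for illustration only.)"""
    return (
        list(range(end))            # originals up to the end of the slice
        + list(range(start, end))   # duplicated slice, kept together
        + list(range(end, n_layers))  # remaining originals
    )


def interleaved_upscale(n_layers, start, end):
    """Source-layer order when each layer in [start, end) is immediately
    followed by its own duplicate. (Hypothetical helper.)"""
    order = []
    for i in range(n_layers):
        order.append(i)
        if start <= i < end:
            order.append(i)  # duplicate sits directly after its original
    return order


# Toy example: 6 layers, duplicating layers 2 and 3.
block = block_upscale(6, 2, 4)
interleaved = interleaved_upscale(6, 2, 4)
print(block)
print(interleaved)
```

Both layouts produce the same number of layers. The difference is whether each duplicate's neighbours are other duplicates (block layout) or its own original (interleaved layout), which is the property the question is probing.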