TheDrummer committed
Commit 3cd2fbd
1 Parent(s): 2168604

Update README.md

Files changed (1)
  1. README.md +7 -2
README.md CHANGED
@@ -110,8 +110,9 @@ WIP
  - The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia where latter layers were completely unchanged.
  - The duplicated layers in `o_proj` are less sensitive for some reason.
 
- # [Eureka?] Top-to-Bottom Linear Gradient Observations
+ # Eureka?
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)
+ **Top-to-Bottom Linear Gradient Observations**
  - Take note of a few things
  - Top layers = Ending layers (nearer to output)
  - Bottom layers = Starting layers (nearer to input)
@@ -120,7 +121,11 @@ WIP
  - The duplicated slices EACH have their own gradient.
  - There's a 'ceiling value' for each of these duplicated slices.
  - Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
- - Takeaway? Saturating these duplicated layers MIGHT be a good goal to pursue.
+ - Takeaways
+ - These slices of layers are more connected to each other than to the model's entirety.
+ - [Question] Does this mean that the **original layer** before the slice is the one holding that whole slice together?
+ - [Question] What if we interleave original and duplicate layers together? Will that result in a more balanced upscale?
+ - Saturating these duplicated layers MIGHT be a good goal to pursue.
 
  # Further Experimentation
  Given how the duplicated layers seem to have a stabilizing effect, it begs the question:
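As a reference for the gradient plot and the layer-sensitivity notes in the hunks above, here is a minimal sketch of how the per-layer deltas could be measured: load the raw layer-duplicated upscale and the finetuned model, then take the mean absolute difference per layer for one weight type. The model ids and the Llama/Mistral-style parameter names are assumptions, not the actual analysis script; substitute the real upscale and finetune checkpoints.

```python
# Minimal sketch (assumed setup, not the exact analysis script): compare a
# finetuned upscale against the raw layer-duplicated merge it started from.
# BASE_ID / TUNED_ID are placeholders; key names assume a Llama/Mistral-style
# decoder ("model.layers.{i}...").
import torch
from transformers import AutoModelForCausalLM

BASE_ID = "path/to/raw-upscale"        # placeholder: merge before finetuning
TUNED_ID = "path/to/finetuned-upscale" # placeholder: e.g. the Tunguska finetune

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.bfloat16)
base_sd, tuned_sd = base.state_dict(), tuned.state_dict()

# Swap the suffix to inspect other layer types, e.g.
# "post_attention_layernorm.weight" or "mlp.down_proj.weight".
SUFFIX = "self_attn.o_proj.weight"

deltas = []
for i in range(base.config.num_hidden_layers):
    key = f"model.layers.{i}.{SUFFIX}"
    d = (tuned_sd[key].float() - base_sd[key].float()).abs().mean().item()
    deltas.append(d)
    print(f"layer {i:3d}  mean |delta| = {d:.6f}")

# Plotting `deltas` from layer 0 upward is what should reveal the per-slice
# gradients and the 'ceiling value' each duplicated slice runs into.
```

Layer 0 is the bottom (input side) and the last index is the top (output side), matching the axis convention in the notes above.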
 
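On the interleaving question raised in the new takeaways: the difference between the usual block-style duplication and an interleaved layout is easiest to see as a layer-index plan. The sketch below is illustrative only; the 40-layer count and the duplicated [16, 32) range are made up, not the actual Tunguska recipe, and either order would still have to be realized with a passthrough-style merge (e.g. mergekit slices).

```python
# Illustrative sketch of two depth-upscale layouts. Numbers are layer indices
# of the donor model; the 40-layer count and the [16, 32) duplicated range are
# placeholders, not the actual recipe.

def block_duplicate(num_layers: int, start: int, end: int) -> list[int]:
    """Repeat layers [start, end) as one contiguous block after they first appear."""
    return list(range(0, end)) + list(range(start, end)) + list(range(end, num_layers))

def interleave_duplicate(num_layers: int, start: int, end: int) -> list[int]:
    """Follow each layer in [start, end) immediately with its own copy."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if start <= i < end:
            order.append(i)  # duplicate sits right next to its original
    return order

if __name__ == "__main__":
    print(block_duplicate(40, 16, 32))       # 0..15, 16..31, 16..31 again, 32..39
    print(interleave_duplicate(40, 16, 32))  # 0..15, 16, 16, 17, 17, ..., 31, 31, 32..39
```

If the original layer in front of each slice really is what holds the slice together, the interleaved layout puts every duplicate directly behind its anchor, which would be one way to test that question.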