TheDrummer committed
Commit 3cd2fbd
1 Parent(s): 2168604

Update README.md

Files changed (1)
  1. README.md +7 -2
README.md CHANGED
@@ -110,8 +110,9 @@ WIP
  - The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia where latter layers were completely unchanged.
  - The duplicated layers in `o_proj` are less sensitive for some reason.
 
- # [Eureka?] Top-to-Bottom Linear Gradient Observations
+ # Eureka?
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)
+ **Top-to-Bottom Linear Gradient Observations**
  - Take note of a few things
  - Top layers = Ending layers (nearer to output)
  - Bottom layers = Starting layers (nearer to input)
@@ -120,7 +121,11 @@ WIP
  - The duplicated slices EACH have their own gradient.
  - There's a 'ceiling value' for each of these duplicated slices.
  - Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
- - Takeaway? Saturating these duplicated layers MIGHT be a good goal to pursue.
+ - Takeaways
+ - These slices of layers are more connected to each other than to the model's entirety.
+ - [Question] Does this mean that the **original layer** before the slice is the one holding that whole slice together?
+ - [Question] What if we interleave original and duplicate layers together? Will that result in a more balanced upscale?
+ - Saturating these duplicated layers MIGHT be a good goal to pursue.
 
  # Further Experimentation
  Given how the duplicated layers seem to have a stabilizing effect, it begs the question:
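As a reference for the gradient plot and the layer-sensitivity notes in the hunks above, here is a minimal sketch of how the per-layer deltas could be measured: load the raw layer-duplicated upscale and the finetuned model, then take the mean absolute difference per layer for one weight type. The model ids and the Llama/Mistral-style parameter names are assumptions, not the actual analysis script; substitute the real upscale and finetune checkpoints.

```python
# Minimal sketch (assumed setup, not the exact analysis script): compare a
# finetuned upscale against the raw layer-duplicated merge it started from.
# BASE_ID / TUNED_ID are placeholders; key names assume a Llama/Mistral-style
# decoder ("model.layers.{i}...").
import torch
from transformers import AutoModelForCausalLM

BASE_ID = "path/to/raw-upscale"        # placeholder: merge before finetuning
TUNED_ID = "path/to/finetuned-upscale" # placeholder: e.g. the Tunguska finetune

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.bfloat16)
base_sd, tuned_sd = base.state_dict(), tuned.state_dict()

# Swap the suffix to inspect other layer types, e.g.
# "post_attention_layernorm.weight" or "mlp.down_proj.weight".
SUFFIX = "self_attn.o_proj.weight"

deltas = []
for i in range(base.config.num_hidden_layers):
    key = f"model.layers.{i}.{SUFFIX}"
    d = (tuned_sd[key].float() - base_sd[key].float()).abs().mean().item()
    deltas.append(d)
    print(f"layer {i:3d}  mean |delta| = {d:.6f}")

# Plotting `deltas` from layer 0 upward is what should reveal the per-slice
# gradients and the 'ceiling value' each duplicated slice runs into.
```

Layer 0 is the bottom (input side) and the last index is the top (output side), matching the axis convention in the notes above.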
 
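On the interleaving question raised in the new takeaways: the difference between the usual block-style duplication and an interleaved layout is easiest to see as a layer-index plan. The sketch below is illustrative only; the 40-layer count and the duplicated [16, 32) range are made up, not the actual Tunguska recipe, and either order would still have to be realized with a passthrough-style merge (e.g. mergekit slices).

```python
# Illustrative sketch of two depth-upscale layouts. Numbers are layer indices
# of the donor model; the 40-layer count and the [16, 32) duplicated range are
# placeholders, not the actual recipe.

def block_duplicate(num_layers: int, start: int, end: int) -> list[int]:
    """Repeat layers [start, end) as one contiguous block after they first appear."""
    return list(range(0, end)) + list(range(start, end)) + list(range(end, num_layers))

def interleave_duplicate(num_layers: int, start: int, end: int) -> list[int]:
    """Follow each layer in [start, end) immediately with its own copy."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if start <= i < end:
            order.append(i)  # duplicate sits right next to its original
    return order

if __name__ == "__main__":
    print(block_duplicate(40, 16, 32))       # 0..15, 16..31, 16..31 again, 32..39
    print(interleave_duplicate(40, 16, 32))  # 0..15, 16, 16, 17, 17, ..., 31, 31, 32..39
```

If the original layer in front of each slice really is what holds the slice together, the interleaved layout puts every duplicate directly behind its anchor, which would be one way to test that question.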