DUS procedure

#4
by meigami - opened

I am very interested in how you implemented the DUS procedure.
Any information or guidance you could provide would be invaluable to me.

@meigami Sure. In my tests, the best results came when every step removes 1/4 of the model's total layers from each copy.

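  1. Create two copies of the model.
  2. Remove the top 1/4 of the layers (the last ones) from the first copy and the bottom 1/4 (the first ones) from the second copy.
  3. Merge the two copies.
  4. Run SFT to recover performance after the merge.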
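
To make step 2 concrete, here is a minimal sketch that computes which layers each copy keeps. The function name `dus_layer_ranges` and its `drop_divisor` parameter are made up for illustration, and the ranges are assumed to be half-open (start included, end excluded), as the configs in this thread suggest.

```python
def dus_layer_ranges(num_layers: int, drop_divisor: int = 4):
    """Return the layer ranges kept by the two copies in one DUS step.

    Hypothetical helper for illustration. Ranges are half-open
    [start, end), which is an assumption based on the configs here.
    """
    if num_layers % drop_divisor != 0:
        raise ValueError("num_layers must be divisible by drop_divisor")
    dropped = num_layers // drop_divisor
    first_copy = (0, num_layers - dropped)  # top (last) layers removed
    second_copy = (dropped, num_layers)     # bottom (first) layers removed
    return first_copy, second_copy


# A 48-layer model with 1/4 of the layers removed from each copy.
print(dus_layer_ranges(48))  # ((0, 36), (12, 48))
```

The two ranges overlap in the middle, so the merged model repeats the middle layers and ends up deeper than the original.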

mergekit config for GALAXY. The base model is SOLAR (48 layers, so 1/4 of the total is 12 layers):

dtype: bfloat16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 36]
    model: "upstage_SOLAR-10.7B-v1.0"
- sources:
  - layer_range: [12, 48]
    model: "upstage_SOLAR-10.7B-v1.0"
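
If you want to generate a config like the one above instead of writing it by hand, something along these lines should work. It is only a sketch: `render_passthrough_config` is a hypothetical helper, and it emits only the fields that appear in the config above.

```python
def render_passthrough_config(model: str, num_layers: int, dropped: int) -> str:
    """Build a two-slice passthrough config as YAML text.

    Hypothetical helper; it mirrors the field layout of the config above
    and does not validate anything against mergekit itself.
    """
    lines = [
        "dtype: bfloat16",
        "merge_method: passthrough",
        "slices:",
        "- sources:",
        f"  - layer_range: [0, {num_layers - dropped}]",
        f'    model: "{model}"',
        "- sources:",
        f"  - layer_range: [{dropped}, {num_layers}]",
        f'    model: "{model}"',
    ]
    return "\n".join(lines) + "\n"


config = render_passthrough_config("upstage_SOLAR-10.7B-v1.0", 48, 12)
print(config)
```

The printed text can be saved to a file and passed to mergekit's `mergekit-yaml` command together with an output directory.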

Starting from GALAXY, the same steps were applied to create NEBULA (GALAXY has 72 layers, so 1/4 of the total is now 18 layers):

dtype: bfloat16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 54]
    model: "GALAXY-XB-v0.3"
- sources:
  - layer_range: [18, 72]
    model: "GALAXY-XB-v0.3"
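
As a sanity check on the layer counts, here is a small sketch of how depth grows from step to step. `layers_after_step` is a hypothetical name, and the arithmetic assumes each copy drops exactly 1/4 of its layers before the two are stacked.

```python
def layers_after_step(num_layers: int) -> int:
    """Layer count after one step: two copies, each missing 1/4 of its layers."""
    dropped = num_layers // 4
    return 2 * (num_layers - dropped)


solar = 48                          # layers in the base model
galaxy = layers_after_step(solar)   # after the first merge
nebula = layers_after_step(galaxy)  # after the second merge
print(solar, galaxy, nebula)  # 48 72 108
```

So each step multiplies the depth by 1.5, which is why the 1/4 figure changes from 12 to 18 layers between the two configs.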

The SFT training data should be similar to the original training data (I used a subset of the SOLAR training data).
