DUS procedure
#4, opened by meigami
I am very interested in how you implemented the DUS procedure.
Any information or guidance you could provide would be invaluable to me.
@meigami Sure. In my tests, the best results came from removing 1/4 of the total layers at every step:
- create two copies of the model
- remove the top (last) 1/4 of the layers from the first copy and the bottom (first) 1/4 of the layers from the second copy
- merge the two copies
- run SFT to recover after the merge
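The slicing rule in the steps above can be sketched in a few lines of Python. This is only an illustration: `dus_layer_ranges` and `drop_fraction` are hypothetical names of mine, and I am assuming the ranges are half-open (`[start, end)`), which matches the mergekit files in this thread.

```python
def dus_layer_ranges(num_layers: int, drop_fraction: float = 0.25):
    """Return the two [start, end) layer ranges for one DUS step.

    Hypothetical helper, not part of mergekit.
    """
    drop = int(num_layers * drop_fraction)  # layers removed from each copy
    first = (0, num_layers - drop)          # first copy: drop the top (last) layers
    second = (drop, num_layers)             # second copy: drop the bottom (first) layers
    return first, second


for n in (48, 72):
    first, second = dus_layer_ranges(n)
    merged = (first[1] - first[0]) + (second[1] - second[0])
    print(n, first, second, merged)
# 48 (0, 36) (12, 48) 72
# 72 (0, 54) (18, 72) 108
```

Each step keeps 3/4 of the layers from each copy, so the merged model has 1.5x the layers of its input (48 to 72, then 72 to 108).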
mergekit config for GALAXY. The base model is SOLAR (48 layers, so 1/4 of the total is 12 layers):
dtype: bfloat16
merge_method: passthrough
slices:
  - sources:
      - layer_range: [0, 36]
        model: "upstage_SOLAR-10.7B-v1.0"
  - sources:
      - layer_range: [12, 48]
        model: "upstage_SOLAR-10.7B-v1.0"
Starting from GALAXY, the same steps were applied to create NEBULA (GALAXY has 72 layers, so 1/4 of the total is now 18 layers):
dtype: bfloat16
merge_method: passthrough
slices:
  - sources:
      - layer_range: [0, 54]
        model: "GALAXY-XB-v0.3"
  - sources:
      - layer_range: [18, 72]
        model: "GALAXY-XB-v0.3"
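The two configs in this thread differ only in the model name and the layer counts, so the file for a further step could be generated rather than written by hand. A minimal sketch, assuming the same passthrough layout as above; `render_dus_config` is a hypothetical helper of mine, not a mergekit function, and it only builds the text.

```python
def render_dus_config(model: str, num_layers: int, drop: int) -> str:
    """Build a passthrough mergekit config for one DUS step as YAML text.

    Hypothetical helper: `drop` is the number of layers removed from each copy.
    """
    lines = [
        "dtype: bfloat16",
        "merge_method: passthrough",
        "slices:",
    ]
    # First copy loses its top `drop` layers, second copy its bottom `drop` layers.
    for start, end in ((0, num_layers - drop), (drop, num_layers)):
        lines += [
            "  - sources:",
            f"      - layer_range: [{start}, {end}]",
            f'        model: "{model}"',
        ]
    return "\n".join(lines) + "\n"


config_text = render_dus_config("GALAXY-XB-v0.3", num_layers=72, drop=18)
print(config_text)
```

The printed text can be saved to a file and passed to mergekit's `mergekit-yaml` command together with an output directory.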
The SFT training data should be similar to the original training data (I used a subset of the SOLAR training data).