DavidGF committed on
Commit 9294a74
1 Parent(s): 26d5d6c

Update README.md

Files changed (1)
  1. README.md +6 -3
README.md CHANGED
@@ -49,7 +49,8 @@ Introducing **SauerkrautLM-1.5b** – our Sauerkraut version of the powerful [Qw
  ## Training Procedure
  This model is a demo intended to showcase the potential of resource-efficient training of large language models using Spectrum CPT. Here's a brief on the procedure:
 
- **Continuous Pre-training (CPT) on German Data**: Utilizing Spectrum by Eric Hartford, Lucas Atkins, Fernando Fernandes Neto, and David Golchinfar, the model targeted 25% of its layers during training. This approach allowed significant resource savings:
+ **Continuous Pre-training (CPT) on German Data**:
+ Utilizing Spectrum by Eric Hartford, Lucas Atkins, Fernando Fernandes Neto, and David Golchinfar, the model targeted 25% of its layers during training. This approach allowed significant resource savings:
  Spectrum with 25% layer targeting consumed 309.78GB at a batch size of 2048.
  Full Fine-tuning targeting 100% of layers used 633.55GB at the same batch size.
  Using Spectrum, we enhanced the German language capabilities of the Qwen2-1.5B model via CPT while achieving substantial resource savings.
@@ -60,8 +61,10 @@ In the German Rag evaluation, it is on par with 8 billion parameter models and,
 
  Despite the large volume of German CPT data, the model competes well against the Qwen2-1.5B-Instruct model and performs significantly better in German.
 
- **Post-CPT Training**: The model underwent 3 epochs of Supervised Fine-Tuning (SFT) with 700K samples.
- **Further Steps**: The model was aligned with Direct Preference Optimization (DPO) using 70K samples.
+ **Post-CPT Training**:
+ The model underwent 3 epochs of Supervised Fine-Tuning (SFT) with 700K samples.
+ **Further Steps**:
+ The model was aligned with Direct Preference Optimization (DPO) using 70K samples.
 
  ## Objective and Results
 
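The memory figures quoted in the diff (309.78GB with 25% layer targeting vs. 633.55GB for full fine-tuning at batch size 2048) correspond to roughly a 51% saving. As a rough illustration of the idea behind this kind of selective-layer CPT, the sketch below freezes a Qwen2-1.5B checkpoint and unfreezes only about 25% of its decoder blocks. It is not the actual Spectrum implementation (Spectrum selects layers via a signal-to-noise analysis); the every-fourth-layer rule and the model ID are illustrative assumptions.

```python
# Illustrative sketch only: freeze all parameters, then unfreeze ~25% of the
# transformer blocks so that only those layers receive gradient updates.
# Spectrum itself ranks layers by signal-to-noise ratio; here we simply take
# every fourth block to keep the example self-contained.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze roughly 25% of the decoder blocks (every 4th layer in this sketch).
for idx, block in enumerate(model.model.layers):
    if idx % 4 == 0:
        for param in block.parameters():
            param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.1%} of {total:,}")
```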
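The post-CPT pipeline described above is SFT followed by DPO alignment. For readers unfamiliar with the DPO step, here is a minimal, self-contained sketch of the DPO objective on preference pairs; the beta value and toy inputs are assumptions for illustration, and this is not the training code used for SauerkrautLM-1.5b.

```python
# Illustrative sketch of the DPO objective used in the alignment step.
# Inputs are summed log-probabilities of chosen/rejected responses under the
# policy being trained and under a frozen reference model; beta is assumed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response by a larger margin
    # than the reference model does, scaled by beta.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    batch = 4
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(loss.item())
```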