arcee-ai/SuperNova-Medius

@@ -18,17 +18,23 @@ SuperNova-Medius is designed to excel in a variety of business use cases, includ
 The development of SuperNova-Medius involved a sophisticated multi-teacher, cross-architecture distillation process, with the following key steps:
-1. **Logit Distillation from Llama-3.1-405B-Instruct**:
-   - We distilled the logits of Llama-3.1-405B-Instruct to Qwen2.5-14B using KL-divergence as the loss function. This allowed us to capture the nuanced distribution of Llama's outputs while adapting them to Qwen's architecture.
-2. **Logit and Hidden State Distillation from Qwen2.5-72B-Instruct**:
-   - Further distillation was performed using a combination of logit and hidden state distillation from Qwen2.5-72B-Instruct to ensure that SuperNova-Medius inherited the strong instruction-following capabilities and domain-specific knowledge of Qwen2.5.
-3. **Cross-Architecture Vocabulary Alignment**:
-   - Using `mergekit-tokensurgeon`, we aligned the vocabularies and hidden states of both teacher models, allowing for seamless integration of knowledge across the different architectures. This enabled SuperNova-Medius to effectively combine the strengths of both models.
-4. **Final Fusion and Fine-Tuning**:
-   - After aligning the vocabularies, a final fusion and fine-tuning step was conducted, using a specialized dataset from [EvolKit](https://github.com/arcee-ai/EvolKit) to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
+1. **Logit Distillation from Llama 3.1 405B**:
+   - We distilled the logits of Llama 3.1 405B using an offline approach.
+   - The top K logits for each token were stored to capture most of the probability mass while managing storage requirements.
+2. **Cross-Architecture Adaptation**:
+   - Using `mergekit-tokensurgeon`, we created a version of Qwen2.5-14B that uses the vocabulary of Llama 3.1 405B.
+   - This allowed for the use of Llama 3.1 405B logits in training the Qwen-based model.
+3. **Distillation to Qwen Architecture**:
+   - The adapted Qwen2.5-14B model was trained using the stored 405B logits as the target.
+4. **Parallel Qwen Distillation**:
+   - In a separate process, Qwen2-72B was distilled into a 14B model.
+5. **Final Fusion and Fine-Tuning**:
+   - The Llama-distilled Qwen model's vocabulary was reverted to Qwen vocabulary.
+   - After re-aligning the vocabularies, a final fusion and fine-tuning step was conducted, using a specialized dataset from [EvolKit](https://github.com/arcee-ai/EvolKit) to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
 ## Performance Evaluation
@@ -61,7 +67,7 @@ SuperNova-Medius is available for use under the Apache-2.0 license. For those wh
 - **Distillation Sources**: Qwen2.5-72B-Instruct, Llama-3.1-405B-Instruct
 - **Parameter Count**: 14 billion
 - **Training Dataset**: Custom instruction dataset generated with [EvolKit](https://github.com/arcee-ai/EvolKit)
-- **Distillation Technique**: Multi-architecture logit and hidden state distillation with cross-architecture vocabulary alignment.
+- **Distillation Technique**: Multi-architecture offline logit distillation with cross-architecture vocabulary alignment.
 ## Summary
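
Reviewer note on the added steps 1 to 3 (offline top-K logit distillation): the following is a minimal sketch of what storing the top K teacher logits and training a student against them could look like. It is not the authors' implementation. The function names `store_top_k` and `top_k_distill_loss` are hypothetical, and both the KL-divergence loss and the renormalization of teacher probabilities over the stored top K are assumptions; the added text does not name a loss (only the removed text mentioned KL-divergence). It also assumes the student already shares the teacher's vocabulary, which is the role the diff assigns to `mergekit-tokensurgeon`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def store_top_k(teacher_logits, k):
    """Offline step (hypothetical helper): keep only the k most likely
    token ids and their teacher log-probabilities for one position."""
    logp = log_softmax(teacher_logits)
    top = sorted(range(len(logp)), key=lambda i: logp[i], reverse=True)[:k]
    return [(i, logp[i]) for i in top]


def top_k_distill_loss(stored, student_logits):
    """Training step (hypothetical helper): KL(teacher || student) on the
    stored ids. Assumption: teacher probabilities are renormalized over
    the stored top-k, while the student uses its full-vocabulary softmax."""
    student_logp = log_softmax(student_logits)
    mass = sum(math.exp(lp) for _, lp in stored)  # teacher mass kept by top-k
    loss = 0.0
    for token_id, lp in stored:
        p = math.exp(lp) / mass
        loss += p * (math.log(p) - student_logp[token_id])
    return loss


if __name__ == "__main__":
    # Toy 4-token vocabulary; real vocabularies have on the order of 1e5 entries.
    stored = store_top_k([2.0, 1.0, 0.0, -1.0], k=2)
    print(stored)
    print(top_k_distill_loss(stored, [1.5, 1.0, 0.2, -0.5]))
```

Storing `(token id, log-probability)` pairs rather than full logit vectors is what keeps the offline data small: per token, storage scales with K instead of with the vocabulary size.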