|
--- |
|
language: |
|
- en |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- moe |
|
- moah |
|
- mod |
|
- mh-moe |
|
datasets: |
|
- Locutusque/UltraTextbooks |
|
--- |
|
|
|
# Model Card for MoM: Mixture of Mixtures
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
**MoM: Mixture of Mixtures**
|
|
|
This model is a first experiment combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with bf16 linear layers, mixture of attention heads (MoAH), and **multi-head** mixture of depths (MoD).
|
|
|
The goal is to develop and test whether this kind of architecture can deliver fast inference without losing too much quality.
|
|
|
|
|
- **Model type:** Mixture of attention heads, multi-head mixture of depths, and mixture of experts with bf16 linear layers
|
- **License:** Apache License 2.0
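
As a rough, illustrative sketch of the mixture-of-depths idea (not the exact implementation in the repository), a MoD layer scores every token with a small router, sends only the top-scoring fraction through the expensive block, and lets the rest skip it. The `capacity` value and the generic `block` module below are assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfDepths(nn.Module):
    """Illustrative mixture-of-depths wrapper (not the repository's exact code)."""

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.5):
        super().__init__()
        self.block = block                       # e.g. an attention or MLP sub-layer
        self.router = nn.Linear(hidden_size, 1)  # per-token routing score
        self.capacity = capacity                 # fraction of tokens that get computed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        batch, seq_len, hidden = x.shape
        scores = self.router(x).squeeze(-1)       # (batch, seq_len)
        # NOTE: a softmax over the whole sequence makes routing depend on
        # future tokens, which is the kind of issue mentioned for generation.
        weights = torch.softmax(scores, dim=-1)
        k = max(1, int(seq_len * self.capacity))
        top_w, top_idx = weights.topk(k, dim=-1)  # tokens routed through the block

        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, hidden)
        selected = torch.gather(x, 1, gather_idx)            # (batch, k, hidden)
        processed = selected + self.block(selected) * top_w.unsqueeze(-1)

        out = x.clone()                                       # skipped tokens pass through
        out.scatter_(1, gather_idx, processed)
        return out

# Usage sketch: wrap any per-token block, e.g. a small MLP.
layer = MixtureOfDepths(nn.Linear(256, 256), hidden_size=256, capacity=0.5)
y = layer(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```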
|
|
|
### Model Sources
|
|
|
|
|
- **Repository:** https://github.com/ostix360/optimized-LLM |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
This model has a generation issue caused by the softmax applied in the mixture-of-depths (MoD) routing step.
|
|
|
|
|
If you want to test this model, please use the repository above at this [commit](https://github.com/ostix360/optimized-LLM/tree/1f937b3c35074c9eb48ccde52677bb0439f71960).
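
For a quick loading attempt, a minimal sketch is shown below. It assumes the checkpoint and its custom modeling code are available locally (for example, produced by the training scripts at the commit above); the path is a placeholder, and generation may misbehave because of the MoD softmax issue described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at a checkpoint produced by the training
# scripts in the repository at the linked commit.
checkpoint = "./checkpoints/mom"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,  # the MoAH/MoD modules are custom code
)

inputs = tokenizer("The water cycle begins when", return_tensors="pt")
# Generation may misbehave because of the softmax used in the MoD routing.
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```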
|
|
|
|
|
## Training Details |
|
|
|
- **wandb**: [training detail](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/ygwwa30r) |
|
|
|
### Training Data |
|
|
|
We used the first ~0.5B tokens of [Locutusque/UltraTextbooks](https://huggingface.co/datasets/Locutusque/UltraTextbooks) to train this model.
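
As a rough illustration of that token budget (not the exact data pipeline), the dataset can be streamed and counted until ~0.5B tokens are reached; the `text` column name and the reference tokenizer below are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: the dataset exposes a "text" column; any tokenizer with a
# similar vocabulary gives a comparable token count.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("Locutusque/UltraTextbooks", split="train", streaming=True)

budget = 500_000_000  # ~0.5B tokens
seen = 0
for example in stream:
    seen += len(tokenizer(example["text"]).input_ids)
    if seen >= budget:
        break
```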
|
|
|
### Training Procedure |
|
|
|
We used the 8-bit Adam optimizer with the default beta and epsilon values.
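
A minimal sketch of that optimizer setup with bitsandbytes is shown below; the learning rate and the placeholder model are assumptions, while the betas and epsilon are the defaults mentioned above.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512)  # placeholder for the actual MoM model

# 8-bit Adam with default betas/epsilon; the learning rate is illustrative,
# see the wandb run for the value actually used.
optimizer = bnb.optim.Adam8bit(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
)
```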
|
|
|
#### Preprocessing
|
|
|
|
|
The data are tokenized to fit the model's maximum sequence length, i.e. 512 tokens.
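
A hedged sketch of that step is shown below, assuming a generic tokenizer and simple truncation/padding to 512 tokens; the actual chunking logic in the repository may differ (e.g. packing instead of truncation).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    # Truncate/pad every example to the 512-token context length.
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

dataset = load_dataset("Locutusque/UltraTextbooks", split="train[:1%]")
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```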
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
Please see the wandb run metadata or the train.py file in the repository for the full hyperparameters.
|
|
|
|
|
## Technical Specifications |
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
- one NVIDIA GeForce RTX 4070 Ti GPU
|
|
|
#### Software |
|
|
|
- PyTorch, Transformers, etc.
|
|