|
--- |
|
library_name: transformers |
|
tags: |
|
- moe |
|
- moah |
|
- mod |
|
license: apache-2.0 |
|
datasets: |
|
- Locutusque/UltraTextbooks |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for MoM: Mixture of Mixture |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
|
|
MoM: Mixture of Mixture |
|
|
|
This model is an experiment that combines the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with 1.58-bit linear layers (**except in the attention layers**), mixture of attention heads (MoAH), and mixture of depths (MoD). |
|
|
|
The goal is to develop this kind of architecture and test whether it can deliver fast inference without too much quality loss. |
|
|
|
Only 17.8M of the 1025M parameters are in bf16 precision, which is ~1.7% of the total number of parameters. |
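
The ratio above can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming that both figures (17.8 and 1025) are parameter counts in millions; the constant and function names are illustrative and do not come from the repository.

```python
# Parameter counts in millions, as reported in this card.
# Assumption: "1025" is the total parameter count expressed in millions.
BF16_PARAMS_M = 17.8
TOTAL_PARAMS_M = 1025.0


def bf16_share(bf16_params: float, total_params: float) -> float:
    """Return the percentage of parameters that stay in bf16."""
    return 100.0 * bf16_params / total_params


share = bf16_share(BF16_PARAMS_M, TOTAL_PARAMS_M)
print(f"bf16 share: {share:.1f}%")  # prints "bf16 share: 1.7%"
```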
|
|
|
|
|
- **Model type:** Mixture of attention heads, mixture of depths, and mixture of experts with 1.58-bit linear layers (**except in the attention layers**) |
|
- **License:** Apache License 2.0 |
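
To illustrate what a 1.58-bit linear layer means, here is a minimal pure-Python sketch. The name comes from the fact that each weight takes one of three values (-1, 0, 1), and log2(3) ≈ 1.58 bits. This card does not state which quantization scheme the repository uses, so the absmean scaling below (the scheme described for BitNet b1.58) is an assumption, and `ternary_quantize` and `ternary_linear` are hypothetical helper names, not functions from the repository.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Quantize a 2-D weight list to {-1, 0, 1} with a single absmean scale.

    Assumption: absmean scaling as in BitNet b1.58; the actual model may differ.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return quantized, scale


def ternary_linear(x, quantized, scale):
    """Compute y = (scale * Q) @ x for an input vector x (no bias)."""
    return [scale * sum(q * xi for q, xi in zip(row, x)) for row in quantized]


weights = [[0.9, -0.05, -1.2], [0.02, 0.4, -0.6]]
q, s = ternary_quantize(weights)
print(q)                       # [[1, 0, -1], [0, 1, -1]]
print(round(math.log2(3), 2))  # 1.58
```

With ternary weights the matrix product reduces to additions and subtractions followed by one multiplication by the scale, which is where the hoped-for inference speedup would come from.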
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** https://github.com/ostix360/optimized-LLM |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
To test this model, use the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/04cae61fb252a5927756c86ec0efde32d0dd3794). |
|
|
|
|
|
## Training Details |
|
|
|
- **wandb:** [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/68hieuwt) |
|
|
|
### Training Data |
|
|
|
We trained this model on the first 100k samples of Locutusque/UltraTextbooks. |
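
One way to select such a subset is the split-slicing syntax of the Hugging Face `datasets` library. The sketch below is not taken from the repository: it assumes that the subset consists of the first 100,000 rows of a split named `train`, and `first_n_split` and `load_training_subset` are hypothetical helper names.

```python
def first_n_split(n: int, split: str = "train") -> str:
    """Build a split-slicing string such as "train[:100000]"."""
    return f"{split}[:{n}]"


def load_training_subset(n: int = 100_000):
    """Load the first n rows of Locutusque/UltraTextbooks.

    Assumption: the data lives in a split named "train".
    Requires the `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    return load_dataset("Locutusque/UltraTextbooks", split=first_n_split(n))


print(first_n_split(100_000))  # prints "train[:100000]"
```

Calling `load_training_subset()` downloads the dataset, so it is left out of the example run.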
|
|
|
### Training Procedure |
|
|
|
We used the 8-bit Adam optimizer with the default betas and epsilon values. |
|
|
|
#### Preprocessing |
|
|
|
|
|
The data is fit to the model's maximum length, i.e. 512 tokens. |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
See the wandb metadata file or the `train.py` file in the repository for the hyperparameters. |
|
|
|
|
|
## Technical Specifications |
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
- One NVIDIA RTX 4070 Ti GPU |
|
|
|
#### Software |
|
|
|
- PyTorch, Transformers, etc. |
|
|