---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- moah
- mod
- mh-moe
datasets:
- Locutusque/UltraTextbooks
---

# Model Card for Model ID

## Model Details

### Model Description

MoM: Mixture of Mixture

This model is a first test that combines the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with bf16 linear layers, mixture of attention heads, and **multi-head** mixture of depth.

The goal is to develop this kind of architecture and test whether it allows fast inference without too much quality loss.

- **Model type:** Mixture of attention heads, mixture of depth, and mixture of experts, with bf16 linear layers
- **License:** Apache License 2.0

### Model Sources

- **Repository:** https://github.com/ostix360/optimized-LLM

## How to Get Started with the Model

This model has a generation problem caused by a softmax applied in the MoD (mixture of depth) process.

If you want to test this model, please use the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/1f937b3c35074c9eb48ccde52677bb0439f71960).

## Training Details

- **wandb**: [training detail](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/ygwwa30r)

### Training Data

We use the first ~0.5B tokens of Locutusque/UltraTextbooks to train this model.

### Training Procedure

We use 8-bit Adam with the default betas and epsilon values.

#### Preprocessing

The data is fit to the model's max length, i.e. 512 tokens.

#### Training Hyperparameters

Please see the wandb metadata or the train.py file in the repo for the hyperparameters.

## Technical Specifications

### Compute Infrastructure

#### Hardware

- One 4070 Ti GPU

#### Software

- PyTorch, transformers, etc.
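
As a rough illustration of the preprocessing note above (data fit to the 512-token max length), here is a minimal sketch of splitting an already-tokenized stream into fixed-size blocks. This card does not say whether the data is truncated, chunked, or packed, so the chunking strategy, the `chunk_token_ids` helper, and the `drop_last` option are all assumptions made for illustration; the actual procedure is in the train.py file of the repo.

```python
from typing import List

MAX_LENGTH = 512  # model max length stated in this card


def chunk_token_ids(
    token_ids: List[int],
    max_length: int = MAX_LENGTH,
    drop_last: bool = True,
) -> List[List[int]]:
    """Split a flat list of token ids into consecutive blocks of max_length.

    Hypothetical helper: the real training script may truncate or pack
    the data differently.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    blocks = [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]
    # Optionally discard a trailing block that is shorter than max_length,
    # so every training example has exactly the same length.
    if drop_last and blocks and len(blocks[-1]) < max_length:
        blocks.pop()
    return blocks


if __name__ == "__main__":
    fake_ids = list(range(1100))  # stand-in for real tokenizer output
    blocks = chunk_token_ids(fake_ids)
    print(len(blocks), [len(b) for b in blocks])
```

With `drop_last=True` every example has the full length and no padding is needed, at the cost of discarding a short tail; setting it to `False` keeps the tail, which would then need padding before batching.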