---
license: apache-2.0
---

# MoH: Multi-Head Attention as Mixture-of-Head Attention

**Paper or resources for more information:**

[[Paper](https://huggingface.co/papers/2410.11842)] [[Code](https://github.com/SkyworkAI/MoH)]

## ⚡ Overview

We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages:

* First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
* Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential.

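
The two points above can be illustrated with a small NumPy sketch. It relies on the fact that concatenating the heads and applying the output projection is the same as summing each head's own projected output; MoH then weights that sum per token and keeps only a subset of heads. This is a simplified illustration under stated assumptions, not the released implementation: the function name `moh_attention`, the linear router `w_router`, the plain top-k selection, and the softmax over the selected scores are all assumptions made for the example, attention is non-causal, and the routing design used in the paper and repository may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moh_attention(x, w_q, w_k, w_v, w_o, w_router, top_k):
    """Illustrative mixture-of-head attention (hypothetical helper).

    x:        (T, d)       token features
    w_q/k/v:  (h, d, d_h)  per-head projections
    w_o:      (h, d_h, d)  per-head slices of the output projection
    w_router: (d, h)       assumed linear router, one score per head
    """
    d_head = w_q.shape[-1]

    # Ordinary scaled dot-product attention, computed head by head.
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    attn = softmax(np.einsum("hte,hse->hts", q, k) / np.sqrt(d_head))
    heads = np.einsum("hts,hse->hte", attn, v)        # (h, T, d_h)

    # Each head's contribution to the output, kept separate: (T, h, d).
    # Summing over h with unit weights gives standard multi-head attention.
    head_out = np.einsum("hte,hed->thd", heads, w_o)

    # Each token scores every head and keeps its top_k heads.
    logits = x @ w_router                             # (T, h)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)

    # Assumption: gates are a softmax over the selected scores only,
    # so unselected heads get a weight of exactly zero.
    gates = softmax(np.where(mask, logits, -np.inf))  # (T, h)

    # Weighted summation over heads instead of a plain sum.
    out = np.einsum("th,thd->td", gates, head_out)
    return out, gates

rng = np.random.default_rng(0)
T, d, h, d_h = 5, 16, 8, 4
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (0.1 * rng.normal(size=(h, d, d_h)) for _ in range(3))
w_o = 0.1 * rng.normal(size=(h, d_h, d))
w_router = 0.1 * rng.normal(size=(d, h))

out, gates = moh_attention(x, w_q, w_k, w_v, w_o, w_router, top_k=6)
print(out.shape)                  # (5, 16)
print((gates > 0).sum(axis=-1))   # number of active heads per token
```

For clarity the sketch computes every head and then masks the unselected ones. An efficient implementation would skip the unselected heads for each token, which is where the inference savings would come from.
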

## 😮 Highlights

### 💡 General Framework

We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.

<div align=center>

| Code | HuggingFace Model |
|:---:|:---:|
| **[MoH-ViT](https://github.com/SkyworkAI/MoH/tree/main/MoH-ViT)** | 🤗 [MoH-ViT-B-75](https://huggingface.co/Chat-UniVi/MoH-ViT-B-75), [MoH-ViT-B-50](https://huggingface.co/Chat-UniVi/MoH-ViT-B-50), [MoH-ViT-S-80](https://huggingface.co/Chat-UniVi/MoH-ViT-S-80), [MoH-ViT-S-75](https://huggingface.co/Chat-UniVi/MoH-ViT-S-75) |
| **[MoH-DiT](https://github.com/SkyworkAI/MoH/tree/main/MoH-DiT)** | [MoH-DiT-90](https://huggingface.co/Chat-UniVi/MoH-DiT-XL-90) |
| **[MoH-LLaMA3-8B](https://github.com/SkyworkAI/MoH/tree/main/MoH-LLaMA3)** | [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B) |

</div>
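
The checkpoints in the table can be fetched from the Hub with the `huggingface_hub` client. The snippet below is a minimal sketch that assumes `huggingface_hub` is installed. `MOH_REPOS` and `download_checkpoint` are hypothetical helper names, and the repository IDs are copied from the links in the table. It only downloads files: building a model from them requires the code in the matching GitHub directory, which is not shown here.

```python
# Repository IDs taken from the model links in the table above.
MOH_REPOS = {
    "MoH-ViT-B-75": "Chat-UniVi/MoH-ViT-B-75",
    "MoH-ViT-B-50": "Chat-UniVi/MoH-ViT-B-50",
    "MoH-ViT-S-80": "Chat-UniVi/MoH-ViT-S-80",
    "MoH-ViT-S-75": "Chat-UniVi/MoH-ViT-S-75",
    "MoH-DiT-XL-90": "Chat-UniVi/MoH-DiT-XL-90",
    "MoH-LLaMA3-8B": "Chat-UniVi/MoH-LLaMA3-8B",
}

def download_checkpoint(name: str) -> str:
    """Download one repository snapshot and return its local directory.

    Hypothetical helper; the import is done lazily so that defining the
    function does not require huggingface_hub or network access.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=MOH_REPOS[name])

# Example call (downloads files, so it is left commented out):
# local_dir = download_checkpoint("MoH-ViT-S-75")
```
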

### 🔥 High Performance

Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only **50% to 90%** of the attention heads.

### 🤗 Support Continue-Tuning Starting from the Multi-Head Attention Models

We demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads.

The MoH model quickly recovers to over **95%** of the performance of the original model within a training budget of 10B tokens. Performance then improves gradually as the number of training tokens increases.

## ✏️ Citation

If you find this paper useful, please consider starring this repo and citing our paper:

```
@article{jin2024moh,
  title={MoH: Multi-Head Attention as Mixture-of-Head Attention},
  author={Peng Jin and Bo Zhu and Li Yuan and Shuicheng Yan},
  year={2024}
}
```