license: apache-2.0
MoH: Multi-Head Attention as Mixture-of-Head Attention
Paper or resources for more information: [Paper] [Code]
⚡ Overview
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages:
- First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
- Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential (see the sketch below).
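To make these two points concrete, here is a minimal PyTorch sketch of MoH-style attention. It is an illustration under simplifying assumptions, not the authors' implementation: the module name, the plain Top-K routing rule, and the omission of MoH's always-active shared heads are choices made here for brevity.

```python
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    """Sketch of Mixture-of-Head attention: each token routes to a subset of heads."""

    def __init__(self, dim: int, num_heads: int = 8, heads_per_token: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.top_k = heads_per_token                  # e.g. 6 of 8 heads = 75% activated
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)       # scores every head for every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, H, N, head_dim)

        # Ordinary scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads = attn.softmax(dim=-1) @ v              # (B, H, N, head_dim)

        # Routing: every token keeps only its Top-K heads; their softmaxed scores
        # become the weights of the weighted summation over head outputs.
        scores = self.router(x)                       # (B, N, H)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, top_idx, top_val.softmax(dim=-1))

        heads = heads.permute(0, 2, 1, 3)             # (B, N, H, head_dim)
        out = (gates.unsqueeze(-1) * heads).reshape(B, N, C)
        return self.proj(out)


# Example: 2 sequences of 16 tokens, model width 512.
y = MoHAttention(dim=512)(torch.randn(2, 16, 512))   # y.shape == (2, 16, 512)
```

Because the gates are applied before the output projection, this is equivalent to a weighted summation of the per-head contributions, which is the flexibility the second point above refers to.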
😮 Highlights
💡 General Framework
We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.
| Code | HuggingFace Model |
|---|---|
| MoH-ViT | 🤗 MoH-ViT-B-75, MoH-ViT-B-50, MoH-ViT-S-80, MoH-ViT-S-75 |
| MoH-DiT | 🤗 MoH-DiT-90 |
| MoH-LLaMA3-8B | 🤗 MoH-LLaMA3-8B |
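The checkpoints above are hosted on the Hugging Face Hub. As an illustration only, they could be fetched with the official `huggingface_hub` client; the repository id below is a placeholder, not a confirmed path, so substitute the exact id linked in the table.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the exact id linked in the table above.
local_dir = snapshot_download(repo_id="<organization>/MoH-LLaMA3-8B")
print("Checkpoint files downloaded to:", local_dir)
```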
🔥 High Performance
Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms standard multi-head attention while using only 50%–90% of the attention heads.
🤗 Support for Continue-Tuning Starting from Multi-Head Attention Models
We demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads.
The MoH model quickly recovers to over 95% of the original model's performance within a training budget of 10B tokens, and its performance then continues to improve as more training tokens are used.
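As a rough sketch of what such continue-tuning could look like (an assumption for illustration, not the authors' published recipe), the pre-trained QKV and output projections of a multi-head layer could be copied into the `MoHAttention` module sketched in the Overview, with only the newly added router trained from scratch:

```python
import torch.nn as nn


def init_moh_from_mha(moh_attn: "MoHAttention", mha_qkv: nn.Linear, mha_proj: nn.Linear):
    """Reuse pre-trained multi-head attention weights when switching to MoH.

    `moh_attn` is the MoHAttention sketch from the Overview; `mha_qkv` and
    `mha_proj` are the fused QKV and output projections of the original layer.
    The router keeps its fresh initialization and is learned during continue-tuning.
    """
    moh_attn.qkv.load_state_dict(mha_qkv.state_dict())
    moh_attn.proj.load_state_dict(mha_proj.state_dict())
    return moh_attn
```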
✏️ Citation
If you find this paper useful, please consider starring this repo and citing our paper:
@article{jin2024moh,
  title={MoH: Multi-Head Attention as Mixture-of-Head Attention},
  author={Peng Jin and Bo Zhu and Li Yuan and Shuicheng Yan},
  year={2024}
}