
	---
	license: apache-2.0
	---
	# MoH: Multi-Head Attention as Mixture-of-Head Attention

	Paper or resources for more information:
	[[Paper](https://huggingface.co/papers/2410.11842)] [[Code](https://github.com/SkyworkAI/MoH)]

	## ⚡ Overview
	We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages:
	* First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters.
	* Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential.
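The two points above can be illustrated with a minimal sketch for a single token. This is not the released implementation: the function name `moh_combine`, the `active_fraction` parameter, and the choice of a softmax over the selected router scores are all assumptions made for illustration, and the routing used in the paper may differ in its details. The sketch only shows the general idea of selecting a subset of heads per token and combining their outputs with a weighted sum instead of a plain sum.

```python
import math


def moh_combine(head_outputs, router_logits, active_fraction=0.75):
    """Combine per-head outputs for ONE token (illustrative sketch only).

    head_outputs:    list of H vectors (lists of floats), one per attention head,
                     assumed to be already projected to the model dimension.
    router_logits:   list of H floats, a hypothetical router score per head.
    active_fraction: assumed fraction of heads kept active for this token.
    """
    num_heads = len(head_outputs)
    k = max(1, round(num_heads * active_fraction))

    # Token-wise selection: keep the k heads with the highest router scores.
    selected = sorted(range(num_heads), key=lambda i: router_logits[i], reverse=True)[:k]

    # Assumed gating: a numerically stable softmax over the selected scores only.
    max_logit = max(router_logits[i] for i in selected)
    exps = {i: math.exp(router_logits[i] - max_logit) for i in selected}
    total = sum(exps.values())
    weights = {i: exps[i] / total for i in selected}

    # Weighted summation of the selected heads; unselected heads contribute nothing.
    dim = len(head_outputs[0])
    output = [0.0] * dim
    for i in selected:
        for d in range(dim):
            output[d] += weights[i] * head_outputs[i][d]
    return output, weights


# Toy example: 4 heads, 2-dimensional outputs, 75% of heads active (3 of 4).
heads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
logits = [0.0, 0.0, 0.0, -5.0]
output, weights = moh_combine(heads, logits, active_fraction=0.75)
print(sorted(weights))  # indices of the heads that were kept for this token
print(output)
```

In this toy setup the head with the lowest router score is skipped entirely, which is where the inference savings would come from, while the remaining heads are mixed with token-specific weights rather than equal ones.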



	## 😮 Highlights
	### 💡 General Framework
	We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.

	<div align=center>

	| Code | HuggingFace Model |
	|:---:|:---:|
	| [MoH-ViT](https://github.com/SkyworkAI/MoH/tree/main/MoH-ViT) | 🤗 [MoH-ViT-B-75](https://huggingface.co/Chat-UniVi/MoH-ViT-B-75), [MoH-ViT-B-50](https://huggingface.co/Chat-UniVi/MoH-ViT-B-50), [MoH-ViT-S-80](https://huggingface.co/Chat-UniVi/MoH-ViT-S-80), [MoH-ViT-S-75](https://huggingface.co/Chat-UniVi/MoH-ViT-S-75) |
	| [MoH-DiT](https://github.com/SkyworkAI/MoH/tree/main/MoH-DiT) | 😊 [MoH-DiT-XL-90](https://huggingface.co/Chat-UniVi/MoH-DiT-XL-90) |
	| [MoH-LLaMA3-8B](https://github.com/SkyworkAI/MoH/tree/main/MoH-LLaMA3) | 😊 [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B) |

	</div>

	### 🔥 High Performance
	Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50% to 90% of the attention heads.

	### 🤗 Supports Continue-Tuning from Multi-Head Attention Models
	We demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads.


	The MoH model quickly recovers to over 95% of the original model's performance within a training budget of 10B tokens. Performance then improves gradually as the number of training tokens increases.


	## ✏️ Citation
	If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
	```
	@article{jin2024moh,
	title={MoH: Multi-Head Attention as Mixture-of-Head Attention},
	author={Peng Jin and Bo Zhu and Li Yuan and Shuicheng Yan},
	year={2024}
	}
	```