arxiv:2401.04081

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Published on Jan 8

Submitted by akhaliq on Jan 9

#2 Paper of the day

Authors:

Maciej Pióro

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer.
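To make the idea of combining a sequence-mixing layer with MoE concrete, here is a minimal, dependency-free sketch of one block: a sequence mixer followed by a top-1 routed feed-forward MoE layer, each with a residual connection. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not spell out the layer layout, so the ordering, the top-1 routing rule, and all names (`Top1MoE`, `toy_sequence_mixer`, `moe_ssm_block`) are hypothetical. In particular, `toy_sequence_mixer` is a fixed linear recurrence standing in for a Mamba layer; it is not selective and is not Mamba.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; the initialization scheme is an arbitrary choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Top1MoE:
    """Feed-forward MoE layer (hypothetical): each token goes to one expert."""

    def __init__(self, d_model, d_hidden, n_experts, seed=0):
        rng = random.Random(seed)
        self.router = rand_matrix(rng, n_experts, d_model)
        # Each expert is a two-layer MLP: (input weights, output weights).
        self.experts = [
            (rand_matrix(rng, d_hidden, d_model),
             rand_matrix(rng, d_model, d_hidden))
            for _ in range(n_experts)
        ]

    def route(self, x):
        """Return (expert index, gate probability) for one token vector."""
        probs = softmax(matvec(self.router, x))
        idx = max(range(len(probs)), key=probs.__getitem__)
        return idx, probs[idx]

    def __call__(self, x):
        idx, gate = self.route(x)
        w_in, w_out = self.experts[idx]
        hidden = [max(0.0, h) for h in matvec(w_in, x)]  # ReLU
        # Scale the chosen expert's output by its router probability.
        return [gate * y for y in matvec(w_out, hidden)]


def toy_sequence_mixer(tokens, decay=0.9):
    """Stand-in for a Mamba layer: h_t = decay * h_{t-1} + x_t (NOT selective)."""
    state = [0.0] * len(tokens[0])
    out = []
    for x in tokens:
        state = [decay * s + xi for s, xi in zip(state, x)]
        out.append(list(state))
    return out


def moe_ssm_block(tokens, moe):
    """One block: sequence mixer, then per-token MoE, each with a residual."""
    mixed = toy_sequence_mixer(tokens)
    h = [[a + b for a, b in zip(x, m)] for x, m in zip(tokens, mixed)]
    return [[a + b for a, b in zip(x, moe(x))] for x in h]
```

For example, `moe_ssm_block([[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]], Top1MoE(d_model=4, d_hidden=8, n_experts=3))` returns two 4-dimensional vectors. The point of the sketch is that only one expert's weights are used per token, so parameter count can grow with the number of experts while per-token compute stays roughly constant.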


Community

patruff

Jan 9

Well done!

Jan 9

This is so exciting!! I was hoping someone would try this out. Kudos!!

Jan 9

Could be interesting to see if it works better with the fast feed forward

venkycs

Jan 13

While the Transformer architecture promotes generation, I feel Mamba does compression.

Jan 20

Is there any implementation of this I could find?

Jun 9

MoE-Mamba: Revolutionizing Language Models with Efficiency and Scalability

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

