---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
inference: false
---

# Monarch Mixer-BERT

An 80M-parameter checkpoint of M2-BERT, pretrained with a sequence length of 32768.

**This is a BERT-style model that has not been fine-tuned. We recommend fine-tuning it for your specific use case before using it.**

Check out the paper [Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture](https://arxiv.org/abs/2310.12109) and our [blog post]() on retrieval for more on how we trained this model for long sequences.

This model was trained by Jon Saad-Falcon, Dan Fu, and Simran Arora.

Check out our [GitHub](https://github.com/HazyResearch/m2/tree/main) for instructions on how to download and fine-tune it!

## How to use

You can load this model using Hugging Face `AutoModelForMaskedLM`:

```python
from transformers import AutoModelForMaskedLM

# trust_remote_code=True lets transformers run the custom model code
# shipped in the model repository.
model = AutoModelForMaskedLM.from_pretrained(
    "togethercomputer/m2-bert-80M-32k-retrieval",
    trust_remote_code=True
)
```

Expect a long error message about unused FlashFFTConv parameters when the model loads.
To load the model with FlashFFTConv instead, see the instructions on our [GitHub](https://github.com/HazyResearch/m2/tree/main).
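
Since this checkpoint is tagged `fill-mask`, a common next step after loading is to read predictions off the masked positions. The sketch below is a minimal, model-independent illustration of that decoding step in plain Python. The helper name `top_k_at_mask` and the toy ids and scores are hypothetical and not part of the M2 codebase. It assumes you already have one row of vocabulary scores per input position, however you obtained them.

```python
from typing import Dict, List, Sequence


def top_k_at_mask(
    input_ids: Sequence[int],
    logits: Sequence[Sequence[float]],
    mask_token_id: int,
    k: int = 5,
) -> Dict[int, List[int]]:
    """Map each masked position to its k highest-scoring vocabulary ids."""
    predictions: Dict[int, List[int]] = {}
    for position, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue
        scores = logits[position]
        # Rank vocabulary ids by score, highest first.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        predictions[position] = ranked[:k]
    return predictions


# Toy data (hypothetical): 4 tokens, a 4-entry vocabulary, id 3 as the mask token.
toy_input_ids = [0, 2, 3, 1]
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 2.5, -1.0, 0.7],
    [0.0, 0.0, 0.0, 0.0],
]
print(top_k_at_mask(toy_input_ids, toy_logits, mask_token_id=3, k=2))  # {2: [1, 3]}
```

In real use, the ids would come from a tokenizer and the scores from the model's output, converted to nested lists. Because the model has not been fine-tuned, treat any such predictions as a sanity check rather than a finished result.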

## Acknowledgments

Alycia Lee helped with AutoModel support.

## Citation

If you use this model, or otherwise find our work valuable, you can cite us as follows:

```
@inproceedings{fu2023monarch,
  title={Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture},
  author={Fu, Daniel Y and Arora, Simran and Grogan, Jessica and Johnson, Isys and Eyuboglu, Sabri and Thomas, Armin W and Spector, Benjamin and Poli, Michael and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
```