NOTE: THIS MODEL IS NOT INTEGRATED WITH HUGGING FACE. Please use the version of this model converted to the newly implemented Mega architecture in transformers (link).
Moving Average Gated Attention (Mega): Pretrained LM
This repo contains pretrained weights for a language model with the Mega architecture (see paper).
I used the Mega source code (namely the MegaEncoderLayer class) and created wrappers for token embeddings and MLM prediction. This model was pretrained for 5 epochs (11.3k gradient steps) on wikitext-103, which took roughly 5 hours on a single T4 GPU (in Colab's free tier).
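As a rough sanity check on the training figures above, the per-epoch step count and approximate throughput can be worked out as in the sketch below. Both inputs are rounded in this README ("11.3k" steps, "roughly 5 hours"), so the results are estimates only. The function name `training_summary` is made up for illustration and is not part of the Mega repo or the Colab notebook.

```python
# Figures quoted in this README; both are rounded, so the results are estimates.
EPOCHS = 5
GRADIENT_STEPS = 11_300     # "11.3k gradient steps"
WALL_CLOCK_HOURS = 5.0      # "roughly 5 hours" on a single T4


def training_summary(epochs, steps, hours):
    """Derive per-epoch and per-step figures from the totals (illustrative helper)."""
    seconds = hours * 3600
    return {
        "steps_per_epoch": steps / epochs,
        "seconds_per_step": seconds / steps,
        "steps_per_hour": steps / hours,
    }


summary = training_summary(EPOCHS, GRADIENT_STEPS, WALL_CLOCK_HOURS)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, each epoch is about 2,260 gradient steps and each step takes about 1.6 seconds. The batch size is not stated here, so these numbers cannot be converted to tokens per second without the Colab notebook.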
See the Colab notebook for further training details. To load the pretrained weights for this model, you'll need to use the Mega repo along with the example code at the end of the Colab notebook.