File size: 5,774 Bytes
64300ed 1d2dc7d 64300ed 1d2dc7d 7397626 d050b3f 1d2dc7d 6da2969 216d5f6 41e3a67 1d2dc7d 144fe25 1d2dc7d 41e3a67 1d2dc7d 0226389 1d2dc7d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- ' TransNormerLLM'
---
<div align="center">
<h1>
TransNormerLLM3 -- A Faster and Better LLM
</h1>
</div>
# Introduction
This official repository unveils the TransNormerLLM3 model along with its open-source weights for every 50 billion tokens processed during pre-training.
[TransNormerLLM](https://arxiv.org/abs/2307.14995) evolving from [TransNormer](https://arxiv.org/abs/2210.10340), standing out as the first LLM within the linear transformer architecture. Additionally, it distinguishes itself by being the first non-Transformer LLM to exceed both traditional Transformer and other efficient Transformer models (such as, RetNet and Mamba) in terms of speed and performance.
# TransNormerLLM3
- **TransNormerLLM3-15B** features **14.83 billion** parameters. It is structured with **42 layers**, includes **40 attention heads**, and has a total **embedding size of 5120**.
- **TransNormerLLM3-15B** is purely intergrated with **[Lightning Attention-2](http://arxiv.org/abs/2401.04658)**, which can maintain a **stable TGS** during training of **unlimited sequence lengths**, up until encountering firm limitations like GPU memory constraints.
- **Titoken** tokenizer is used with a total **vocabulary size** of about **100,000**.
- It incorporates **Simple GLU** for its channel mixer, **GLA** in the token mixer, and **SRMSNorm** for normalization.
- In terms of position encoding, the first layer employs **LRPE with exponential decay**, whereas the subsequent layers continue with **exponential decay encoding**.
### Pre-training Logbook
* Realtime Track: https://api.wandb.ai/links/opennlplab/kip314lq
* Join to dicussion: [discord](https://discord.gg/MYQh6BWN) <<<>>> [wechat group](https://github.com/OpenNLPLab/TransnormerLLM/blob/main/images/contact_me_qr.png)
> --23.12.25-- startup: [WeChat - 预训练启航](https://mp.weixin.qq.com/s/YjUY-uy89WkF75_-rBTuKw) <<<>>> [Twitter - Pre-training Commences ](https://twitter.com/opennlplab/status/1739568669502611825) <<<>>> [YouTube Recording](https://t.co/wk7svS4o5r) <<<>>> [bilibili 回放](https://www.bilibili.com/video/BV11j411J7Dy)
> --24.01.02-- first week review: [WeChat - 第一周概览](https://mp.weixin.qq.com/s/zwGnZZI3itNPoxzzXkuU2w) <<<>>> [Twitter - First Week Review](https://twitter.com/opennlplab/status/1742187694078501038)
> --24.01.09-- second week review: [WeChat - 第二周概览](https://mp.weixin.qq.com/s/6D0qi-0aBier05OKuHfPEA) <<<>>> [Twitter - Second Week Review](https://twitter.com/opennlplab/status/1744720007299523063)
# Released Weights
| param | token | Hugging Face | Model Scope | Wisemodel |
| :-----: | :---: | :------------------------------------------------------------------------------------------------------------------: | :---------: | :-------: |
| **15B** | 50B | 🤗[step13000](ttps://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step13000-50Btokens) | 🤖 | 🐯 |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", revision='step13000-50Btokens', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", torch_dtype=torch.bfloat16,revision='step13000-50Btokens', device_map="auto", trust_remote_code=True)
```
# Benchmark Results
The evaluations of all models are conducted using the official settings and the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.
| Model | P | T | BoolQ | PIQA | HS | WG | ARC-e | ARC-c | OBQA | MMLU | C-Eval |
| ----------------------- | --- | ---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ |
| **TransNormerLLM3-15B** | 15 | 0.05 | 62.08 | 72.52 | 55.55 | 57.14 | 62.12 | 31.14 | 32.40 | 27.50 | 26.18 |
| **TransNormerLLM3-15B** | 15 | 0.10 | 63.98 | 74.70 | 61.09 | 61.33 | 65.95 | 34.64 | 35.60 | 25.38 | 27.40 |
| **TransNormerLLM3-15B** | 15 | 0.15 | 60.34 | 75.08 | 63.99 | 62.04 | 64.56 | 34.90 | 35.20 | 22.64 | 26.60 |
> **P**: parameter size (billion). **T**: tokens (trillion). **BoolQ**: acc. **PIQA**: acc. **HellaSwag**: acc_norm. **WinoGrande**: acc. **ARC-easy**: acc. **ARC-challenge**: acc_norm. **OpenBookQA**: acc_norm. **MMLU**: 5-shot acc. **C-Eval**: 5-shot acc.
# Acknowledgments and Citation
## Acknowledgments
Our project is developed based on the following open source projects:
- [tiktoken](https://github.com/openai/tiktoken) for the tokenizer.
- [metaseq](https://github.com/facebookresearch/metaseq) for training.
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluation.
## Citation
If you wish to cite our work, please use the following reference:
```
@article{qin2023scaling,
title={Scaling transnormer to 175 billion parameters},
author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
journal={arXiv preprint arXiv:2307.14995},
year={2023}
}
@misc{qin2024lightning,
title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models},
author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
year={2024},
eprint={2401.04658},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
<p align="center">
- OpenNLPLab @2024 -
</p> |