File size: 6,834 Bytes
64300ed
 
1d2dc7d
 
 
 
 
 
64300ed
1d2dc7d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7397626
d050b3f
590b4b9
8499a6e
590b4b9
1d2dc7d
 
 
9d4a4d9
6da2969
 
216d5f6
41e3a67
7af1936
1d2dc7d
 
 
 
4a8c07f
 
 
 
7af1936
 
144fe25
 
 
 
 
8fd5614
 
144fe25
1d2dc7d
 
 
 
7af1936
 
 
 
 
 
 
 
1d2dc7d
 
 
7af1936
 
 
 
 
 
1d2dc7d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0226389
 
 
 
 
 
 
 
 
1d2dc7d
 
 
8499a6e
1d2dc7d
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- ' TransNormerLLM'
---

<div align="center">
<h1>
  TransNormerLLM3 -- A Faster and Better LLM
</h1>
</div>

# Introduction

This official repository unveils the TransNormerLLM3 model along with its open-source weights for every 50 billion tokens processed during pre-training.

[TransNormerLLM](https://arxiv.org/abs/2307.14995) evolving from [TransNormer](https://arxiv.org/abs/2210.10340), standing out as the first LLM within the linear transformer architecture. Additionally, it distinguishes itself by being the first non-Transformer LLM to exceed both traditional Transformer and other efficient Transformer models (such as, RetNet and Mamba) in terms of speed and performance.


# TransNormerLLM3
- **TransNormerLLM3-15B** features **14.83 billion** parameters. It is structured with **42 layers**, includes **40 attention heads**, and has a total **embedding size of 5120**.
- **TransNormerLLM3-15B** is purely intergrated with **[Lightning Attention-2](http://arxiv.org/abs/2401.04658)**, which can maintain a **stable TGS** during training of **unlimited sequence lengths**, up until encountering firm limitations like GPU memory constraints.   
- **Titoken** tokenizer is used with a total **vocabulary size** of about **100,000**. 
<p align="center">
  <img src="./images/TransNormer3.jpg" width="65%" />
</p>

### Pre-training Logbook
* Realtime Track: https://api.wandb.ai/links/opennlplab/kip314lq  
* Join to dicussion: [discord](https://discord.gg/JEU3nTcWKC) <<<>>> [wechat group](https://github.com/OpenNLPLab/TransnormerLLM/blob/main/images/contact_me_qr.png)
  
> --23.12.25-- startup: [WeChat - 预训练启航](https://mp.weixin.qq.com/s/YjUY-uy89WkF75_-rBTuKw)  <<<>>>  [Twitter - Pre-training Commences ](https://twitter.com/opennlplab/status/1739568669502611825) <<<>>> [YouTube Recording](https://t.co/wk7svS4o5r)   <<<>>> [bilibili 回放](https://www.bilibili.com/video/BV11j411J7Dy)  
> --24.01.02-- first week review: [WeChat - 第一周概览](https://mp.weixin.qq.com/s/zwGnZZI3itNPoxzzXkuU2w) <<<>>>   [Twitter - First Week Review](https://twitter.com/opennlplab/status/1742187694078501038)  
> --24.01.09-- second week review: [WeChat - 第二周概览](https://mp.weixin.qq.com/s/6D0qi-0aBier05OKuHfPEA) <<<>>>   [Twitter - Second Week Review](https://twitter.com/opennlplab/status/1744720007299523063)  
> --24.01.15-- third week review: [WeChat - 第三周概览](https://mp.weixin.qq.com/s/EQg8evZ2cNtAk4HruwCXPA) <<<>>>   [Twitter - Third Week Review](https://twitter.com/opennlplab/status/1746920293069910190)  


# Released Weights

|  param  | token |                                                      Hugging Face                                                      | Model Scope | Wisemodel |
| :-----: | :---: | :--------------------------------------------------------------------------------------------------------------------: | :---------: | :-------: |
| **15B** |  50B  | 🤗[step13000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step13000-50Btokens)  |      🤖      |     🐯     |
| **15B** | 100B  | 🤗[step26000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step26000-100Btokens) |      🤖      |     🐯     |
| **15B** | 150B  | 🤗[step39000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step39000-150Btokens) |      🤖      |     🐯     |
| **15B** | 200B  | 🤗[step52000](https://huggingface.co/OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints/tree/step52000-200Btokens) |      🤖      |     🐯     |


```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", revision='step26000-100Btokens', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints", torch_dtype=torch.bfloat16, revision='step26000-100Btokens', device_map="auto", trust_remote_code=True)
```

# Benchmark Results
The evaluations of all models are conducted using the official settings and the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.

| Model                   | P   | T    | BoolQ | PIQA  | HS    | WG    | ARC-e | ARC-c | OBQA  | C-Eval | MMLU  |
| ----------------------- | --- | ---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ----- |
| **TransNormerLLM3-15B** | 15  | 0.05 | 62.08 | 72.52 | 55.55 | 57.14 | 62.12 | 31.14 | 32.40 | 26.18  | 27.50 |
| **TransNormerLLM3-15B** | 15  | 0.10 | 63.98 | 74.70 | 61.09 | 61.33 | 65.95 | 34.64 | 35.60 | 25.38  | 27.40 |
| **TransNormerLLM3-15B** | 15  | 0.15 | 60.34 | 75.08 | 63.99 | 62.04 | 64.56 | 34.90 | 35.20 | 22.64  | 26.60 |
| **TransNormerLLM3-15B** | 15  | 0.20 | 52.05 | 74.48 | 64.72 | 62.75 | 66.16 | 35.15 | 36.80 | 27.25  | 30.80 |
| **TransNormerLLM3-15B** | 15  | 0.25 | 66.70 | 76.50 | 66.51 | 64.80 | 66.84 | 36.18 | 39.40 | 30.87  | 36.10 |
| **TransNormerLLM3-15B** | 15  | 0.30 | 67.00 | 76.50 | 67.17 | 64.40 | 66.29 | 36.77 | 38.80 | 33.99  | 37.60 |

> **P**: parameter size (billion). **T**: tokens (trillion). **BoolQ**: acc. **PIQA**: acc. **HellaSwag**: acc_norm. **WinoGrande**: acc. **ARC-easy**: acc. **ARC-challenge**: acc_norm. **OpenBookQA**: acc_norm. **MMLU**: 5-shot acc. **C-Eval**: 5-shot acc.

```bash
# Please configure the following settings when do evaluation
export do_eval=True
export use_triton=False
```


# Acknowledgments and Citation

## Acknowledgments
Our project is developed based on the following open source projects:
- [tiktoken](https://github.com/openai/tiktoken) for the tokenizer.
- [metaseq](https://github.com/facebookresearch/metaseq) for training.
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluation.


## Citation
If you wish to cite our work, please use the following reference:
```
@article{qin2023scaling,
  title={Scaling transnormer to 175 billion parameters},
  author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
  journal={arXiv preprint arXiv:2307.14995},
  year={2023}
}

@misc{qin2024lightning,
      title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models}, 
      author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
      year={2024},
      eprint={2401.04658},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

<p align="center">
  <img src="./images/lightning3-leopard.jpg" width="50%" />
  - OpenNLPLab @2024 -
</p>