Text Generation
Transformers
PyTorch
TeleFLM
custom_code
jasonfang3900 committed on
Commit d775c28
Parent: 6cc3a53

Update README.md

Files changed (1): README.md (+85, −3)

README.md (updated):

---
license: apache-2.0
---

# Tele-FLM

Tele-FLM-1T (also known as FLM-2-1T) is an open-source multilingual large language model with roughly one trillion parameters, featuring a stable, efficient pre-training paradigm and enhanced factual judgement capabilities.
Built upon the decoder-only transformer architecture, it has been trained on approximately 2T tokens.
The Tele-FLM series demonstrates superior performance at its scale and sometimes surpasses larger models.
In addition to sharing the model weights, we provide the core designs, engineering practices, and training details, anticipating their benefits for both academic and industrial communities.
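
The weights are shared as a Hugging Face Transformers checkpoint with custom modeling code (see the Llama-adapted implementation under Training Details), so loading should look roughly like the sketch below. The repository id, dtype, and generation settings are illustrative assumptions, and a checkpoint of this size needs a large multi-GPU setup in practice.

```python
# Minimal loading sketch; the repository id and settings are assumptions,
# not taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "CofeAI/Tele-FLM-1T"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory relative to fp32
    device_map="auto",           # shard across available GPUs; a 1T model needs many
    trust_remote_code=True,      # the repo ships custom (Llama-adapted) modeling code
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```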

## Model Details

- **Developed by:** BAAI & TeleAI
- **Language(s):** English; Chinese; Other languages
- **License:** Apache 2.0

## Technical Report

- [52B to 1T: Lessons Learned via Tele-FLM Series](https://arxiv.org/pdf/2407.02783)
- [Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)

## Bias, Risks, and Limitations

Although we have made extensive efforts to thoroughly clean and filter the training corpus, the open nature of the dataset means the model may still have picked up some unsafe examples. Consequently, the model may still generate unexpected content, including but not limited to discrimination, bias, or offensive language. We strongly advise users not to spread any unsafe content generated by the model. The project developers cannot be held responsible for any repercussions stemming from the dissemination of harmful information.

## Training Details

### Model Architecture

Based on model growth techniques, Tele-FLM-1T training is divided into three stages by parameter size: 52B, 102B, and 1T. Each stage uses the same backbone structure. Tele-FLM models use the standard GPT-style decoder-only transformer architecture with a few adjustments (see the sketch below):

- Rotary Positional Embedding (RoPE)
- RMSNorm for normalization
- SwiGLU activation function
- Linear bias disabled
- Embedding and language model head untied
- Input and output multipliers

Consequently, Tele-FLM-1T is largely architecturally compatible with Llama.
To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.
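
The listed choices can be illustrated with a compact PyTorch sketch. This is a schematic re-implementation under assumptions (module names, the exact RoPE formulation, and how the input/output multipliers are applied are illustrative); it is not the released Tele-FLM modeling code.

```python
# Schematic decoder-only model with RoPE, RMSNorm, SwiGLU, no linear bias,
# an untied embedding/LM head, and input/output multipliers (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def apply_rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim); rotate channel pairs by position-dependent angles.
    _, seq, _, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(seq, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    def __init__(self, dim, ffn_dim):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)  # linear bias disabled throughout
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim, heads, ffn_dim):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.ffn = SwiGLU(dim, ffn_dim)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        q = apply_rope(q.view(b, s, self.heads, self.head_dim))
        k = apply_rope(k.view(b, s, self.heads, self.head_dim))
        v = v.view(b, s, self.heads, self.head_dim)
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        )
        x = x + self.proj(out.transpose(1, 2).reshape(b, s, d))
        return x + self.ffn(self.ffn_norm(x))


class TinyFLM(nn.Module):
    def __init__(self, vocab, dim, heads, ffn_dim, layers, input_mult=1.0, output_mult=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(DecoderBlock(dim, heads, ffn_dim) for _ in range(layers))
        self.norm = RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # untied from self.embed
        self.input_mult, self.output_mult = input_mult, output_mult

    def forward(self, tokens):
        h = self.embed(tokens) * self.input_mult               # input multiplication
        for block in self.blocks:
            h = block(h)
        return self.lm_head(self.norm(h)) * self.output_mult   # output multiplication
```

For example, `TinyFLM(vocab=80000, dim=20480, heads=160, ffn_dim=98304, layers=140)` would mirror the Tele-FLM-1T row in the table below (far too large to instantiate on a single device, of course).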

| Models        | Layers | Attention heads | Hidden size | FFN hidden size | Vocab size | Context length | Params     |
| ------------- | ------ | --------------- | ----------- | --------------- | ---------- | -------------- | ---------- |
| Tele-FLM-52B  | 64     | 64              | 8,192       | 21,824          | 80,000     | 4,096          | 52.85 B    |
| Tele-FLM-102B | 80     | 80              | 10,240      | 27,264          | 80,000     | 4,096          | 102.3 B    |
| Tele-FLM-1T   | 140    | 160             | 20,480      | 98,304          | 80,000     | 4,096          | 1,083.74 B |
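
As a rough cross-check of the table, the parameter counts can be estimated from the architecture: the untied embedding and LM head contribute 2 × vocab × hidden weights, and each layer contributes 4 × hidden² attention weights, 3 × hidden × ffn SwiGLU weights, and two small RMSNorm vectors. The sketch below assumes exactly that layout; it yields roughly 52.82 B, 102.20 B, and 1,083.74 B, within about 0.1% of the listed figures.

```python
# Approximate parameter counts from the table's configurations (estimate only;
# assumes the untied-embedding, no-bias, SwiGLU layout sketched above).
def approx_params(layers, hidden, ffn_hidden, vocab):
    embed_and_head = 2 * vocab * hidden    # untied input embedding + LM head
    attention = 4 * hidden * hidden        # Q, K, V and output projections, no bias
    swiglu = 3 * hidden * ffn_hidden       # gate, up and down projections
    norms = 2 * hidden                     # two RMSNorm weight vectors per layer
    return embed_and_head + layers * (attention + swiglu + norms) + hidden  # + final norm

configs = {
    "Tele-FLM-52B":  dict(layers=64,  hidden=8192,  ffn_hidden=21824, vocab=80000),
    "Tele-FLM-102B": dict(layers=80,  hidden=10240, ffn_hidden=27264, vocab=80000),
    "Tele-FLM-1T":   dict(layers=140, hidden=20480, ffn_hidden=98304, vocab=80000),
}
for name, cfg in configs.items():
    print(f"{name}: ~{approx_params(**cfg) / 1e9:,.2f} B parameters")
```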

### Hardware

Tele-FLM-1T was trained on a cluster of 112 A800 SXM4 GPU servers, each with 8 NVLink-connected A800 GPUs and 2 TB of RAM.
The nodes have varied CPU configurations: 96 nodes with Intel 8358 CPUs (128 × 2.60 GHz) and 16 nodes with AMD 7643 CPUs (96 × 2.30 GHz).
All nodes are interconnected via InfiniBand (IB). The training process lasted around two months, including downtime due to unexpected factors.

### Software

Tele-FLM utilizes 3D parallel training, combining the prevailing methodologies: data parallelism, tensor parallelism, and pipeline parallelism.
The parallel training setup for Tele-FLM is configured as follows: tensor parallel = 32, pipeline parallel = 28, and data parallel = 1.
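
These figures are consistent with the hardware above: 32 × 28 × 1 = 896 GPUs, which is exactly 112 servers × 8 GPUs, and the 140 layers divide evenly into 5 layers per pipeline stage. A quick sanity check (the variable names are just an illustrative convention, not the launcher configuration):

```python
# Sanity check: 3D-parallel degrees vs. available GPUs (illustrative).
tensor_parallel = 32      # ranks that split each layer's weight matrices
pipeline_parallel = 28    # consecutive groups of layers; 140 / 28 = 5 layers per stage
data_parallel = 1         # full-model replicas

world_size = tensor_parallel * pipeline_parallel * data_parallel
gpus = 112 * 8            # 112 servers x 8 A800 GPUs each

assert world_size == gpus == 896
print(f"{data_parallel} replica(s), {tensor_parallel * pipeline_parallel} GPUs per replica")
```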

### Related Work

- [Tele-FLM (52B)](https://huggingface.co/CofeAI/Tele-FLM)
- [FLM-101B](https://huggingface.co/CofeAI/FLM-101B)

## Citation

If you find our work helpful, please consider citing it.

```bibtex
@misc{li202452b,
  title={52B to 1T: Lessons Learned via Tele-FLM Series},
  author={Xiang Li and Yiqun Yao and Xin Jiang and Xuezhi Fang and Chao Wang and Xinzhang Liu and Zihan Wang and Yu Zhao and Xin Wang and Yuyao Huang and others},
  year={2024},
  eprint={2407.02783},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{li2024teleflm,
  title={Tele-FLM Technical Report},
  author={Xiang Li and Yiqun Yao and Xin Jiang and Xuezhi Fang and Chao Wang and Xinzhang Liu and Zihan Wang and Yu Zhao and Xin Wang and Yuyao Huang and Shuangyong Song and Yongxiang Li and Zheng Zhang and Bo Zhao and Aixin Sun and Yequan Wang and Zhongjiang He and Zhongyuan Wang and Xuelong Li and Tiejun Huang},
  year={2024},
  eprint={2404.16645},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```