readme update

- README.md +124 -0
- README_zh.md +128 -0

README.md (CHANGED)

---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
---

<h4 align="center">
    <p>
        <b>English</b> |
        <a href="https://huggingface.co/CofeAI/FLM-101B/blob/main/README_zh.md">简体中文</a> |
    <p>
</h4>


# FLM-101B

FLM-101B is an open-source, decoder-only LLM with 101 billion parameters. A model growth technique was employed during training: the model rapidly acquires knowledge at a small scale (16B) in the early stages of training and is then gradually grown to 101B, resulting in cost-effective 100B-scale LLM training (approximately $100,000).
FLM-101B supports both Chinese and English. Its training context window length is 2048, and thanks to the [xPos](https://arxiv.org/pdf/2212.10554.pdf) rotary position embedding, the window size can be extended efficiently at inference time.

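For intuition, xPos can be read as rotary position embedding (RoPE) combined with a per-dimension exponential decay on relative distance. Schematically, for one rotation pair and omitting the blockwise scale handling described in the xPos paper:

$$
\mathrm{score}(n, m) \;\approx\; \xi^{\,n-m}\,\big\langle \mathrm{RoPE}(q, n),\, \mathrm{RoPE}(k, m) \big\rangle, \qquad 0 < \xi \le 1,
$$

so the attention logits depend on the positions $n$ and $m$ only through their distance, and far-away tokens are smoothly attenuated, which is what makes it practical to extend the 2048-token training window at inference time.
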
To advance the development of 100B-scale Large Language Models (LLMs), FLM-101B has now been fully open-sourced.


## Why use FLM-101B

- It's an open-source, 100B-scale Chinese-English bilingual model.
- It's the largest known language model trained with xPos.
- It's the largest known language model that successfully implements μP transfer and loss prediction.
- It's the largest known language model that successfully implements progressive learning with model growth.

## Quick Start with FLM-101B

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model; trust_remote_code=True loads the model's custom code from the repository.
tokenizer = AutoTokenizer.from_pretrained("CofeAI/FLM-101B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("CofeAI/FLM-101B", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto", trust_remote_code=True)

# Generate a continuation for an English prompt.
inputs = tokenizer('A house without books is like a body without a soul;', return_tensors='pt').to(model.device)
generated = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(generated.cpu()[0], skip_special_tokens=True))
```

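As a rough back-of-the-envelope estimate (an illustration, not an official requirement), the 101B parameters stored in bfloat16 take on the order of 200 GB, so `device_map="auto"` will typically shard the weights across several GPUs or offload part of them to CPU memory:

```python
# Rough weight-memory estimate for the bfloat16 checkpoint: 2 bytes per parameter.
n_params = 101e9
print(f"~{n_params * 2 / 1024**3:.0f} GiB of weights")  # ~188 GiB, excluding activations and KV cache
```
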
## Model Details

### Model Description

- **Model type:** Decoder-only language model
- **Language(s) (NLP):** zh, en
- **License:** apache-2.0

### Model size

| Hyperparameter  | Value  |
|-----------------|--------|
| n_parameters    | 101B   |
| n_layers        | 80     |
| n_heads         | 80     |
| d_model         | 10240  |
| vocab size      | 100256 |
| sequence length | 2048   |

### Model Architecture and Objective

- **[Extrapolatable Position Embedding (xPos)](https://arxiv.org/pdf/2212.10554.pdf)**
- **[Flash Attention (In Training)](https://arxiv.org/pdf/2205.14135.pdf)**
- **[Model Growth](https://arxiv.org/pdf/2305.02869.pdf)** (see the sketch below)
- **[Loss Prediction](https://arxiv.org/abs/2304.06875)**

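The **Model Growth** entry is the technique described in the introduction: train a small model first, then grow it into a larger one while approximately preserving its function, so the large model starts from the knowledge already acquired. The snippet below is only a minimal sketch of function-preserving width growth for a single linear layer; it is not FLM-101B's actual growth operator (see the MSG paper linked above), and the `grow_linear` helper is a name made up for this example.

```python
import torch
import torch.nn as nn

def grow_linear(layer: nn.Linear, new_in: int, new_out: int) -> nn.Linear:
    """Widen a linear layer; new rows/columns start at zero so old outputs are preserved."""
    grown = nn.Linear(new_in, new_out, bias=layer.bias is not None)
    with torch.no_grad():
        grown.weight.zero_()
        grown.weight[: layer.out_features, : layer.in_features] = layer.weight
        if layer.bias is not None:
            grown.bias.zero_()
            grown.bias[: layer.out_features] = layer.bias
    return grown

small = nn.Linear(16, 16)
large = grow_linear(small, new_in=32, new_out=32)
x = torch.randn(4, 16)
x_padded = torch.cat([x, torch.zeros(4, 16)], dim=-1)  # new input features start at zero
assert torch.allclose(small(x), large(x_padded)[:, :16])  # function preserved on the old width
```
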
### Training Details

#### Training Hyperparameters

| **Hyperparameter**    | **16B**  | **51B**  | **101B** |
|-----------------------|----------|----------|----------|
| Optimizer             | AdamW    | AdamW    | AdamW    |
| Precision             | bfloat16 | bfloat16 | bfloat16 |
| Weight decay          | 0.1      | 0.1      | 0.1      |
| Gradient clipping     | 1.0      | 1.0      | 1.0      |
| Learning rate         | 4e-4     | 3.4e-4   | 2e-4     |
| Batch size (M tokens) | 4.72     | 4.72     | 4.31     |
| Warmup (M samples)    | 4.61     | 0.23     | 0.23     |
| Time (days)           | 9.63     | 5.37     | 6.54     |
| Tokens (B)            | 245.37   | 39.64    | 26.54    |

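Summing the per-stage figures gives the overall training budget (a simple consistency check derived from the table above, not additional data):

```python
# Totals across the 16B, 51B, and 101B growth stages (values taken from the table above).
days = [9.63, 5.37, 6.54]
tokens_b = [245.37, 39.64, 26.54]
print(f"total wall-clock time: {sum(days):.2f} days")   # 21.54 days
print(f"total training tokens: {sum(tokens_b):.2f} B")  # 311.55 B
```
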
#### Parallel Strategies

| **Params (billion)** | **TP Size** | **PP Size** | **DP Size** | **Number of GPUs** | **Batch Size** | **TFLOP/s per GPU** | **GPU Utilization** |
|----------------------|-------------|-------------|-------------|--------------------|----------------|---------------------|---------------------|
| 16                   | 2           | 1           | 96          | 192                | 2304           | 162                 | 51.90%              |
| 51                   | 4           | 2           | 24          | 192                | 2304           | 160                 | 51.30%              |
| 101                  | 4           | 4           | 12          | 192                | 2160           | 165                 | 52.88%              |

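A quick sanity check on the parallel configuration (derived from the table, not new information): for every growth stage, the product TP × PP × DP equals the 192 GPUs of the cluster described under Hardware below.

```python
# TP * PP * DP should equal the total GPU count (24 DGX-A800 nodes * 8 GPUs = 192).
for params_b, tp, pp, dp in [(16, 2, 1, 96), (51, 4, 2, 24), (101, 4, 4, 12)]:
    assert tp * pp * dp == 192
    print(f"{params_b}B stage: {tp} x {pp} x {dp} = {tp * pp * dp} GPUs")
```
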
#### Hardware

FLM-101B was trained on a cluster of 24 DGX-A800 (8×80G) GPU servers in less than 26 days. Following the growth strategy, we sequentially trained the 16B, 51B, and 101B models on this cluster.

#### Software

FLM-101B was trained with Megatron-FLM (to be open-sourced soon), which is modified from Megatron-LM.
It uses 3D parallelism (DP + TP + PP) and a distributed optimizer.


## Bias, Risks, and Limitations

Although we have made extensive efforts to clean and filter the training corpus, its open nature means the model may still have been exposed to unsafe examples. Consequently, the model may still generate unexpected content, including but not limited to discrimination, bias, or offensive language. We strongly advise users not to spread any unsafe content generated by the model. The project developers cannot be held responsible for any consequences stemming from the dissemination of harmful information.


## Citation

## Contact

tshwangyequan at gmail.com

README_zh.md (ADDED)

---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
---

<h4 align="center">
    <p>
        <a href="https://huggingface.co/CofeAI/FLM-101B/blob/main/README.md">English</a> |
        <b>简体中文</b> |
    <p>
</h4>


# FLM-101B

FLM-101B is an open-source, decoder-only language model with 101B parameters. Training uses a model growth technique: the model rapidly acquires knowledge at a small scale in the early stages of training and is then gradually grown into the large model, enabling low-cost (~$100K) training of a 100B-scale model.
FLM-101B supports both Chinese and English. The training context window length is 2048, and thanks to the xPos rotary position embedding, the window size extrapolates well at inference time.
To advance 100B-scale LLM technology, FLM-101B has now been fully open-sourced.


## Why use FLM-101B

- An open-source 100B-scale Chinese-English bilingual model
- The largest known language model trained with xPos
- The largest known language model that successfully implements μP transfer and loss prediction
- The largest known language model that successfully implements progressive learning with model growth

## Quick Start with FLM-101B

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model; trust_remote_code=True loads the model's custom code from the repository.
tokenizer = AutoTokenizer.from_pretrained("CofeAI/FLM-101B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("CofeAI/FLM-101B", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto", trust_remote_code=True)

# Generate a continuation for a Chinese prompt.
inputs = tokenizer('一幢没有书的房子,犹如一个没有灵魂的身体;', return_tensors='pt').to(model.device)
generated = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(generated.cpu()[0], skip_special_tokens=True))
```


## Model Details

### Model Description

- **Model type:** Decoder-only language model
- **Language(s):** Chinese / English
- **License:** apache-2.0

### Model size

| Hyperparameter  | Value  |
|-----------------|--------|
| n_parameters    | 101B   |
| n_layers        | 80     |
| n_heads         | 80     |
| d_model         | 10240  |
| vocab size      | 100256 |
| sequence length | 2048   |

### Model Architecture

- **[Extrapolatable Position Embedding (xPos)](https://arxiv.org/pdf/2212.10554.pdf)**
- **[Flash Attention (In Training)](https://arxiv.org/pdf/2205.14135.pdf)**
- **[Model Growth](https://arxiv.org/pdf/2305.02869.pdf)**
- **[Loss Prediction](https://arxiv.org/abs/2304.06875)**

### Training Details

#### Training Hyperparameters

| **Hyperparameter**    | **16B**  | **51B**  | **101B** |
|-----------------------|----------|----------|----------|
| Optimizer             | AdamW    | AdamW    | AdamW    |
| Precision             | bfloat16 | bfloat16 | bfloat16 |
| Weight decay          | 0.1      | 0.1      | 0.1      |
| Gradient clipping     | 1.0      | 1.0      | 1.0      |
| Learning rate         | 4e-4     | 3.4e-4   | 2e-4     |
| Batch size (M tokens) | 4.72     | 4.72     | 4.31     |
| Warmup (M samples)    | 4.61     | 0.23     | 0.23     |
| Time (days)           | 9.63     | 5.37     | 6.54     |
| Tokens (B)            | 245.37   | 39.64    | 26.54    |

#### Parallel Strategies

| **Params (billion)** | **TP Size** | **PP Size** | **DP Size** | **Number of GPUs** | **Batch Size** | **TFLOP/s per GPU** | **GPU Utilization** |
|----------------------|-------------|-------------|-------------|--------------------|----------------|---------------------|---------------------|
| 16                   | 2           | 1           | 96          | 192                | 2304           | 162                 | 51.90%              |
| 51                   | 4           | 2           | 24          | 192                | 2304           | 160                 | 51.30%              |
| 101                  | 4           | 4           | 12          | 192                | 2160           | 165                 | 52.88%              |

#### Hardware

FLM-101B was trained on a cluster of 24 DGX-A800 (8×80G) GPU servers, taking just under 26 days in total. Following the model growth strategy, the 16B, 51B, and 101B models were trained and grown sequentially on this cluster.

#### Software

The training code for FLM-101B, Megatron-FLM, is modified from the Megatron-LM framework and will be open-sourced soon.
The framework supports 3D parallelism and a distributed optimizer.

## Bias, Risks, and Limitations

Although we have made every effort to clean and filter the training corpus, its open nature means the model may still have learned from some unsafe text. It may therefore generate unexpected content, including but not limited to discrimination, bias, or abusive language. We remind users not to spread any unsafe content the model may generate. The project developers accept no responsibility for any consequences arising from the dissemination of harmful information.

## Citation

## Contact

tshwangyequan at gmail.com