nilabhra committed on
Commit
127bec8
•
1 Parent(s): bd0d08d

Update README.md

Files changed (1)
  1. README.md +19 -21
README.md CHANGED
@@ -1,13 +1,13 @@
1
- # 🚀 Falcon-11B
2
 
3
- **Falcon-11B is an 11B-parameter causal decoder-only model built by [TII](https://www.tii.ae) and trained on over 5,000B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. The model is made available under the Apache 2.0 license.**
4
 
5
  *Paper coming soon 😊.*
6
 
7
 
8
  🤗 To get started with Falcon (inference, finetuning, quantization, etc.), we recommend reading [this great blog post from HF](https://huggingface.co/blog/falcon)!
9
 
10
- ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.** If you are looking for a version better suited to taking generic instructions in a chat format, we recommend taking a look at [Falcon-11B-Chat](https://huggingface.co/tiiuae/falcon-11B-chat).
11
 
12
  ```python
13
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -41,7 +41,7 @@ for seq in sequences:
41
 
42
  For fast inference with Falcon, check out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blog post](https://huggingface.co/blog/falcon).
43
 
44
- # Model Card for Falcon-11B
45
 
46
  ## Model Details
47
 
@@ -68,11 +68,11 @@ Production use without adequate assessment of risks and mitigation; any use case
68
 
69
  ## Bias, Risks, and Limitations
70
 
71
- Falcon-11B was trained mostly on English, but also on German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish data. It will not generalize appropriately to other languages. Furthermore, as it was trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
72
 
73
  ### Recommendations
74
 
75
- We recommend that users of Falcon-11B consider finetuning it for the specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
76
 
77
  ## How to Get Started with the Model
78
 
@@ -109,28 +109,26 @@ for seq in sequences:
109
 
110
  ### Training Data
111
 
112
- Falcon-11B was trained on over 5,000B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a high-quality filtered and deduplicated web dataset, which we enhanced with curated corpora. Training followed a four-stage strategy: the first three stages focused on increasing the context length, from 2048 to 4096 and finally to 8192 tokens, while the last stage aimed to further enhance performance using only high-quality data.
113
 
114
  Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (en, de, es, fr, it, pt, pl, nl, ro, sv, cs), high-quality technical data, code data, and conversational data extracted from public sources.
115
 
116
 
117
- | Technical | 2% | 20B | arXiv, PubMed, USPTO, etc. |
118
-
119
  The training stages were as follows:
120
 
121
- | **Stage** | **Context length** | ** Tokens** |
122
  |--------------|-----------------|-------------|
123
- | Stage 1 | 2048 | 4500B |
124
- | Stage 2 | 4096 | 250B |
125
- | Stage 3 | 8192 | 250B |
126
- | Stage 4 | 8192 | 240B |
127
 
128
 
129
  The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
130
 
131
  ### Training Procedure
132
 
133
- Falcon-11B was trained on 1024 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO.
134
 
135
  #### Training Hyperparameters
136
 
@@ -138,10 +136,10 @@ Falcon-11B was trained on 1024 A100 40GB GPUs, using a 3D parallelism strategy (
138
  |--------------------|------------|-------------------------------------------|
139
  | Precision | `bfloat16` | |
140
  | Optimizer | AdamW | |
141
- | Max learning rate | 3.7e-4 | Following a linear warm=up, then cosine decay to 1.89e-5 across 4500 B tokens. |
142
  | Weight decay | 1e-1 | |
143
  | Z-loss | 1e-4 | |
144
- | Batch size | Variable | Batch size was gradually increased duringthe training |
145
 
146
 
147
  #### Speeds, Sizes, Times
@@ -168,7 +166,7 @@ We thank the leaderboard team from HuggingFace for providing an official evaluat
168
 
169
  ### Model Architecture and Objective
170
 
171
- Falcon-11B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
172
 
173
  The architecture is broadly adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the following differences:
174
 
@@ -190,11 +188,11 @@ For multiquery, we are using an internal variant which uses independent key and
190
 
191
  #### Hardware
192
 
193
- Falcon-11B was trained on AWS SageMaker, using on average 1024 A100 40GB GPUs in 128 p4d instances.
194
 
195
  #### Software
196
 
197
- Falcon-11B was trained on a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention2, etc.).
198
 
199
  ## Citation
200
 
@@ -202,7 +200,7 @@ Falcon-11B was trained a custom distributed training codebase, Gigatron. It uses
202
 
203
  ## License
204
 
205
- Falcon2 11B is licensed under the TII Falcon License 2.0, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI.
206
 
207
  ## Contact
208
 
1
+ # 🚀 Falcon2-11B
2
 
3
+ **Falcon2-11B is an 11B-parameter causal decoder-only model built by [TII](https://www.tii.ae) and trained on over 5,000B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. The model is made available under the Apache 2.0 license.**
4
 
5
  *Paper coming soon 😊.*
6
 
7
 
8
  🤗 To get started with Falcon (inference, finetuning, quantization, etc.), we recommend reading [this great blog post from HF](https://huggingface.co/blog/falcon)!
9
 
10
+ ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.** If you are looking for a version better suited to taking generic instructions in a chat format, we recommend taking a look at [Falcon2-11B-Chat](https://huggingface.co/tiiuae/Falcon2-11B-chat).
11
 
12
  ```python
13
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
41
 
42
  For fast inference with Falcon, check out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blog post](https://huggingface.co/blog/falcon).
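
  As an illustration only, here is a minimal sketch of querying a running TGI endpoint from Python with the `huggingface_hub` client; it assumes a TGI server for `tiiuae/falcon-11B` is already up at `http://localhost:8080`, and the prompt and generation parameters are placeholder choices rather than values from this card.

  ```python
  # Minimal sketch: query an already-running Text Generation Inference (TGI) server.
  # The endpoint URL and generation parameters below are illustrative assumptions.
  from huggingface_hub import InferenceClient

  client = InferenceClient("http://localhost:8080")

  output = client.text_generation(
      "Write a short note about falcons in the desert.",
      max_new_tokens=64,   # arbitrary cap for the demo
      do_sample=True,
      temperature=0.7,
  )
  print(output)
  ```
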
43
 
44
+ # Model Card for Falcon2-11B
45
 
46
  ## Model Details
47
 
 
68
 
69
  ## Bias, Risks, and Limitations
70
 
71
+ Falcon2-11B was trained mostly on English, but also on German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish data. It will not generalize appropriately to other languages. Furthermore, as it was trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
72
 
73
  ### Recommendations
74
 
75
+ We recommend that users of Falcon2-11B consider finetuning it for the specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
76
 
77
  ## How to Get Started with the Model
78
 
 
109
 
110
  ### Training Data
111
 
112
+ Falcon2-11B was trained on over 5,000B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a high-quality filtered and deduplicated web dataset, which we enhanced with curated corpora. Training followed a four-stage strategy: the first three stages focused on increasing the context length, from 2048 to 4096 and finally to 8192 tokens, while the last stage aimed to further enhance performance using only high-quality data.
113
 
114
  Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (en, de, es, fr, it, pt, pl, nl, ro, sv, cs), high-quality technical data, code data, and conversational data extracted from public sources.
115
 
116
 
 
 
117
  The training stages were as follows:
118
 
119
+ | **Stage** | **Context length** | **Tokens** |
120
  |--------------|-----------------|-------------|
121
+ | Stage 1 | 2048 | 4500B |
122
+ | Stage 2 | 4096 | 250B |
123
+ | Stage 3 | 8192 | 250B |
124
+ | Stage 4 | 8192 | 500B |
125
 
126
 
127
  The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
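
  For concreteness, a minimal sketch of loading that tokenizer with `transformers` (the sample sentence is illustrative):

  ```python
  # Minimal sketch: load the Falcon-11B tokenizer and tokenize a sample string.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-11B")

  sample = "Falcon2-11B was trained on RefinedWeb enhanced with curated corpora."
  token_ids = tokenizer(sample)["input_ids"]
  print(len(token_ids), token_ids[:10])
  ```
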
128
 
129
  ### Training Procedure
130
 
131
+ Falcon2-11B was trained on 1024 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.
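
  As a quick sanity check of the layout above (a sketch, not taken from the card itself), the three parallelism degrees multiply out to the 1024 GPUs:

  ```python
  # Sketch: the 3D parallelism degrees described above account for all GPUs.
  tensor_parallel = 8     # TP: each layer's weights sharded across 8 GPUs
  pipeline_parallel = 1   # PP: a single pipeline stage
  data_parallel = 128     # DP: 128 replicas, with ZeRO sharding optimizer state

  assert tensor_parallel * pipeline_parallel * data_parallel == 1024  # A100 40GB GPUs
  ```
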
132
 
133
  #### Training Hyperparameters
134
 
 
136
  |--------------------|------------|-------------------------------------------|
137
  | Precision | `bfloat16` | |
138
  | Optimizer | AdamW | |
139
+ | Max learning rate | 3.7e-4 | Following a linear warm-up, then cosine decay to 1.89e-5 across 4500B tokens (schedule sketched below). |
140
  | Weight decay | 1e-1 | |
141
  | Z-loss | 1e-4 | |
142
+ | Batch size | Variable | Batch size was gradually increased during training |
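
  A minimal sketch of the learning-rate schedule described in the table: linear warm-up to 3.7e-4, then cosine decay to 1.89e-5 over 4500B tokens. The warm-up length is not stated in this card, so `WARMUP_TOKENS` below is a placeholder assumption, as is the choice to start the decay window after warm-up.

  ```python
  import math

  MAX_LR = 3.7e-4
  MIN_LR = 1.89e-5
  DECAY_TOKENS = 4500e9    # cosine decay spans ~4500B tokens (from the table)
  WARMUP_TOKENS = 10e9     # placeholder: the warm-up length is not given in the card

  def learning_rate(tokens_seen: float) -> float:
      """Linear warm-up to MAX_LR, then cosine decay to MIN_LR."""
      if tokens_seen < WARMUP_TOKENS:
          return MAX_LR * tokens_seen / WARMUP_TOKENS
      progress = min((tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS, 1.0)
      return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

  print(learning_rate(0), learning_rate(WARMUP_TOKENS), learning_rate(WARMUP_TOKENS + DECAY_TOKENS))
  ```
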
143
 
144
 
145
  #### Speeds, Sizes, Times
 
166
 
167
  ### Model Architecture and Objective
168
 
169
+ Falcon2-11B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
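
  To make the objective concrete, here is a minimal sketch of the next-token prediction loss using the `transformers` API; `AutoModelForCausalLM` shifts the labels internally so each position predicts the following token. The sample sentence, dtype, and `device_map` choice are illustrative.

  ```python
  # Minimal sketch of the causal language modeling objective:
  # the model learns to predict token t+1 from tokens <= t.
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_id = "tiiuae/falcon-11B"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )

  batch = tokenizer("Falcons soar over the desert at dawn.", return_tensors="pt").to(model.device)
  # Passing the inputs as labels yields the standard next-token cross-entropy loss.
  outputs = model(**batch, labels=batch["input_ids"])
  print(outputs.loss.item())
  ```
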
170
 
171
  The architecture is broadly adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), with the following differences:
172
 
 
188
 
189
  #### Hardware
190
 
191
+ Falcon2-11B was trained on AWS SageMaker, using on average 1024 A100 40GB GPUs in 128 p4d instances.
192
 
193
  #### Software
194
 
195
+ Falcon2-11B was trained on a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention2, etc.).
196
 
197
  ## Citation
198
 
 
200
 
201
  ## License
202
 
203
+ Falcon2-11B is licensed under the TII Falcon License 2.0, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI.
204
 
205
  ## Contact
206