Commit f16a3c9 by lysandre (parent: fc67c89)

Update code samples & dimensions

Files changed: README.md (+259, -0)
---
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# BERT large model (cased)

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1810.04805) and first released in
[this repository](https://github.com/google-research/bert). This model is cased: it makes a difference
between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes
they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
predict whether the two sentences were following each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the BERT model as inputs.

This model has the following configuration:

- 24 layers
- a hidden size of 1024
- 16 attention heads
- 336M parameters

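These dimensions can be read directly from the published configuration. The following is a small sketch (not part of the original card) that checks them with `BertConfig` from `transformers`:

```python
from transformers import BertConfig

# Load the published configuration of bert-large-cased and print the
# dimensions listed above.
config = BertConfig.from_pretrained('bert-large-cased')
print(config.num_hidden_layers)    # 24 transformer layers
print(config.hidden_size)          # hidden size of 1024
print(config.num_attention_heads)  # 16 attention heads per layer
```
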
## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.

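As a hedged sketch of that fine-tuning starting point (not shown in the original card), the pretrained encoder can be loaded with a freshly initialized classification head, for example with `BertForSequenceClassification`:

```python
from transformers import BertTokenizer, BertForSequenceClassification

# The encoder weights come from the pretrained checkpoint; the classification
# head on top is randomly initialized and has to be trained on the downstream task.
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertForSequenceClassification.from_pretrained('bert-large-cased', num_labels=2)
```
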
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-large-cased')
>>> unmasker("Hello I'm a [MASK] model.")
[
  {
    "sequence":"[CLS] Hello I'm a male model. [SEP]",
    "score":0.22748498618602753,
    "token":2581,
    "token_str":"male"
  },
  {
    "sequence":"[CLS] Hello I'm a fashion model. [SEP]",
    "score":0.09146175533533096,
    "token":4633,
    "token_str":"fashion"
  },
  {
    "sequence":"[CLS] Hello I'm a new model. [SEP]",
    "score":0.05823173746466637,
    "token":1207,
    "token_str":"new"
  },
  {
    "sequence":"[CLS] Hello I'm a super model. [SEP]",
    "score":0.04488750174641609,
    "token":7688,
    "token_str":"super"
  },
  {
    "sequence":"[CLS] Hello I'm a famous model. [SEP]",
    "score":0.03271442651748657,
    "token":2505,
    "token_str":"famous"
  }
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertModel.from_pretrained('bert-large-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')  # PyTorch tensors
output = model(**encoded_input)  # last_hidden_state and pooler_output
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = TFBertModel.from_pretrained('bert-large-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')  # TensorFlow tensors
output = model(encoded_input)
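
The raw model also exposes the next sentence prediction head mentioned above. Here is a minimal sketch (not part of the original card) using `BertForNextSentencePrediction`; the example sentences are made up:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertForNextSentencePrediction.from_pretrained('bert-large-cased')

# Hypothetical sentence pair; the second sentence may or may not follow the first.
prompt = "The sky was clear all afternoon."
candidate = "We decided to go for a walk."

encoding = tokenizer(prompt, candidate, return_tensors='pt')
with torch.no_grad():
    logits = model(**encoding).logits

# Index 0 scores "candidate follows prompt", index 1 scores "candidate is random".
print(torch.softmax(logits, dim=-1))
```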

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-large-cased')
>>> unmasker("The man worked as a [MASK].")
[
  {
    "sequence":"[CLS] The man worked as a doctor. [SEP]",
    "score":0.0645911768078804,
    "token":3995,
    "token_str":"doctor"
  },
  {
    "sequence":"[CLS] The man worked as a cop. [SEP]",
    "score":0.057450827211141586,
    "token":9947,
    "token_str":"cop"
  },
  {
    "sequence":"[CLS] The man worked as a mechanic. [SEP]",
    "score":0.04392256215214729,
    "token":19459,
    "token_str":"mechanic"
  },
  {
    "sequence":"[CLS] The man worked as a waiter. [SEP]",
    "score":0.03755280375480652,
    "token":17989,
    "token_str":"waiter"
  },
  {
    "sequence":"[CLS] The man worked as a teacher. [SEP]",
    "score":0.03458863124251366,
    "token":3218,
    "token_str":"teacher"
  }
]

>>> unmasker("The woman worked as a [MASK].")
[
  {
    "sequence":"[CLS] The woman worked as a nurse. [SEP]",
    "score":0.2572779953479767,
    "token":7439,
    "token_str":"nurse"
  },
  {
    "sequence":"[CLS] The woman worked as a waitress. [SEP]",
    "score":0.16706500947475433,
    "token":15098,
    "token_str":"waitress"
  },
  {
    "sequence":"[CLS] The woman worked as a teacher. [SEP]",
    "score":0.04587847739458084,
    "token":3218,
    "token_str":"teacher"
  },
  {
    "sequence":"[CLS] The woman worked as a secretary. [SEP]",
    "score":0.03577028587460518,
    "token":4848,
    "token_str":"secretary"
  },
  {
    "sequence":"[CLS] The woman worked as a maid. [SEP]",
    "score":0.03298963978886604,
    "token":13487,
    "token_str":"maid"
  }
]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
the other cases, sentence B is another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.

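As a small, hedged illustration of this format (not in the original card), passing two pieces of text to the tokenizer reproduces the `[CLS] A [SEP] B [SEP]` layout, with `token_type_ids` marking the two segments; the sentences below are made up:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')

# Encode a sentence pair; [CLS] and [SEP] are added automatically.
encoding = tokenizer("My dog is cute.", "He likes playing fetch.")
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
print(encoding['token_type_ids'])  # 0 for sentence A tokens, 1 for sentence B tokens
```
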
The details of the masking procedure for each sentence are the following (a toy sketch of this rule is shown after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

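The toy sketch below (a plain-Python illustration, not the original preprocessing code) shows one way the 15% selection and 80/10/10 replacement rule can be implemented:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy illustration of the 15% selection and 80/10/10 replacement rule."""
    masked, labels = [], []
    for token in tokens:
        if token not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            labels.append(token)          # the model must predict the original token
            roll = random.random()
            if roll < 0.8:
                masked.append("[MASK]")   # 80%: replace with [MASK]
            elif roll < 0.9:
                # 10%: replace with a random token different from the original
                masked.append(random.choice([t for t in vocab if t != token]))
            else:
                masked.append(token)      # 10%: keep the original token
        else:
            masked.append(token)
            labels.append(None)           # not selected: no prediction target
    return masked, labels
```
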
### Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.

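A hedged PyTorch approximation of this optimization setup (the original pretraining used TensorFlow on TPUs; `BertForPreTraining` and `get_linear_schedule_with_warmup` below are `transformers` utilities, not the original code) could look like this:

```python
import torch
from transformers import BertConfig, BertForPreTraining, get_linear_schedule_with_warmup

# Fresh model with the bert-large-cased architecture (pretraining starts from random weights).
config = BertConfig.from_pretrained('bert-large-cased')
model = BertForPreTraining(config)

# Adam with weight decay 0.01, peak learning rate 1e-4, betas (0.9, 0.999),
# 10,000 warmup steps and linear decay over the 1,000,000 training steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000
)

# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```
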
## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

Model | SQuAD 1.1 F1/EM | MultiNLI accuracy
---------------------------------------- | :-------------: | :----------------:
BERT-Large, Cased (Original) | 91.5/84.8 | 86.09

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```