Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section. Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre-train word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence, or denoising autoencoder derived objectives (Hill et al., 2016).

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008). More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a).
Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015). There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different downstream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section. A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Model Architecture. BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. [1] Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as "The Annotated Transformer." [2]

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. [3] We primarily report results on two model sizes: BERT_BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT_LARGE (L=24, H=1024, A=16, Total Parameters=340M). BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. [4]
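The contrast between the two attention patterns can be made concrete with a short sketch. This is our own illustration in PyTorch, not the authors' released TensorFlow code; the function names and the use of torch.nn.TransformerEncoder are assumptions made for exposition.

import torch

# Map the (L, H, A) notation above onto a standard Transformer encoder stack.
configs = {
    "BERT_BASE":  dict(L=12, H=768,  A=12),   # ~110M total parameters
    "BERT_LARGE": dict(L=24, H=1024, A=16),   # ~340M total parameters
}

def make_encoder(L, H, A):
    # Feed-forward/filter size is 4H (see footnote [3]).
    layer = torch.nn.TransformerEncoderLayer(
        d_model=H, nhead=A, dim_feedforward=4 * H,
        activation="gelu", batch_first=True)
    return torch.nn.TransformerEncoder(layer, num_layers=L)

def causal_mask(seq_len):
    # GPT-style constrained self-attention: True marks positions that may not
    # be attended to, so each token sees only itself and tokens to its left.
    # BERT simply omits this mask, so every token attends in both directions.
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

Passing the output of causal_mask as the mask argument of the encoder reproduces the left-to-right restriction; leaving it out gives the bidirectional behavior used by BERT.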
Input/Output Representations. To make BERT handle a variety of downstream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

Task #1: Masked LM. Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time.
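The selection logic can be sketched as follows. This is a simplified, self-contained illustration of the corruption scheme just described, not the released data-generation code; the toy vocabulary and function name are ours.

import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "dog", "barked", "happily", "cute"]   # stand-in vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    # Choose 15% of positions for prediction; replace with [MASK] 80% of the
    # time, with a random token 10% of the time, and keep the original token
    # 10% of the time. Returns the corrupted sequence and the targets.
    corrupted, targets = list(tokens), {}
    num_to_mask = max(1, int(mask_prob * len(tokens)))
    for i in random.sample(range(len(tokens)), num_to_mask):
        targets[i] = tokens[i]                       # original token is the label
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK                      # 80%: [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(TOY_VOCAB)  # 10%: random token
        # else: 10% leave the token unchanged
    return corrupted, targets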
Then, T_i will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

Task #2: Next Sentence Prediction (NSP). Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, C is used for next sentence prediction (NSP). [5] Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI. [6]

Figure 2: BERT input representation for the example pair "my dog is cute" / "he likes play ##ing". The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to downstream tasks, whereas BERT transfers all parameters to initialize end-task model parameters.

Pre-training data. The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve single text or text pairs, by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging.
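The input-side packing just described can be sketched in a few lines. This is our own illustration, assuming the tokens are already WordPiece-tokenized; the function name is hypothetical and the scheme mirrors the [CLS]/[SEP]/segment-embedding description above rather than the released preprocessing code.

def pack_pair(tokens_a, tokens_b=None):
    # Pack one or two "sentences" into a single BERT input sequence together
    # with segment ids (0 = sentence A embedding, 1 = sentence B embedding).
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# e.g. a question-passage pair for QA, or a single sentence for tagging:
pack_pair(["who", "wrote", "it", "?"], ["the", "author", "wrote", "it", "."])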
At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. [7] We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

In this section, we present BERT fine-tuning results on 11 NLP tasks.

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1. To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(C W^T)).

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The "Average" column is slightly different from the official GLUE score, since we exclude the problematic WNLI set. [8] BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERT_LARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. [9]

Results are presented in Table 1. Both BERT_BASE and BERT_LARGE outperform all systems on all tasks by a substantial margin, obtaining 4.5% and 7.0% respective average accuracy improvement over the prior state of the art. Note that BERT_BASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard, [10] BERT_LARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing. We find that BERT_LARGE significantly outperforms BERT_BASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.
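The GLUE fine-tuning head described above amounts to a single matrix of new parameters and a cross-entropy loss. A minimal PyTorch sketch (our own, with hypothetical names; not the released code):

import torch

def glue_classification_loss(C, W, labels):
    # C: [batch, H] final hidden vector of [CLS]; W: [K, H] classification
    # weights, the only new parameters introduced during fine-tuning;
    # labels: [batch] gold label ids. cross_entropy computes -log softmax(.)
    # of the correct class, i.e. the log(softmax(C W^T)) loss above.
    logits = C @ W.t()            # shape [batch, K]
    return torch.nn.functional.cross_entropy(logits, labels)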
The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector S ∈ R^H and an end vector E ∈ R^H during fine-tuning. The probability of word i being the start of the answer span is computed as a dot product between T_i and S followed by a softmax over all of the words in the paragraph: P_i = e^(S·T_i) / Σ_j e^(S·T_j). The analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum scoring span where j ≥ i is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, [11] and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin. [12]

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic. We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, ŝ_{i,j} = max_{j≥i} (S·T_i + E·T_j). We predict a non-null answer when ŝ_{i,j} > s_null + τ, where the threshold τ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

The results compared to prior leaderboard entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a +5.1 F1 improvement over the previous best system.
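The span scoring used for SQuAD can be sketched directly from the formulas above. This is our own illustration (hypothetical names; the max_answer_len cap is an assumption added for efficiency), not the released code.

import torch

def best_span(T, S, E, max_answer_len=30):
    # T: [seq_len, H] token representations; S, E: [H] start/end vectors.
    # Scores every span (i, j) with j >= i as S·T_i + E·T_j and returns the
    # highest-scoring one.
    start_scores = T @ S                       # [seq_len]
    end_scores = T @ E                         # [seq_len]
    best_score, best_i, best_j = float("-inf"), 0, 0
    for i in range(len(T)):
        for j in range(i, min(i + max_answer_len, len(T))):
            score = float(start_scores[i] + end_scores[j])
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    return best_score, best_i, best_j

# For SQuAD 2.0, the best non-null score would additionally be compared with
# the null score s_null = S·C + E·C, and "no answer" predicted unless it
# exceeds s_null + tau, with tau tuned on the dev set.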
The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice, which is normalized with a softmax layer. We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERT_LARGE outperforms the authors' baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional ablation studies can be found in Appendix C.

Table 5: Ablation over the pre-training tasks using the BERT_BASE architecture. "No NSP" is trained without the next sentence prediction task. "LTR & No NSP" is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. "+ BiLSTM" adds a randomly initialized BiLSTM on top of the "LTR + No NSP" model during fine-tuning.

We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERT_BASE:

No NSP: A bidirectional model which is trained using the "masked LM" (MLM) but without the "next sentence prediction" (NSP) task.

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing "No NSP" to "LTR & No NSP". The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context. In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; and (c) it is strictly less powerful than a deep bidirectional model, since it can use both left and right context at every layer.
In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERT_BASE contains 110M parameters and BERT_LARGE contains 340M parameters.

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. "LM (ppl)" is the masked LM perplexity of held-out training data.

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers, and Melamud et al. (2016) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a feature-based approach; we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.
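The parameter counts quoted in this section follow from the (L, H, A) configurations. A back-of-the-envelope estimate (our own sketch; it ignores biases and LayerNorm parameters and assumes a roughly 30k WordPiece vocabulary):

def approx_params(L, H, vocab=30000, max_positions=512, segments=2):
    embeddings = (vocab + max_positions + segments) * H   # token + position + segment
    attention = 4 * H * H                                 # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                        # two linear layers, filter size 4H
    return embeddings + L * (attention + feed_forward)

print(approx_params(L=12, H=768) / 1e6)    # ~108M, roughly BERT_BASE's 110M
print(approx_params(L=24, H=1024) / 1e6)   # ~333M, roughly BERT_LARGE's 340M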
All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

Table 7: CoNLL-2003 Named Entity Recognition results (Dev F1 / Test F1). Recovered baseline rows include ELMo (Peters et al., 2018a) at 95.7 / 92.2, CVT at - / 92.6, and CSE (Akbik et al., 2018); the remaining rows of the table were not recovered.

Results are presented in Table 7. BERT_LARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

Masked LM and the Masking Procedure. Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by:

• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]

• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple

• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability. In Section C.2, we evaluate the impact of this procedure.

Compared to standard language model training, the masked LM only makes predictions on 15% of tokens in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section C.1 we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.

Next Sentence Prediction. The next sentence prediction task can be illustrated as follows. To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the "next sentence prediction" task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.
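A simplified sketch of this sampling scheme (our own illustration, not the released data-generation code; each document is assumed to be a list of at least two text spans):

import random

def make_nsp_example(docs):
    # docs: list of documents, each a list of "sentences" (text spans).
    # 50% of the time B is the span that actually follows A (IsNext),
    # otherwise B is drawn at random from the corpus (NotNext).
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"
    else:
        sentence_b, label = random.choice(random.choice(docs)), "NotNext"
    return sentence_a, sentence_b, label

In practice the pair would then be packed with [CLS]/[SEP] and truncated so the combined length stays within 512 tokens, as described above.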
We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). [13] Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pre-training in our experiments, we pre-train the model with a sequence length of 128 for 90% of the steps. Then, we train the remaining 10% of the steps with a sequence length of 512 to learn the positional embeddings.

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5

• Number of epochs: 2, 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
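Such an exhaustive search is small enough to write out directly. A sketch (our own; train_and_eval stands for a user-supplied routine that fine-tunes once and returns the Dev-set score):

import itertools

GRID = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [2, 3, 4],
}

def exhaustive_search(train_and_eval):
    best_score, best_cfg = float("-inf"), None
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        score = train_and_eval(**cfg)        # one fine-tuning run per configuration
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score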
Here we study the differences in recent popular representation learning models, including ELMo, OpenAI GPT and BERT. The comparisons between the model architectures are shown visually in Figure 3. Note that in addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the bi-directionality and the two pre-training tasks presented in Section 3.1 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable.

The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch.

Figure 4: Illustrations of fine-tuning BERT on different tasks: sentence-pair classification (class label predicted from C), single-sentence classification, question answering (question/paragraph input with a start/end span output), and single-sentence tagging (e.g., B-PER, O).

Our GLUE results in Table 1 are obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised. The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018a):

MNLI: Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

QQP: Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent.

QNLI: Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018a). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.

SST-2: The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

CoLA: The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically "acceptable" or not (Warstadt et al., 2018).

STS-B: The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

MRPC: Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

RTE: Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009). [14]

WNLI: Winograd NLI is a small natural language inference dataset (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, [15] and every trained system that's been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set to be fair to OpenAI GPT. For our GLUE submission, we always predicted the majority class.
C.1 Effect of Number of Training Steps

Figure 5 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions:

1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? Answer: Yes, BERT_BASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.

2. Question: Does MLM pre-training converge slower than LTR pre-training, since only 15% of words are predicted in each batch rather than every word? Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.

C.2 Ablation for Different Masking Procedures

In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model (MLM) objective. The following is an ablation study to evaluate the effect of different masking strategies.

Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol never appears during the fine-tuning stage. We report the Dev results for both MNLI and NER. For NER, we report both fine-tuning and feature-based approaches, as we expect the mismatch will be amplified for the feature-based approach, since the model will not have the chance to adjust the representations.

The results are presented in Table 8. In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token. The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training (BERT uses 80%, 10%, 10%). The right part of the table represents the Dev set results. For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.

From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the feature-based approach to NER. Interestingly, using only the RND strategy performs much worse than our strategy as well.
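The feature-based variant referred to here, concatenating the last four hidden layers, is a one-line operation. A sketch (our own, assuming the per-layer activations have already been computed and collected bottom-to-top):

import torch

def concat_last_four_layers(all_hidden_states):
    # all_hidden_states: list of [seq_len, H] tensors, one per Transformer
    # layer. Returns [seq_len, 4H] features that would feed the randomly
    # initialized two-layer 768-dimensional BiLSTM and classifier described
    # in Section 5.3.
    return torch.cat(all_hidden_states[-4:], dim=-1)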
Footnotes:

1. https://github.com/tensorflow/tensor2tensor
2. http://nlp.seas.harvard.edu/2018/04/03/attention.html
3. In all cases we set the feed-forward/filter size to be 4H, i.e., 3072 for H = 768 and 4096 for H = 1024.
4. In the literature the bidirectional Transformer is often referred to as a "Transformer encoder" while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation.
5. The final model achieves 97%-98% accuracy on NSP.
6. The vector C is not a meaningful sentence representation without fine-tuning, since it was trained with NSP.
7. For example, the BERT SQuAD model can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%.
8. See (10) in https://gluebenchmark.com/faq.
9. The GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERT_BASE and BERT_LARGE.
10. https://gluebenchmark.com/leaderboard
11. QANet is described in Yu et al. (2018), but the system has improved substantially after publication.
12. The TriviaQA data we used consists of paragraphs from TriviaQA-Wiki formed of the first 400 tokens in documents that contain at least one of the provided possible answers.
13. https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-globalavailability.html
14. Note that we only report single-task fine-tuning results in this paper. A multitask fine-tuning approach could potentially push the performance even further. For example, we did observe substantial improvements on RTE from multitask training with MNLI.
15. https://gluebenchmark.com/faq