asier-gutierrez committed
Commit a1e93cb
Parent(s): 4010b9b
Update README.md
README.md CHANGED
@@ -30,7 +30,7 @@ Some of the statistics of the corpus:
 | BNE | 201,080,084 | 135,733,450,668 | 570GB |
 
 ## Tokenization and pre-training
-The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) used in the original [
+The training corpus has been tokenized using a byte version of the Byte-Pair Encoding (BPE) used in the original [GPT-2](http://www.persagen.com/files/misc/radford2019language.pdf) model, with a vocabulary size of 50,262 tokens. The GPT2-large-bne pre-training is an autoregressive language model training that follows the approach of GPT-2. Training lasted a total of 10 days on 32 computing nodes, each with 4 NVIDIA V100 GPUs of 16GB VRAM.
 
 ## Evaluation and results
 For evaluation details visit our [GitHub repository](https://github.com/PlanTL-SANIDAD/lm-spanish).
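
For readers of the updated README, the sketch below shows one plausible way to load the byte-level BPE tokenizer and the autoregressive model described in the diff using the Hugging Face `transformers` library. The hub id `PlanTL-GOB-ES/gpt2-large-bne` and the example prompt are assumptions (the diff does not name them); consult the model card for the published identifier.

```python
# Minimal sketch: loading the byte-level BPE tokenizer and the GPT-2-style
# autoregressive model described in the README. The hub id below is an
# assumption, not confirmed by this diff; check the model card for the
# real identifier.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PlanTL-GOB-ES/gpt2-large-bne"  # assumed hub id

# Byte-level BPE tokenizer; the README states a vocabulary of 50,262 tokens
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(len(tokenizer))  # expected to be close to 50,262

# Causal (autoregressive) language model, trained following GPT-2
model = AutoModelForCausalLM.from_pretrained(model_id)

# Autoregressive generation from a Spanish prompt (the prompt is illustrative)
inputs = tokenizer("El corpus de entrenamiento", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the tokenizer is byte-level, any input string can be encoded without unknown tokens, which matches the GPT-2 design the README cites.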