Commit 1abca30 (parent: 55c25fb) by gonzalez-agirre: Update README.md

README.md CHANGED
@@ -15,13 +15,17 @@ tags:

 - "gpt2-base-bne"

+datasets:
+
+- "bne"
+
 widget:
 - text: "El modelo del lenguaje GPT es capaz de"
 - text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"

 ---

-# GPT2-base (gpt2-base-bne) trained with data from National Library of Spain (BNE)
+# GPT2-base (gpt2-base-bne) trained with data from the National Library of Spain (BNE)

 ## Table of Contents
 <details>
@@ -48,7 +52,7 @@ widget:

 ## Overview

-- **Architecture:** gpt2-base
+- **Architecture:** gpt2-base
 - **Language:** Spanish
 - **Task:** text-generation
 - **Data:** BNE
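A minimal sketch of what those overview bullets correspond to in code, assuming the checkpoint is published under the hub id `PlanTL-GOB-ES/gpt2-base-bne` (an assumption; the id does not appear in this diff):

```python
# Sketch only: the hub id below is an assumption, not taken from this diff.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PlanTL-GOB-ES/gpt2-base-bne"  # assumed id for gpt2-base-bne

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# gpt2-base architecture: 12 layers, 12 heads, 768-dimensional hidden states.
print(model.config.n_layer, model.config.n_head, model.config.n_embd)

# One of the widget prompts from the YAML front matter above.
text = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Shape [1, seq_len, 768], consistent with the torch.Size([1, 14, 768]) shown in the card.
print(outputs.hidden_states[-1].shape)
```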
@@ -96,8 +100,7 @@ torch.Size([1, 14, 768])

 ## Limitations and bias

-
-unfiltered content from the internet, which is far from neutral. Here's an example of how the model can have biased predictions:
+At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated. Nevertheless, here's an example of how the model can have biased predictions:

 ```python
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
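The card's bias example is cut off in this hunk right after the imports. A sketch of the kind of probe that paragraph describes, comparing sampled continuations for prompts that differ only in gender, could look like the following; the hub id and the Spanish prompts are illustrative assumptions, not text from the card:

```python
# Sketch only: hub id and prompts are assumptions, not taken from this diff.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

model_id = "PlanTL-GOB-ES/gpt2-base-bne"  # assumed id for gpt2-base-bne
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

set_seed(42)  # make the sampled continuations reproducible

# Compare completions for prompts that differ only in the gendered subject.
for prompt in ("El hombre trabaja como", "La mujer trabaja como"):
    print(prompt)
    for out in generator(prompt, max_new_tokens=15, do_sample=True, num_return_sequences=3):
        print("  ", out["generated_text"])
```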
@@ -128,7 +131,7 @@ unfiltered content from the internet, which is far from neutral. Here's an examp

 The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

-To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among
+To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among others, sentence splitting, language detection, filtering of bad-formed sentences, and deduplication of repetitive contents. During the process, document boundaries are kept. This resulted in 2TB of Spanish clean corpus. Further global deduplication among the corpus is applied, resulting in 570GB of text.

 Some of the statistics of the corpus:

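The cleanup described in the hunk above is only summarized in prose. The toy sketch below shows the general shape of such a pipeline (sentence splitting, filtering of ill-formed sentences, corpus-wide deduplication while keeping document boundaries); it is not the pipeline used to build the BNE corpus, its heuristics and function names are invented for illustration, and language detection is omitted:

```python
# Toy illustration of a corpus-cleanup pipeline; NOT the actual BNE preprocessing.
import hashlib
import re

def split_sentences(document: str):
    # Naive split on sentence-final punctuation; a real pipeline would use a proper splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def looks_well_formed(sentence: str) -> bool:
    # Crude filters: minimum length, mostly alphabetic characters, sentence-final punctuation.
    if len(sentence) < 20 or sentence[-1] not in ".!?":
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return alpha / len(sentence) > 0.8

def clean_corpus(documents):
    seen = set()  # hashes of sentences already emitted, for global deduplication
    for doc in documents:  # iterate per document so document boundaries are kept
        kept = []
        for sent in split_sentences(doc):
            if not looks_well_formed(sent):
                continue
            digest = hashlib.sha1(sent.encode("utf-8")).hexdigest()
            if digest in seen:  # drop exact repeats anywhere in the corpus
                continue
            seen.add(digest)
            kept.append(sent)
        if kept:
            yield "\n".join(kept)

if __name__ == "__main__":
    docs = ["La Biblioteca Nacional de España conserva el patrimonio bibliográfico. "
            "La Biblioteca Nacional de España conserva el patrimonio bibliográfico."]
    for cleaned in clean_corpus(docs):
        print(cleaned)  # the duplicated sentence is emitted only once
```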