clincolnoz committed
Commit b984eee
1 Parent(s): aef08c5
updated README

README.md CHANGED
@@ -8,6 +8,8 @@ tags:
 - not-for-all-audiences
 ---
 
+**WARNING: Some language produced by this model and README may offend. The model intent is to facilitate bias in AI research**
+
 # notSexistBERT base model (uncased)
 
 Re-pretrained model on English language using a Masked Language Modeling (MLM)
@@ -98,8 +100,14 @@ Here is how to use this model to get the features of a given text in PyTorch:
 
 ```python
 from transformers import BertTokenizer, BertModel
-tokenizer = BertTokenizer.from_pretrained(
-
+tokenizer = BertTokenizer.from_pretrained(
+    'clincolnoz/notSexistBERT_temp',
+    revision='v0.34'  # tag name, or branch name, or commit hash
+)
+model = BertModel.from_pretrained(
+    'clincolnoz/notSexistBERT_temp',
+    revision='v0.34'  # tag name, or branch name, or commit hash
+)
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='pt')
 output = model(**encoded_input)
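For reference, `output` here is a standard `transformers` model output object: the token-level features are in `output.last_hidden_state` and a pooled sentence-level vector in `output.pooler_output`. A minimal sketch of reading them out (the variable names below are illustrative, not from the model card):

```python
# Token-level features: shape (batch_size, sequence_length, hidden_size)
token_embeddings = output.last_hidden_state

# Pooled [CLS]-based sentence vector: shape (batch_size, hidden_size)
sentence_embedding = output.pooler_output

print(token_embeddings.shape, sentence_embedding.shape)
```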
@@ -109,8 +117,15 @@ and in TensorFlow:
 
 ```python
 from transformers import BertTokenizer, TFBertModel
-tokenizer = BertTokenizer.from_pretrained(
-
+tokenizer = BertTokenizer.from_pretrained(
+    'clincolnoz/notSexistBERT_temp',
+    revision='v0.34'  # tag name, or branch name, or commit hash
+)
+model = TFBertModel.from_pretrained(
+    'clincolnoz/notSexistBERT_temp',
+    from_pt=True,
+    revision='v0.34'  # tag name, or branch name, or commit hash
+)
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='tf')
 output = model(encoded_input)
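In the TensorFlow example, `from_pt=True` asks `TFBertModel.from_pretrained` to convert the PyTorch checkpoint on the fly; this is the usual workaround when a repository publishes only PyTorch weights, which appears to be the case here.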
@@ -188,7 +203,7 @@ headers). -->
 For the NSP task the data were preprocessed by splitting documents into sentences to create first a bag of sentences and then to create pairs of sentences, where Sentence B either corresponded to a consecutive sentence in the text or randomly select from the bag. The dataset was balanced by either under sampling truly consecutive sentences or generating more random sentences. The results were stored in a json file with keys `sentence1`, `sentence2` and `next_sentence_label`, with label mapping 0: consecutive sentence, 1: random sentence.
 
 The texts are lowercased and tokenized using WordPiece and a vocabulary size of
-30,
+30,646. The inputs of the model are then of the form:
 
 ```
 [CLS] Sentence A [SEP] Sentence B [SEP]
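To make the pairing procedure above concrete, here is a minimal sketch of how such NSP pairs could be generated. It is not the authors' preprocessing code: sentence splitting is assumed to have happened already, the file name `nsp_pairs.json` and function name are made up, and the explicit re-balancing by under/over-sampling is simplified to drawing a random Sentence B with probability 0.5.

```python
import json
import random

def make_nsp_pairs(documents, random_fraction=0.5, seed=0):
    """Build (sentence1, sentence2, next_sentence_label) pairs as described
    above: label 0 = truly consecutive sentences, 1 = a random sentence drawn
    from the bag of all sentences. Illustrative sketch, not the original code."""
    rng = random.Random(seed)
    bag = [s for doc in documents for s in doc]  # bag of all sentences
    pairs = []
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            if rng.random() < random_fraction:
                # Sentence B drawn at random from the bag
                pairs.append({"sentence1": a,
                              "sentence2": rng.choice(bag),
                              "next_sentence_label": 1})
            else:
                # Sentence B is the true next sentence
                pairs.append({"sentence1": a,
                              "sentence2": b,
                              "next_sentence_label": 0})
    return pairs

# documents: list of documents, each already split into sentences
documents = [["Sentence one.", "Sentence two.", "Sentence three."]]
with open("nsp_pairs.json", "w") as f:
    json.dump(make_nsp_pairs(documents), f)
```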
@@ -227,24 +242,17 @@ Glue test results:
 | :---: | :---------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
 | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 | --> |
 
+### Framework versions
+
+- Transformers 4.27.0.dev0
+- Pytorch 1.13.1+cu117
+- Datasets 2.9.0
+- Tokenizers 0.13.2
+
 <!-- ### BibTeX entry and citation info -->
 
 <!-- ```bibtex
-@article{
-
-Ming{-}Wei Chang and
-Kenton Lee and
-Kristina Toutanova},
-title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
-Understanding},
-journal = {CoRR},
-volume = {abs/1810.04805},
-year = {2018},
-url = {http://arxiv.org/abs/1810.04805},
-archivePrefix = {arXiv},
-eprint = {1810.04805},
-timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
-biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
-bibsource = {dblp computer science bibliography, https://dblp.org}
+@article{
+
 }
 ``` -->