clincolnoz committed
Commit b984eee
Parent: aef08c5

updated README

Files changed (1):
  1. README.md (+29 -21)
README.md CHANGED
@@ -8,6 +8,8 @@ tags:
    - not-for-all-audiences
  ---

+ **WARNING: Some language produced by this model and README may offend. The model is intended to facilitate research on bias in AI.**
+
  # notSexistBERT base model (uncased)

  Re-pretrained model on English language using a Masked Language Modeling (MLM)
@@ -98,8 +100,14 @@ Here is how to use this model to get the features of a given text in PyTorch:

  ```python
  from transformers import BertTokenizer, BertModel
- tokenizer = BertTokenizer.from_pretrained('clincolnoz/notSexistBERT_temp')
- model = BertModel.from_pretrained("clincolnoz/notSexistBERT_temp")
+ tokenizer = BertTokenizer.from_pretrained(
+     'clincolnoz/notSexistBERT_temp',
+     revision='v0.34'  # tag name, or branch name, or commit hash
+ )
+ model = BertModel.from_pretrained(
+     'clincolnoz/notSexistBERT_temp',
+     revision='v0.34'  # tag name, or branch name, or commit hash
+ )
  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)
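A brief usage note on the change above: pinning `revision` to a tag or commit hash keeps results reproducible even if the repository's `main` branch moves. The token-level features are then available as `output.last_hidden_state`, a tensor of shape `(batch_size, sequence_length, hidden_size)` for `BertModel`.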
@@ -109,8 +117,15 @@ and in TensorFlow:

  ```python
  from transformers import BertTokenizer, TFBertModel
- tokenizer = BertTokenizer.from_pretrained('clincolnoz/notSexistBERT_temp')
- model = TFBertModel.from_pretrained("clincolnoz/notSexistBERT_temp", from_pt=True)
+ tokenizer = BertTokenizer.from_pretrained(
+     'clincolnoz/notSexistBERT_temp',
+     revision='v0.34'  # tag name, or branch name, or commit hash
+ )
+ model = TFBertModel.from_pretrained(
+     'clincolnoz/notSexistBERT_temp',
+     from_pt=True,
+     revision='v0.34'  # tag name, or branch name, or commit hash
+ )
  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
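Since the model was re-pretrained with an MLM objective, the masked-language-modeling head can also be queried directly. A minimal sketch, assuming the same repo and tag expose the MLM weights (`pipeline` is the stock `transformers` helper; the example sentence is illustrative):

```python
from transformers import pipeline

# fill-mask uses the MLM head to rank candidate tokens
# for the [MASK] position.
unmasker = pipeline(
    'fill-mask',
    model='clincolnoz/notSexistBERT_temp',
    revision='v0.34'  # pin the same tag as above
)
print(unmasker("The engineer fixed the [MASK]."))
```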
@@ -188,7 +203,7 @@ headers). -->
  For the NSP task the data were preprocessed by splitting documents into sentences, first to create a bag of sentences and then to create pairs of sentences, where Sentence B either corresponded to the consecutive sentence in the text or was randomly selected from the bag. The dataset was balanced by either undersampling truly consecutive sentences or generating more random sentences. The results were stored in a JSON file with keys `sentence1`, `sentence2` and `next_sentence_label`, with label mapping 0: consecutive sentence, 1: random sentence (a sketch of this construction follows the hunk below).

  The texts are lowercased and tokenized using WordPiece and a vocabulary size of
- 30,124. The inputs of the model are then of the form:
+ 30,646. The inputs of the model are then of the form:

  ```
  [CLS] Sentence A [SEP] Sentence B [SEP]
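The pair construction described in the context paragraph above is easy to reproduce. A minimal sketch under stated assumptions: `make_nsp_pairs`, the coin-flip labeling, and the output filename are illustrative, and the README's balancing (undersampling consecutive pairs or generating extra random ones) is simplified here to drawing one label per pair, which yields a roughly balanced set by construction:

```python
import json
import random

def make_nsp_pairs(documents, seed=42):
    """Build balanced NSP pairs from documents pre-split into sentences.

    Label mapping follows the README: 0 = consecutive, 1 = random.
    """
    rng = random.Random(seed)
    # Bag of all sentences, used to draw random "Sentence B" candidates.
    bag = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for sent_a, sent_b in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                # Random pairing: Sentence B drawn from the bag.
                pairs.append({'sentence1': sent_a,
                              'sentence2': rng.choice(bag),
                              'next_sentence_label': 1})
            else:
                # True pairing: Sentence B is the consecutive sentence.
                pairs.append({'sentence1': sent_a,
                              'sentence2': sent_b,
                              'next_sentence_label': 0})
    return pairs

docs = [["First sentence.", "It has a successor.", "And a third."]]
with open('nsp_pairs.json', 'w') as f:
    json.dump(make_nsp_pairs(docs), f)
```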
@@ -227,24 +242,17 @@ Glue test results:
  | :---: | :---------: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
  | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 | --> |

+ ### Framework versions
+
+ - Transformers 4.27.0.dev0
+ - PyTorch 1.13.1+cu117
+ - Datasets 2.9.0
+ - Tokenizers 0.13.2
+
  <!-- ### BibTeX entry and citation info -->

  <!-- ```bibtex
- @article{DBLP:journals/corr/abs-1810-04805,
-   author    = {Jacob Devlin and
-                Ming{-}Wei Chang and
-                Kenton Lee and
-                Kristina Toutanova},
-   title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
-                Understanding},
-   journal   = {CoRR},
-   volume    = {abs/1810.04805},
-   year      = {2018},
-   url       = {http://arxiv.org/abs/1810.04805},
-   archivePrefix = {arXiv},
-   eprint    = {1810.04805},
-   timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
-   biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
-   bibsource = {dblp computer science bibliography, https://dblp.org}
+ @article{
+
  }
  ``` -->
 