timpal0l committed
Commit 81e3fd2
Parent(s): 53b86b7

Update README.md

---
language:
- sv
- 'no'
- da
- en
license: mit
tags:
- bert
- roberta
pipeline_tag: fill-mask
widget:
- text: Huvudstaden i Sverige är <mask>.
  example_title: Swedish
- text: Hovedstaden i Norge er <mask>.
  example_title: Norwegian
- text: Danmarks hovedstad er <mask>.
  example_title: Danish
---

# roberta-large-1160k

## Intended uses

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k')
>>> unmasker("Huvudstaden i Sverige är <mask>.")
[{'score': 0.5841221213340759,
  'token': 1945,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är Stockholm.'},
 {'score': 0.06775698810815811,
  'token': 5007,
  'token_str': ' Göteborg',
  'sequence': 'Huvudstaden i Sverige är Göteborg.'},
 {'score': 0.05057400465011597,
  'token': 5761,
  'token_str': ' Malmö',
  'sequence': 'Huvudstaden i Sverige är Malmö.'},
 {'score': 0.021936343982815742,
  'token': 21449,
  'token_str': ' Norrköping',
  'sequence': 'Huvudstaden i Sverige är Norrköping.'},
 {'score': 0.017798304557800293,
  'token': 5658,
  'token_str': ' Uppsala',
  'sequence': 'Huvudstaden i Sverige är Uppsala.'}]
```
```python
>>> unmasker("Hovedstaden i Norge er <mask>.")
[{'score': 0.6792309284210205,
  'token': 5158,
  'token_str': ' Oslo',
  'sequence': 'Hovedstaden i Norge er Oslo.'},
 {'score': 0.09379775077104568,
  'token': 15456,
  'token_str': ' Trondheim',
  'sequence': 'Hovedstaden i Norge er Trondheim.'},
 {'score': 0.052535850554704666,
  'token': 11370,
  'token_str': ' Bergen',
  'sequence': 'Hovedstaden i Norge er Bergen.'},
 {'score': 0.03465486690402031,
  'token': 29407,
  'token_str': ' hovedstaden',
  'sequence': 'Hovedstaden i Norge er hovedstaden.'},
 {'score': 0.03017985075712204,
  'token': 33311,
  'token_str': ' Kristiansand',
  'sequence': 'Hovedstaden i Norge er Kristiansand.'}]
```
```python
>>> unmasker("Danmarks hovedstad er <mask>.")
[{'score': 0.11624140292406082,
  'token': 4794,
  'token_str': ' København',
  'sequence': 'Danmarks hovedstad er København.'},
 {'score': 0.045051511377096176,
  'token': 7680,
  'token_str': ' død',
  'sequence': 'Danmarks hovedstad er død.'},
 {'score': 0.02936543896794319,
  'token': 10795,
  'token_str': ' lukket',
  'sequence': 'Danmarks hovedstad er lukket.'},
 {'score': 0.026030730456113815,
  'token': 13580,
  'token_str': ' Odense',
  'sequence': 'Danmarks hovedstad er Odense.'},
 {'score': 0.02130937948822975,
  'token': 16347,
  'token_str': ' Roskilde',
  'sequence': 'Danmarks hovedstad er Roskilde.'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

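As noted under *Intended uses*, the model is mostly meant to be fine-tuned. Below is a minimal fine-tuning sketch for a downstream text classification task; the dataset files, number of labels and hyperparameters are illustrative assumptions and not part of the original model card:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = 'AI-Sweden-Models/roberta-large-1160k'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 assumes a binary classification task
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder CSV files with "text" and "label" columns
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'validation': 'valid.csv'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir='roberta-large-1160k-finetuned',
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['validation'],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```
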
## Training data

The Scandinavian subset of the Nordic Pile (Swedish, Norwegian, Danish), consisting of 414 962 688 text samples.

## Training procedure

The model was trained with the [optimum-habana](https://github.com/huggingface/optimum-habana) framework on 8x Intel® Gaudi® 2 AI accelerators, managed by Intel Sweden AB.

The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, while the tokenizer was trained from scratch.

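Because the tokenizer was retrained on Scandinavian text, its vocabulary differs from that of the original FacebookAI/roberta-large tokenizer. A quick, purely illustrative way to compare the two:

```python
from transformers import AutoTokenizer

# Tokenizer trained from scratch for this model
scandi_tokenizer = AutoTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
# Original English RoBERTa tokenizer that accompanies the initialization weights
english_tokenizer = AutoTokenizer.from_pretrained('FacebookAI/roberta-large')

text = "Huvudstaden i Sverige är Stockholm."
print(scandi_tokenizer.tokenize(text))   # Scandinavian-trained subwords
print(english_tokenizer.tokenize(text))  # typically splits Swedish words into more pieces
```
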
This model is an intermediate checkpoint at training step 1 160 000 (out of 1 350 790 steps in total); the full run corresponds to 5 epochs.

A batch size of 1536 was used.

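As a rough illustration of the procedure described above, a masked language modeling run with optimum-habana's `GaudiTrainer` could be set up along the following lines. The corpus file, hyperparameters and Gaudi configuration name are assumptions for the sketch and do not reproduce the original training run:

```python
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# Initialize from the English RoBERTa weights, with the retrained Scandinavian tokenizer
model = AutoModelForMaskedLM.from_pretrained('FacebookAI/roberta-large')
tokenizer = AutoTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model.resize_token_embeddings(len(tokenizer))  # embeddings must match the new vocabulary

# Placeholder text corpus; the real run used the Scandinavian subset of the Nordic Pile
dataset = load_dataset('text', data_files={'train': 'scandinavian_corpus.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Assumed Gaudi configuration; pick one appropriate for your hardware setup
gaudi_config = GaudiConfig.from_pretrained('Habana/roberta-large')

args = GaudiTrainingArguments(
    output_dir='roberta-large-scandi',
    use_habana=True,
    use_lazy_mode=True,
    per_device_train_batch_size=192,  # assumes the reported 1536 is the global batch over 8 devices
    num_train_epochs=5,
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    args=args,
    train_dataset=tokenized['train'],
    data_collator=collator,
)
trainer.train()
```
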
## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results: