davda54 committed
Commit 937e8cf
1 parent: ac61a16

Add evaluation of Mistral

Files changed (1):
  1. README.md +57 -41

README.md CHANGED
@@ -82,41 +82,10 @@ _____
*Disclaimer: our model evaluation is ongoing and is not claimed to be exhaustive. We provide our initial evaluation results on standard natural language understanding and generation tasks, and our evaluation design will be extended.
The user should perform evaluation for their particular model application scenario, including safety and bias evaluations.*

- The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 5.95 and the final training perplexity is 3.61.
+ The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.

Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
- We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b).
+ We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
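Perplexity here is the exponential of the mean per-token negative log-likelihood on the evaluated text. A minimal sketch of that computation with 🤗 Transformers follows; the model id, split name, and text field are illustrative assumptions rather than the exact configuration behind the numbers above.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions for illustration only: the model id, split name, and "text"
# field are placeholders, not necessarily the reported setup.
model_id = "norallm/normistral-7b-warm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    # Perplexity = exp(mean negative log-likelihood per token).
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

ncc_validation = load_dataset("NbAiLab/NCC", split="validation", streaming=True)
print(perplexity(next(iter(ncc_validation))["text"]))
```

Corpus-level figures like the ones above are aggregated over the whole split (total negative log-likelihood divided by total token count) rather than averaged per document.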
-
-
- ### Reading comprehension
-
- [NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
-
- <details>
- <summary>Method</summary>
-
- * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
- * Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
- * Few-shot results show the average scores across 5 repetitions
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
- * Performance metrics: macro-averaged F1-score and exact match (EM).
-
- </details>
-
- <details open>
- <summary>Performance results on the extractive question answering task (NorQuAD)</summary>
-
- |Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
- |---|---|---|---|
- |NorMistral-7b-warm|**48.6**/**24.8**|**63.6**/**40.0**|**66.5**/43.8|
- |NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
- |NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
- |NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
- |Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
- |GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
- |GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/**44.5**|
-
- </details>


### Sentiment analysis
@@ -144,12 +113,48 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.3|62.2|80.1|
|NorBLOOM-7b|**75.7**|73.8|65.5|
|NB-GPT-J|48.4|56.5|65.2|
- |Falcon-7B|53.3|61.6|74.9|
|GPT-Sw3-6.7B|61.5|72.2|76.5|
|GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
+ |Falcon-7B|53.3|61.6|74.9|
+ |Mistral-7B-v0.1|70.2|72.9|84.8|
+
+ </details>
+
+
+
+ ### Reading comprehension
+
+ [NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
+
+ <details>
+ <summary>Method</summary>
+
+ * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
+ * Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```, based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165); a prompt-construction sketch follows below.
+ * Few-shot results show the average scores across 5 repetitions.
+ * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
+ * Performance metrics: macro-averaged F1-score and exact match (EM).
+
+ </details>
+
+ <details open>
+ <summary>Performance results on the extractive question answering task (NorQuAD)</summary>
+
+ |Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
+ |---|---|---|---|
+ |NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
+ |NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
+ |NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
+ |NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
+ |GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
+ |GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
+ |Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
+ |Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|

</details>
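To make the evaluation setting above concrete, here is a minimal sketch of how the k-shot NorQuAD prompt can be assembled and answered with greedy decoding. The records and the `model`/`tokenizer` objects are assumed; the authoritative implementation is the linked norquad.py script.

```python
# Sketch of the k-shot prompt format described above; the records are
# hypothetical stand-ins for ltg/norquad examples.
TEMPLATE = "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"

def format_example(ex: dict, answer: str = "") -> str:
    return TEMPLATE.format(title=ex["title"], text=ex["text"],
                           question=ex["question"], answer=answer)

def build_prompt(target: dict, demonstrations: tuple = ()) -> str:
    # k-shot: k solved demonstrations first, then the target example with an
    # empty answer slot, so the model continues right after "Svar:".
    blocks = [format_example(d, " " + d["answer"]) for d in demonstrations]
    blocks.append(format_example(target))
    return "\n\n".join(blocks)

# Greedy decoding, matching the evaluation setting (model/tokenizer assumed loaded):
# inputs = tokenizer(build_prompt(example, demos), return_tensors="pt")
# output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# prediction = tokenizer.decode(output[0, inputs.input_ids.shape[1]:],
#                               skip_special_tokens=True).strip()
# exact_match = float(prediction == gold)  # EM; F1 instead scores token overlap
```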

+
+
### Machine translation

[Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
@@ -158,7 +163,7 @@ We use the binary formulation of this task (positive vs. negative).
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
- * Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```.
+ * Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```, based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398); a scoring sketch follows below.
* Few-shot results show the average scores across 5 repetitions.
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
* Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
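As a companion to the method above, a small sketch of the translation prompt and of BLEU/chrF++ scoring with the sacrebleu package; the sentences are hypothetical placeholders, and `CHRF(word_order=2)` is the chrF++ variant.

```python
from sacrebleu.metrics import BLEU, CHRF

def mt_prompt(source_language: str, target_language: str, source_text: str) -> str:
    # Zero-shot form of the template above; few-shot prepends completed pairs.
    return f"{source_language}: {source_text}\n{target_language}:"

print(mt_prompt("Engelsk", "Bokmål", "Where is the library?"))

hypotheses = ["Hvor er biblioteket?"]    # hypothetical system outputs
references = [["Hvor er biblioteket?"]]  # one inner list per reference stream

print(BLEU().corpus_score(hypotheses, references).score)
print(CHRF(word_order=2).corpus_score(hypotheses, references).score)  # chrF++
```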
@@ -174,9 +179,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
|NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
|NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
- |Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
|GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
|GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
+ |Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
+ |Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
+

</details>

@@ -189,9 +196,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
|NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
- |Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
|GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
|GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
+ |Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
+ |Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
+

</details>

@@ -205,9 +214,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|9.8/41.4|24.8/58.3|47.6/67.7|
- |Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+ |Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
+ |Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
+

</details>

@@ -220,9 +231,10 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
- |Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+ |Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
+ |Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|

</details>

@@ -236,9 +248,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
|NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
- |Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
|GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
|GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
+ |Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
+ |Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
+

</details>

@@ -251,13 +265,15 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
|NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
|NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
- |Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
|GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
|GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
+ |Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
+ |Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|

</details>


+
_____
## Hardware and Software