Add evaluation of Mistral
README.md CHANGED

@@ -82,41 +82,10 @@ _____
*Disclaimer: our model evaluation is an ongoing phase and is not claimed to be exhaustive. We provide our initial evaluation results on standard natural language understanding and generation tasks, and our evaluation design will be extended.
The user should perform evaluation for their particular model application scenario, including safety and bias evaluations.*

-The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is
+The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.

Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
-We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b).
-
-
-### Reading comprehension
-
-[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
-
-<details>
-<summary>Method</summary>
-
-* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
-* Few-shot results show the average scores across 5 repetitions
-* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
-* Performance metrics: macro-averaged F1-score and exact match (EM).
-
-</details>
-
-<details open>
-<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
-
-|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
-|---|---|---|---|
-|NorMistral-7b-warm|**48.6**/**24.8**|**63.6**/**40.0**|**66.5**/43.8|
-|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
-|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
-|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
-|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
-|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
-|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/**44.5**|
-
-</details>
+We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).


### Sentiment analysis

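A minimal sketch of how the held-out NCC perplexity quoted above could be reproduced with a sliding-window evaluation over the validation split. This is not the official evaluation code (that lives in the norallm repository linked above); the model identifier is one of the compared models, and the 100-document sample, window length and stride are illustrative assumptions.

```python
# Sketch: sliding-window perplexity of a causal LM on the NCC validation split.
# Illustrative only -- the sample size, window length and stride are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "norallm/normistral-7b-warm"  # any of the compared models works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

validation = load_dataset("NbAiLab/NCC", split="validation", streaming=True)
text = "\n\n".join(example["text"] for example, _ in zip(validation, range(100)))

encodings = tokenizer(text, return_tensors="pt")
seq_len = encodings.input_ids.size(1)
max_length, stride = 2048, 512

total_nll, total_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end                      # only score tokens not scored before
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    labels = input_ids.clone()
    labels[:, :-target_len] = -100                   # mask the overlapping left context
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over the unmasked tokens
    total_nll += loss.item() * target_len
    total_tokens += target_len
    prev_end = end
    if end == seq_len:
        break

print(f"perplexity: {torch.exp(torch.tensor(total_nll / total_tokens)).item():.2f}")
```
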
@@ -144,12 +113,48 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.3|62.2|80.1|
|NorBLOOM-7b|**75.7**|73.8|65.5|
|NB-GPT-J|48.4|56.5|65.2|
-|Falcon-7B|53.3|61.6|74.9|
|GPT-Sw3-6.7B|61.5|72.2|76.5|
|GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
+|Falcon-7B|53.3|61.6|74.9|
+|Mistral-7B-v0.1|70.2|72.9|84.8|
+
+</details>
+
+
+
+### Reading comprehension
+
+[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
+
+<details>
+<summary>Method</summary>
+
+* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
+* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"``` Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
+* Few-shot results show the average scores across 5 repetitions
+* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
+* Performance metrics: macro-averaged F1-score and exact match (EM).
+
+</details>
+
+<details open>
+<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
+
+|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
+|---|---|---|---|
+|NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
+|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
+|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
+|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
+|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
+|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
+|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
+|Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|

</details>

+
+
### Machine translation

[Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).

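For the NorQuAD setup above (prompt template, greedy decoding, short generated answer spans), the evaluation loop is essentially generate-and-parse. The sketch below illustrates it; the authoritative implementation is the norquad.py script linked in the Method list, and the example field names (title, text, question, answer), the double-newline separator between demonstrations and the 32-token answer budget are assumptions made for illustration.

```python
# Sketch: zero-/few-shot extractive QA with the NorQuAD prompt template from above.
# Illustrative only -- field names, demonstration separator and generation length are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "norallm/normistral-7b-warm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

TEMPLATE = "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"

def build_prompt(example, demonstrations=()):
    """k completed demonstrations followed by the target question, with 'Svar:' left open."""
    blocks = [TEMPLATE.format(**demo) for demo in demonstrations]
    blocks.append(TEMPLATE.format(**{**example, "answer": ""}))
    return "\n\n".join(blocks)

def predict_answer(example, demonstrations=()):
    inputs = tokenizer(build_prompt(example, demonstrations), return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,   # answers are short extracted spans
        do_sample=False,     # greedy decoding, as in the evaluation setting above
    )
    continuation = outputs[0, inputs["input_ids"].shape[1]:]
    # The first generated line is taken as the predicted span; F1/EM are computed against the gold answer.
    return tokenizer.decode(continuation, skip_special_tokens=True).strip().split("\n")[0]
```
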
@@ -158,7 +163,7 @@
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```.
+* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
* Few-shot results show the average scores across 5 repetitions
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
* Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).

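The translation setting follows the same pattern: format the source sentence with the prompt above (optionally preceded by completed example pairs), decode greedily, and keep the first generated line. Below is a minimal sketch that reuses a loaded model and tokenizer as in the snippets above; the double-newline between demonstrations and the 128-token budget are illustrative assumptions, and the official script is the machine_translation.py file linked above.

```python
# Sketch: prompt-based translation with the template from the Method list above.
# Language labels follow the template ("Engelsk", "Bokmål", "Nynorsk"); separators and lengths are assumptions.
def translation_prompt(source_language, target_language, source_text, examples=()):
    """k-shot prompt: completed example pairs first, the final target side left open."""
    pairs = [f"{source_language}: {src}\n{target_language}:{tgt}" for src, tgt in examples]
    pairs.append(f"{source_language}: {source_text}\n{target_language}:")
    return "\n\n".join(pairs)

def translate(model, tokenizer, source_language, target_language, source_text, examples=()):
    prompt = translation_prompt(source_language, target_language, source_text, examples)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
    continuation = tokenizer.decode(
        outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return continuation.strip().split("\n")[0]  # the first generated line is the translation

# Example (1-shot English -> Bokmål):
# translate(model, tokenizer, "Engelsk", "Bokmål", "The cat is sleeping.",
#           examples=[("I like coffee.", " Jeg liker kaffe.")])
```
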
@@ -174,9 +179,11 @@
|NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
|NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
|NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
-|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
|GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
|GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
+|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
+|Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
+

</details>

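Each cell in these machine-translation tables reports a BLEU/chrF++ pair. Scoring a list of generated translations against references is a one-liner per metric with sacrebleu; a short sketch with toy data (the Norwegian sentences are only placeholders):

```python
# Sketch: corpus-level BLEU and chrF++ (the two numbers in each table cell) with sacrebleu.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["Katten sover.", "Jeg liker kaffe."]        # system outputs (toy data)
references = [["Katten sover.", "Jeg er glad i kaffe."]]  # one reference stream, parallel to the hypotheses

bleu = BLEU()
chrf_plus_plus = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_plus_plus.corpus_score(hypotheses, references))
```
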
@@ -189,9 +196,11 @@
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
|NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
-|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
|GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
|GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
+|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
+|Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
+

</details>

@@ -205,9 +214,11 @@
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|9.8/41.4|24.8/58.3|47.6/67.7|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
+|Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
+

</details>

@@ -220,9 +231,10 @@
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
+|Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|

</details>

@@ -236,9 +248,11 @@
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
|NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
-|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
|GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
|GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
+|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
+|Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
+

</details>

@@ -251,13 +265,15 @@
|NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
|NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
|NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
-|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
|GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
|GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
+|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
+|Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|

</details>


+
_____
## Hardware and Software