NickyHavoc committed
Commit 5fc77c4
Parent(s): 4e8e5d5

Include general knowledge benchmarks (#4)
- Include general knowledge benchmarks (9338a66ad980e1b585a2c45da4a29dad51b2b869)

README.md CHANGED
@@ -235,12 +235,11 @@ While performing in the same ballpark as `llama-3.1-8b-instruct`, `Pharia-1-LLM-
| | | | | |
| --- | --- | --- | --- | --- |
| **Model** | **Quality DE**, 1 (bad) to 5 (great) | **Quality EN**, 1 (bad) to 5 (great) | **Concise**, in % | **Instruction following**, in % |
-| `llama-3.1-8b-instruct` | **3.62** |
+| `llama-3.1-8b-instruct` | **3.62** | 4.01 | 89.7 | **83.6** |
| `Pharia-1-LLM-7B-control` | 3.60 | 4.00 | **91.9** | 81.8 |
+| `Pharia-1-LLM-7B-control-aligned` | 3.51 | **4.08** | 81.8 | 77.7 |
| `Mistral-7B-Instruct-v0.3` | 3.47 | 3.88 | 88.5 | 80.4 |

-**Note:** We will add the engineering benchmark evaluations for `Pharia-1-LLM-7B-control-aligned` shortly.
-
#### Performance on length-controlled completions

“Absolute normalized distance to target” measures how much a model’s completions deviate from the desired length, calculated as:
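The formula referenced here follows in the README and is not part of the hunk shown above. As a rough illustration only, the sketch below assumes the plain reading |completion length − target length| / target length; the function and variable names are hypothetical.

```python
# Hedged sketch, not the README's definition: assumes "absolute normalized
# distance to target" means |completion_length - target_length| / target_length.
def abs_normalized_distance_to_target(completion_length: int, target_length: int) -> float:
    """Deviation of a completion from the requested length, normalized by that length."""
    return abs(completion_length - target_length) / target_length

# Example: a 130-token completion against a 100-token target deviates by 0.30.
print(abs_normalized_distance_to_target(130, 100))  # 0.3
```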
@@ -278,7 +277,35 @@ We assessed each model’s ability to produce safe answers given prompts that te

### General Knowledge Benchmarks

-We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://
+We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) provide reproducible comparability of model performance, they were designed for the evaluation of pre-trained models and should not be mistaken for strong indicators of use-case-specific performance. In contrast to what [some research](https://arxiv.org/abs/2405.00332) might suggest for other models, our Pharia-1-LLM-7B models have not been tailored to such generic benchmarks and would naturally be expected to underperform on them.
+
+| **Benchmark** | **Shots** | **Metric** | **Pharia-1-LLM-7B-control** | **Pharia-1-LLM-7B-control-aligned** | **Llama-3.1-8B-Instruct** | **Mistral-7B-Instruct-v0.3** |
+| --- | --- | --- | --- | --- | --- | --- |
+| 1. **General Knowledge:** [**Open LLM Leaderboard V1**](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) | | | | | | |
+| ARC-Challenge | 25 | **acc\_norm** | `0.546` | `0.528` | `0.563` | `0.613` |
+| TruthfulQA | 6 | **prob\_mass** | `0.547` | `0.566` | `0.542` | `0.635` |
+| GSM8K | 5 | **acc** | `0.014` | `0.163` | `0.573` | `0.488` |
+| MMLU | 5 | **acc** | `0.484` | `0.525` | `0.659` | `0.624` |
+| HellaSwag | 10 | **acc\_norm** | `0.646` | `0.761` | `0.779` | `0.826` |
+| Winogrande | 5 | **acc** | `0.651` | `0.643` | `0.732` | `0.784` |
+| 2. **General Knowledge: Multilingual** | | | | | | |
+| Lambada Multilingual: en, fr, de, it, es | 10 | **acc** | `0.340` | `0.525` | `0.540` | `0.589` |
+| ARC-Challenge-DE | 25 | **acc\_norm** | `0.486` | `0.486` | `0.459` | `0.475` |
+| HellaSwag-DE | 10 | **acc\_norm** | `0.487` | `0.633` | `0.598` | `0.583` |
+| MMLU-DE | 5 | **acc** | `0.428` | `0.488` | `0.589` | `0.537` |
+| TruthfulQA-DE | 6 | **prob\_mass** | `0.561` | `0.576` | `0.509` | `0.623` |
+| 3. **Translation** | | | | | | |
+| WMT14 | 5 | **bleu, chrf, ter** | `32.66`, `61.32`, `53.77` | `33.07`, `61.73`, `53.14` | `35.77`, `63.08`, `50.02` | `33.29`, `61.49`, `52.56` |
+| WMT16 | 5 | **bleu, chrf, ter** | `30.59`, `60.36`, `56.62` | `31.64`, `61.18`, `55.48` | `34.24`, `62.69`, `51.95` | `31.13`, `60.34`, `56.25` |
+| WMT20 | 5 | **bleu, chrf, ter** | `26.60`, `58.57`, `63.09` | `26.65`, `58.82`, `63.37` | `28.12`, `59.60`, `59.73` | `26.32`, `58.06`, `61.81` |
+| 4. **Expert Domain: Law** | | | | | | |
+| Legal-Sentence-Classification-Dataset | 5 | **acc** | `0.315` | `0.357` | `0.424` | `0.418` |
+| LexGlue Case-Hold | 5 | **acc\_norm** | `0.268` | `0.282` | `0.297` | `0.303` |
+| MMLU Law | 5 | **acc** | `0.465` | `0.524` | `0.689` | `0.674` |
+| MMLU-DE Law | 5 | **acc** | `0.439` | `0.516` | `0.626` | `0.560` |
+| 5. **Expert Domain: Engineering** | | | | | | |
+| MMLU Engineering | 5 | **acc** | `0.401` | `0.431` | `0.624` | `0.595` |
+| MMLU-DE Engineering | 5 | **acc** | `0.389` | `0.426` | `0.529` | `0.533` |

# Training Details
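For context on the accuracy-style scores in the General Knowledge table above: the Open LLM Leaderboard v1 is backed by EleutherAI's lm-evaluation-harness, so a comparable number (for example, 25-shot ARC-Challenge acc_norm) can be reproduced roughly as sketched below. This is not the evaluation setup used for the table; the model identifier is a placeholder, and exact scores depend on harness version, prompt formatting, and batching.

```python
# Sketch of a comparable run with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Not the authors' setup; the model id is a placeholder and API details vary by harness version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Aleph-Alpha/Pharia-1-LLM-7B-control,trust_remote_code=True",
    tasks=["arc_challenge"],  # 25-shot ARC-Challenge, as in the table
    num_fewshot=25,
    batch_size=8,
)

# Per-task metrics (e.g. acc_norm) are reported under results["results"].
print(results["results"]["arc_challenge"])
```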
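The WMT rows above report three scores per cell, in the order given by the metric column: BLEU, chrF, and TER. As a hedged illustration of how such a triple is commonly computed, the sketch below runs the sacrebleu library on a toy hypothesis/reference pair; it is not the pipeline behind the table.

```python
# Toy illustration of a BLEU / chrF / TER triple with the sacrebleu library;
# not the evaluation pipeline used for the WMT rows above.
import sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)

print(f"bleu={bleu.score:.2f}, chrf={chrf.score:.2f}, ter={ter.score:.2f}")
```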