siddartha-abacus committed • Commit d11fb56 • Parent(s): a507cd2

Update README.md
| GPT-4-Turbo | 9.38 | 9.00 | 9.19 |
| Meta-Llama-3-70B-Instruct | 9.21 | 8.80 | 9.01 |

### OpenLLM Leaderboard Manual Evaluation

| Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* |
| :---- | ---: | ------: | ---: | ---: | ---: | ---: |
| Smaug-Llama-3-70B-Instruct | 70.5 | 86.1 | 79.2 | 62.5 | 83.5 | 90.5 |
| Llama-3-70B-Instruct | 71.4 | 85.7 | 80.1 | 61.8 | 82.9 | 91.1 |
**GSM8K** The GSM8K numbers quoted here are computed using a recent release of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/). The commit used by the leaderboard has a significant issue that impacts models that tend to use `:` in their responses, due to a bug in the stop-word configuration for GSM8K. The issue is discussed in more detail at [GSM8K eval issue](http://fixme). The scores for both Llama-3 and this model are significantly different when evaluated with the updated harness, as the stop-word issue has been addressed.
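To illustrate why the stop-word configuration matters, the sketch below shows how listing `:` as a stop string truncates a GSM8K-style response before the model ever states its final answer. This is a minimal, hypothetical reproduction of the failure mode (the `truncate_at_stop` helper is illustrative, not the harness's actual implementation):

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut a generated completion at the first occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


response = "Step 1: add 3 and 4 to get 7. The answer is 7"

# Buggy configuration: ":" as a stop string cuts off the chain of thought
# at the first colon, so the final answer is never emitted.
buggy = truncate_at_stop(response, [":", "Q:"])

# Fixed configuration: only the few-shot delimiter stops generation.
fixed = truncate_at_stop(response, ["Q:"])

print(buggy)  # "Step 1" -- answer extraction finds nothing and scores 0
print(fixed)  # full response, so "7" can be extracted and scored
```

A model whose style favors `Step 1:`-like prefixes is truncated on nearly every problem under the buggy configuration, which is why the corrected harness moves the scores for both models so substantially.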
This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later on. For now, see the previous Smaug paper: https://arxiv.org/abs/2402.13228.