siddartha-abacus committed • Commit d11fb56 • Parent(s): a507cd2

Update README.md
| GPT-4-Turbo | 9.38 | 9.00 | 9.19 |
| Meta-Llama-3-70B-Instruct | 9.21 | 8.80 | 9.01 |

### OpenLLM Leaderboard Manual Evaluation

| Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* |
| :---- | ---: | ------: | ---: | ---: | ---: | ---: |
| Smaug-Llama-3-70B-Instruct | 70.5 | 86.1 | 79.2 | 62.5 | 83.5 | 90.5 |
| Llama-3-70B-Instruct | 71.4 | 85.7 | 80.1 | 61.8 | 82.9 | 91.1 |
**GSM8K** The GSM8K numbers quoted here are computed using a recent release of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/). The commit used by the leaderboard has a significant issue that impacts models that tend to use `:` in their responses, due to a bug in the stop-word configuration for GSM8K. The issue is discussed in more detail at [GSM8K eval issue](http://fixme). The scores for both Llama-3 and this model are significantly different when evaluated with the updated harness, as the stop-word issue has been addressed.
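To illustrate why the stop-word configuration matters, the sketch below shows how listing `:` as a stop string truncates a GSM8K-style response before the model ever states its final answer. This is a minimal, hypothetical reproduction of the failure mode (the `truncate_at_stop` helper is illustrative, not the harness's actual implementation):

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut a generated completion at the first occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


response = "Step 1: add 3 and 4 to get 7. The answer is 7"

# Buggy configuration: ":" as a stop string cuts off the chain of thought
# at the first colon, so the final answer is never emitted.
buggy = truncate_at_stop(response, [":", "Q:"])

# Fixed configuration: only the few-shot delimiter stops generation.
fixed = truncate_at_stop(response, ["Q:"])

print(buggy)  # "Step 1" -- answer extraction finds nothing and scores 0
print(fixed)  # full response, so "7" can be extracted and scored
```

A model whose style favors `Step 1:`-like prefixes is truncated on nearly every problem under the buggy configuration, which is why the corrected harness moves the scores for both models so substantially.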
This version of Smaug uses new techniques and new data compared to [Smaug-72B](https://huggingface.co/abacusai/Smaug-72B-v0.1), and more information will be released later on. For now, see the previous Smaug paper: https://arxiv.org/abs/2402.13228.