Update README.md
In AlpacaEval, Rocket 🦝 achieves a near 80% win rate, coupled with an average …

| Metric                     | Value |
|----------------------------|-------|
| ARC (25-shot)              | 50.51 |
| HellaSwag (0-shot)         | 73.91 |
| TruthfulQA (mc2) (0-shot)  | 54.38 |
| BoolQ (0-shot)             | 81.71 |
| Winogrande (5-shot)        | 67.8  |
| GSM8K (5-shot)             | 37.91 |
| MathQA (5-shot)            | 31.26 |
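
For a rough single-number comparison with other models, an unweighted mean of the scores above can be computed. This is purely illustrative, not an official leaderboard aggregate, and the task/shot mix here differs from standard leaderboards:

```python
# Per-task scores copied from the table above
scores = {
    "ARC (25-shot)": 50.51,
    "HellaSwag (0-shot)": 73.91,
    "TruthfulQA (mc2) (0-shot)": 54.38,
    "BoolQ (0-shot)": 81.71,
    "Winogrande (5-shot)": 67.8,
    "GSM8K (5-shot)": 37.91,
    "MathQA (5-shot)": 31.26,
}

# Simple unweighted macro-average over the seven tasks
average = sum(scores.values()) / len(scores)
print(f"Unweighted average: {average:.2f}")  # → 56.78
```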
## Intended uses & limitations

```python
# … (model and tokenizer setup elided in this excerpt)
generated_text = model.generate(**inputs, max_length=3084, top_p=0.95, do_sample=True)
```
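
The `top_p=0.95` argument in the generation call enables nucleus sampling: at each step, sampling is restricted to the smallest set of tokens whose cumulative probability reaches `top_p`, and the rest are discarded. A minimal sketch of that filtering step over a plain probability list (illustrative only; `transformers` implements this internally over logits):

```python
def top_p_filter(probs, top_p=0.95):
    """Zero out all but the smallest set of tokens whose cumulative
    probability reaches top_p, then renormalize the survivors."""
    # Token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # nucleus found: stop adding tokens
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# Toy next-token distribution over a 4-token vocabulary
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

With `top_p=0.9`, the lowest-probability token falls outside the nucleus and is zeroed out, and the remaining three probabilities are renormalized to sum to 1.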
## Bias, Risks, and Limitations

Unlike ChatGPT, which incorporates in-the-loop filtering of responses and is aligned during the RLHF phase for safe completions, our model lacks these safeguards. Consequently, it may generate problematic outputs, particularly when prompted in certain ways. The model's score on the ToxiGen benchmark is reported below.

The pretraining dataset comprises a filtered mixture of open-source large-scale datasets available on the [HuggingFace Hub](https://huggingface.co/datasets): Falcon RefinedWeb extract ([Penedo et al., 2023](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)), RedPajama-Data ([Together Computer, 2023](https://github.com/togethercomputer/RedPajama-Data)) and The Pile ([Gao et al., 2020](https://arxiv.org/abs/2101.00027)), both without the *Books3* subset, and StarCoder ([Li et al., 2023](https://arxiv.org/abs/2305.06161)).

| Metric           | Value |
|------------------|-------|
| ToxiGen (0-shot) | 43.40 |

*The model name is inspired by the small but formidable character from 'Guardians of the Galaxy'. Similar to its namesake, this model, with its 3 billion parameters, showcases remarkable efficiency and effectiveness, challenging larger models despite its smaller size.*