leaderboard-pr-bot committed on
Commit e2e7028
1 Parent(s): b256b38

Adding Evaluation Results


This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions

Files changed (1)
  1. README.md +133 -15
README.md CHANGED
@@ -1,29 +1,134 @@
  ---
  language:
  - code
- pipeline_tag: text-generation
+ license: llama2
  tags:
  - llama-2
- license: llama2
+ pipeline_tag: text-generation
  widget:
- - example_title: Hello world (Python)
-   messages:
-   - role: system
-     content: You are a helpful and honest code assistant
-   - role: user
-     content: Print a hello world in Python
- - example_title: Sum of sublists (Python)
-   messages:
-   - role: system
-     content: You are a helpful and honest code assistant expert in JavaScript. Please, provide all answers to programming questions in JavaScript
-   - role: user
-     content: Write a function that computes the set of sums of all contiguous sublists of a given list.
+ - example_title: Hello world (Python)
+   messages:
+   - role: system
+     content: You are a helpful and honest code assistant
+   - role: user
+     content: Print a hello world in Python
+ - example_title: Sum of sublists (Python)
+   messages:
+   - role: system
+     content: You are a helpful and honest code assistant expert in JavaScript. Please,
+       provide all answers to programming questions in JavaScript
+   - role: user
+     content: Write a function that computes the set of sums of all contiguous sublists
+       of a given list.
  inference:
    parameters:
      max_new_tokens: 200
      stop:
      - </s>
      - <step>
+ model-index:
+ - name: CodeLlama-70b-Instruct-hf
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: AI2 Reasoning Challenge (25-Shot)
+       type: ai2_arc
+       config: ARC-Challenge
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: acc_norm
+       value: 55.03
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: HellaSwag (10-Shot)
+       type: hellaswag
+       split: validation
+       args:
+         num_few_shot: 10
+     metrics:
+     - type: acc_norm
+       value: 77.24
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU (5-Shot)
+       type: cais/mmlu
+       config: all
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 56.4
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: TruthfulQA (0-shot)
+       type: truthful_qa
+       config: multiple_choice
+       split: validation
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: mc2
+       value: 50.44
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Winogrande (5-shot)
+       type: winogrande
+       config: winogrande_xl
+       split: validation
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 74.51
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GSM8k (5-shot)
+       type: gsm8k
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 46.25
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=codellama/CodeLlama-70b-Instruct-hf
+       name: Open LLM Leaderboard
  ---
  # **Code Llama**
  Code Llama is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 70B instruct-tuned version in the Hugging Face Transformers format. This model is designed for general code synthesis and understanding. Links to other models can be found in the index at the bottom.
@@ -215,4 +320,17 @@ See evaluations for the main models and detailed ablations in Section 3 and safe

  Code Llama and its variants are a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Code Llama’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Code Llama, developers should perform safety testing and tuning tailored to their specific applications of the model.

- Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-use-guide](https://ai.meta.com/llama/responsible-use-guide).
+ Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-use-guide](https://ai.meta.com/llama/responsible-use-guide).
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_codellama__CodeLlama-70b-Instruct-hf)
+
+ | Metric                          |Value|
+ |---------------------------------|----:|
+ |Avg.                             |59.98|
+ |AI2 Reasoning Challenge (25-Shot)|55.03|
+ |HellaSwag (10-Shot)              |77.24|
+ |MMLU (5-Shot)                    |56.40|
+ |TruthfulQA (0-shot)              |50.44|
+ |Winogrande (5-shot)              |74.51|
+ |GSM8k (5-shot)                   |46.25|
+
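The Avg. figure the bot reports is simply the arithmetic mean of the six benchmark scores it adds; a two-line sanity check in plain Python, with the values copied from the table above:

```python
# Sanity check: the Open LLM Leaderboard "Avg." column is the plain
# arithmetic mean of the six benchmark scores added in this PR.
scores = [55.03, 77.24, 56.40, 50.44, 74.51, 46.25]
print(round(sum(scores) / len(scores), 2))  # -> 59.98
```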
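For anyone who wants to try the card's widget examples outside the hosted inference widget, here is a minimal sketch using `transformers`. It assumes the `codellama/CodeLlama-70b-Instruct-hf` repo (the id in the leaderboard URLs above) ships a chat template and that `<step>` is a single token in its vocabulary; the generation settings mirror the `inference:` block in the YAML front matter.

```python
# Minimal sketch (a local reproduction, not the card's official snippet):
# run the "Hello world (Python)" widget example. Requires `transformers`
# and `accelerate`, plus enough memory for a 70B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-70b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The same messages as the first widget entry in the YAML front matter.
messages = [
    {"role": "system", "content": "You are a helpful and honest code assistant"},
    {"role": "user", "content": "Print a hello world in Python"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Mirror the card's inference parameters: at most 200 new tokens, stopping
# on </s> or the <step> turn separator (assumed to be one vocabulary token).
stop_ids = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<step>")]
output = model.generate(input_ids, max_new_tokens=200, eos_token_id=stop_ids)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```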