Commit 2a9557b by macadeliccc
1 Parent(s): fa411bf

Update README.md

Files changed (1)
  1. README.md +66 -62
README.md CHANGED
@@ -83,88 +83,92 @@ print(generate_response(prompt), "\n")
  ## Eval

  evaluation [colab](https://colab.research.google.com/drive/1FpwgsGzCR4tORTxAwUxpN3PcP22En2xk?usp=sharing)
-
  | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
  |---------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
  |[laser-dolphin-mixtral-2x7b-dpo](https://huggingface.co/macadeliccc/laser-dolphin-mixtral-2x7b-dpo)| 41.31| 73.67| 61.69| 42.79| 54.87|

  ### AGIEval
  | Task |Version| Metric |Value| |Stderr|
  |------------------------------|------:|--------|----:|---|-----:|
- |agieval_aqua_rat | 0|acc |22.44|± | 2.62|
- | | |acc_norm|21.26|± | 2.57|
- |agieval_logiqa_en | 0|acc |34.87|± | 1.87|
- | | |acc_norm|35.79|± | 1.88|
- |agieval_lsat_ar | 0|acc |22.17|± | 2.75|
- | | |acc_norm|23.04|± | 2.78|
- |agieval_lsat_lr | 0|acc |43.14|± | 2.20|
- | | |acc_norm|45.10|± | 2.21|
- |agieval_lsat_rc | 0|acc |57.25|± | 3.02|
- | | |acc_norm|55.76|± | 3.03|
- |agieval_sat_en | 0|acc |71.84|± | 3.14|
- | | |acc_norm|71.84|± | 3.14|
- |agieval_sat_en_without_passage| 0|acc |44.17|± | 3.47|
- | | |acc_norm|41.75|± | 3.44|
- |agieval_sat_math | 0|acc |40.91|± | 3.32|
- | | |acc_norm|35.91|± | 3.24|
-
- Average: 41.31%

  ### GPT4All
  | Task |Version| Metric |Value| |Stderr|
  |-------------|------:|--------|----:|---|-----:|
- |arc_challenge| 0|acc |58.02|± | 1.44|
- | | |acc_norm|60.58|± | 1.43|
- |arc_easy | 0|acc |85.48|± | 0.72|
- | | |acc_norm|82.62|± | 0.78|
- |boolq | 1|acc |87.16|± | 0.59|
- |hellaswag | 0|acc |65.04|± | 0.48|
- | | |acc_norm|83.63|± | 0.37|
- |openbookqa | 0|acc |35.60|± | 2.14|
- | | |acc_norm|45.00|± | 2.23|
- |piqa | 0|acc |81.99|± | 0.90|
- | | |acc_norm|83.51|± | 0.87|
- |winogrande | 0|acc |73.16|± | 1.25|
-
- Average: 73.67%

  ### TruthfulQA
  | Task |Version|Metric|Value| |Stderr|
  |-------------|------:|------|----:|---|-----:|
- |truthfulqa_mc| 1|mc1 |44.31|± | 1.74|
- | | |mc2 |61.69|± | 1.50|

- Average: 61.69%

  ### Bigbench
  | Task |Version| Metric |Value| |Stderr|
  |------------------------------------------------|------:|---------------------|----:|---|-----:|
- |bigbench_causal_judgement | 0|multiple_choice_grade|59.47|± | 3.57|
- |bigbench_date_understanding | 0|multiple_choice_grade|66.67|± | 2.46|
- |bigbench_disambiguation_qa | 0|multiple_choice_grade|36.05|± | 3.00|
- |bigbench_geometric_shapes | 0|multiple_choice_grade|20.33|± | 2.13|
- | | |exact_str_match | 7.52|± | 1.39|
- |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|27.80|± | 2.01|
- |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|19.86|± | 1.51|
- |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|48.67|± | 2.89|
- |bigbench_movie_recommendation | 0|multiple_choice_grade|49.60|± | 2.24|
- |bigbench_navigate | 0|multiple_choice_grade|53.20|± | 1.58|
- |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|68.50|± | 1.04|
- |bigbench_ruin_names | 0|multiple_choice_grade|41.74|± | 2.33|
- |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|16.23|± | 1.17|
- |bigbench_snarks | 0|multiple_choice_grade|64.09|± | 3.58|
- |bigbench_sports_understanding | 0|multiple_choice_grade|70.69|± | 1.45|
- |bigbench_temporal_sequences | 0|multiple_choice_grade|37.70|± | 1.53|
- |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.44|± | 1.20|
- |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.60|± | 0.91|
- |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|48.67|± | 2.89|
-
- Average: 42.79%
-
- Average score: 54.87%
-
- Elapsed time: 02:53:28
-
  ## Citations

  Fernando Fernandes Neto and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.
 
  ## Eval

  evaluation [colab](https://colab.research.google.com/drive/1FpwgsGzCR4tORTxAwUxpN3PcP22En2xk?usp=sharing)
+ ## Summary of previous evaluation
  | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
  |---------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
  |[laser-dolphin-mixtral-2x7b-dpo](https://huggingface.co/macadeliccc/laser-dolphin-mixtral-2x7b-dpo)| 41.31| 73.67| 61.69| 42.79| 54.87|

+ ## Detailed current evaluation
+ | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+ |---------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[laser-dolphin-mixtral-2x7b-dpo](https://huggingface.co/macadeliccc/laser-dolphin-mixtral-2x7b-dpo)| 42.25| 73.45| 63.44| 43.96| 55.77|
+
  ### AGIEval
  | Task |Version| Metric |Value| |Stderr|
  |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |21.26|± | 2.57|
+ | | |acc_norm|21.65|± | 2.59|
+ |agieval_logiqa_en | 0|acc |34.72|± | 1.87|
+ | | |acc_norm|35.64|± | 1.88|
+ |agieval_lsat_ar | 0|acc |26.96|± | 2.93|
+ | | |acc_norm|26.96|± | 2.93|
+ |agieval_lsat_lr | 0|acc |45.88|± | 2.21|
+ | | |acc_norm|46.08|± | 2.21|
+ |agieval_lsat_rc | 0|acc |59.48|± | 3.00|
+ | | |acc_norm|59.48|± | 3.00|
+ |agieval_sat_en | 0|acc |73.79|± | 3.07|
+ | | |acc_norm|73.79|± | 3.07|
+ |agieval_sat_en_without_passage| 0|acc |42.23|± | 3.45|
+ | | |acc_norm|41.26|± | 3.44|
+ |agieval_sat_math | 0|acc |37.27|± | 3.27|
+ | | |acc_norm|33.18|± | 3.18|
+
+ Average: 42.25%
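The AGIEval average of 42.25% appears to be the unweighted mean of the eight `acc_norm` values rather than the `acc` values. That aggregation rule is inferred from the numbers in the table, not stated in the README. A minimal Python sketch that checks it, where the name `agieval_acc_norm` is made up for illustration:

```python
# acc_norm values copied from the current AGIEval table.
# Assumption: the reported average is the plain mean of acc_norm.
agieval_acc_norm = {
    "agieval_aqua_rat": 21.65,
    "agieval_logiqa_en": 35.64,
    "agieval_lsat_ar": 26.96,
    "agieval_lsat_lr": 46.08,
    "agieval_lsat_rc": 59.48,
    "agieval_sat_en": 73.79,
    "agieval_sat_en_without_passage": 41.26,
    "agieval_sat_math": 33.18,
}

agieval_average = sum(agieval_acc_norm.values()) / len(agieval_acc_norm)

# The table entries are already rounded to two decimals, so the
# recomputed mean should match the reported figure only to about 0.01.
print(f"AGIEval average: {agieval_average:.3f}")
```

The same check applied to the previous run's `acc_norm` column lands within rounding distance of its reported 41.31%.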

  ### GPT4All
  | Task |Version| Metric |Value| |Stderr|
  |-------------|------:|--------|----:|---|-----:|
+ |arc_challenge| 0|acc |58.36|± | 1.44|
+ | | |acc_norm|58.02|± | 1.44|
+ |arc_easy | 0|acc |82.20|± | 0.78|
+ | | |acc_norm|77.40|± | 0.86|
+ |boolq | 1|acc |87.52|± | 0.58|
+ |hellaswag | 0|acc |67.50|± | 0.47|
+ | | |acc_norm|84.43|± | 0.36|
+ |openbookqa | 0|acc |34.40|± | 2.13|
+ | | |acc_norm|47.00|± | 2.23|
+ |piqa | 0|acc |81.61|± | 0.90|
+ | | |acc_norm|82.59|± | 0.88|
+ |winogrande | 0|acc |77.19|± | 1.18|
+
+ Average: 73.45%
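Two GPT4All tasks (`boolq` and `winogrande`) report only `acc`. The 73.45% average is consistent with taking `acc_norm` where a task reports it and falling back to `acc` otherwise. This rule is inferred from the table, and the helper name `preferred_score` is hypothetical. A short Python sketch:

```python
# (acc, acc_norm) pairs copied from the current GPT4All table.
# None marks a metric the task does not report.
gpt4all_results = {
    "arc_challenge": (58.36, 58.02),
    "arc_easy": (82.20, 77.40),
    "boolq": (87.52, None),
    "hellaswag": (67.50, 84.43),
    "openbookqa": (34.40, 47.00),
    "piqa": (81.61, 82.59),
    "winogrande": (77.19, None),
}


def preferred_score(acc, acc_norm):
    """Return acc_norm when present, otherwise fall back to acc (assumed rule)."""
    return acc_norm if acc_norm is not None else acc


scores = [preferred_score(acc, acc_norm) for acc, acc_norm in gpt4all_results.values()]
gpt4all_average = sum(scores) / len(scores)
print(f"GPT4All average: {gpt4all_average:.2f}")
```

Averaging the `acc` column alone gives a noticeably different number, which is why the fallback rule matters when comparing against results from other harnesses.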

  ### TruthfulQA
  | Task |Version|Metric|Value| |Stderr|
  |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc| 1|mc1 |45.90|± | 1.74|
+ | | |mc2 |63.44|± | 1.56|

+ Average: 63.44%

  ### Bigbench
  | Task |Version| Metric |Value| |Stderr|
  |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59|
+ |bigbench_date_understanding | 0|multiple_choice_grade|60.70|± | 2.55|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|38.37|± | 3.03|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|21.73|± | 2.18|
+ | | |exact_str_match | 0.00|± | 0.00|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|35.00|± | 2.14|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|23.57|± | 1.61|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|50.33|± | 2.89|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|45.00|± | 2.23|
+ |bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|60.35|± | 1.09|
+ |bigbench_ruin_names | 0|multiple_choice_grade|51.12|± | 2.36|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|32.26|± | 1.48|
+ |bigbench_snarks | 0|multiple_choice_grade|67.96|± | 3.48|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|70.59|± | 1.45|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|35.80|± | 1.52|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.56|± | 1.18|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.20|± | 0.90|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|50.33|± | 2.89|
+
+ Average: 43.96%
+
+ Average score: 55.77%
+
+ Elapsed time: 02:43:45
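The overall "Average score" looks like the unweighted mean of the four per-suite averages. The sketch below recomputes it for both runs and lists the per-suite change. The rule is inferred from the summary tables, and the names `previous`, `current`, and `overall` are illustrative. Because the published figures were presumably computed from unrounded values, recomputing from the two-decimal table entries is expected to agree only to about 0.01.

```python
# Per-suite averages copied from the two summary tables.
previous = {"AGIEval": 41.31, "GPT4All": 73.67, "TruthfulQA": 61.69, "Bigbench": 42.79}
current = {"AGIEval": 42.25, "GPT4All": 73.45, "TruthfulQA": 63.44, "Bigbench": 43.96}


def overall(suite_scores):
    """Unweighted mean of the per-suite averages (assumed aggregation rule)."""
    return sum(suite_scores.values()) / len(suite_scores)


# Change per suite, current run minus previous run.
deltas = {suite: round(current[suite] - previous[suite], 2) for suite in current}

print(f"previous overall: {overall(previous):.3f}")
print(f"current overall:  {overall(current):.3f}")
for suite, delta in deltas.items():
    print(f"{suite}: {delta:+.2f}")
```

On these numbers GPT4All is the only suite that moves down between runs; the other three move up.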
 
  ## Citations

  Fernando Fernandes Neto and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.