PathFinderKR committed
Commit cd0ac48
1 Parent(s): 7b1ecff

Update README.md

Files changed (1):
  1. README.md +129 -21

README.md CHANGED
@@ -6,6 +6,9 @@ license: llama3
  library_name: transformers
  datasets:
  - MarkrAI/KoCommercial-Dataset
+ tags:
+ - llama
+ - llama-3
  ---

  # Waktaverse-Llama-3-KO-8B-Instruct Model Card
@@ -199,33 +202,138 @@ packing=True

  <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
+ ### Metrics

+ #### English

+ - **AI2 Reasoning Challenge (25-shot):** a set of grade-school science questions.
+ - **HellaSwag (10-shot):** a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
+ - **MMLU (5-shot):** a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
+ - **TruthfulQA (0-shot):** a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
+ - **Winogrande (5-shot):** an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
+ - **GSM8k (5-shot):** diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.

+ #### Korean

+ - **Ko-HellaSwag:**
+ - **Ko-MMLU:**
+ - **Ko-Arc:**
+ - **Ko-Truthful QA:**
+ - **Ko-CommonGen V2:**
+
  ### Results

- [More Information Needed]
-
- #### Summary
-
-
+ #### English
+
+ <table>
+ <tr><td><strong>Benchmark</strong></td><td><strong>Waktaverse Llama 3 8B</strong></td><td><strong>Llama 3 8B</strong></td></tr>
+ <tr><td>Average</td><td>66.77</td><td>66.87</td></tr>
+ <tr><td>ARC</td><td>60.32</td><td>60.75</td></tr>
+ <tr><td>HellaSwag</td><td>78.55</td><td>78.55</td></tr>
+ <tr><td>MMLU</td><td>67.90</td><td>67.07</td></tr>
+ <tr><td>Winogrande</td><td>74.27</td><td>74.51</td></tr>
+ <tr><td>GSM8K</td><td>70.36</td><td>68.69</td></tr>
+ </table>
+
+ #### Korean
+
+ <table>
+ <tr><td><strong>Benchmark</strong></td><td><strong>Waktaverse Llama 3 8B</strong></td><td><strong>Llama 3 8B</strong></td></tr>
+ <tr><td>Ko-HellaSwag</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-MMLU</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-Arc</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-Truthful QA</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-CommonGen V2</td><td>0</td><td>0</td></tr>
+ </table>

  ## Technical Specifications
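The shot counts in the Metrics section above match the Hugging Face Open LLM Leaderboard configuration, which runs on EleutherAI's lm-evaluation-harness. As a minimal sketch of a comparable local run (the commit does not record the exact command; the repo id, dtype, batch size, and the v0.4-style task names below are assumptions):

```python
# Sketch: score the model on the six Open LLM Leaderboard tasks with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# One call per task, since each benchmark uses its own few-shot count.
import lm_eval

MODEL_ARGS = (
    "pretrained=PathFinderKR/Waktaverse-Llama-3-KO-8B-Instruct,"  # assumed repo id
    "dtype=bfloat16"
)

TASKS = [
    ("arc_challenge", 25),   # AI2 Reasoning Challenge
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),   # 0-shot, though the harness prepends 6 Q/A pairs
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,  # assumption; tune to available VRAM
    )
    print(task, results["results"][task])
```

The Korean suite (Ko-HellaSwag, Ko-MMLU, Ko-Arc, Ko-Truthful QA, Ko-CommonGen V2) is the Open Ko-LLM Leaderboard set, which runs on a Korean fork of the same harness; the zeros in the Korean table above read as placeholders for runs not yet reported.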
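One consistency check on the English table: the Metrics section lists six benchmarks, but the Results table shows five scored rows plus an Average. Assuming the Average follows the leaderboard convention of an unweighted mean over all six scores, the listed rows alone do not reproduce it:

```python
# Arithmetic check on the English results table (scores copied from above).
waktaverse = {
    "ARC": 60.32,
    "HellaSwag": 78.55,
    "MMLU": 67.90,
    "Winogrande": 74.27,
    "GSM8K": 70.36,
}

print(sum(waktaverse.values()) / len(waktaverse))  # 70.28, not the reported 66.77

# Under a six-benchmark mean, the reported 66.77 would imply an unlisted
# TruthfulQA score of about 66.77 * 6 - 351.40 = 49.22.
print(66.77 * 6 - sum(waktaverse.values()))
```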