PathFinderKR committed
Commit cd0ac48
1 Parent(s): 7b1ecff

Update README.md

Files changed (1):
  1. README.md +129 -21

README.md CHANGED
@@ -6,6 +6,9 @@ license: llama3
  library_name: transformers
  datasets:
  - MarkrAI/KoCommercial-Dataset
+ tags:
+ - llama
+ - llama-3
  ---

  # Waktaverse-Llama-3-KO-8B-Instruct Model Card
@@ -199,33 +202,138 @@ packing=True

  <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
+ ### Metrics

+ #### English

+ - **AI2 Reasoning Challenge (25-shot):** a set of grade-school science questions.
+ - **HellaSwag (10-shot):** a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
+ - **MMLU (5-shot):** a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
+ - **TruthfulQA (0-shot):** a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
+ - **Winogrande (5-shot):** an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
+ - **GSM8k (5-shot):** diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.

+ #### Korean

+ - **Ko-HellaSwag:**
+ - **Ko-MMLU:**
+ - **Ko-Arc:**
+ - **Ko-Truthful QA:**
+ - **Ko-CommonGen V2:**
+
  ### Results

- [More Information Needed]
-
- #### Summary
-
-
+ #### English
+
+ <table>
+ <tr><td><strong>Benchmark</strong></td><td><strong>Waktaverse Llama 3 8B</strong></td><td><strong>Llama 3 8B</strong></td></tr>
+ <tr><td>Average</td><td>66.77</td><td>66.87</td></tr>
+ <tr><td>ARC</td><td>60.32</td><td>60.75</td></tr>
+ <tr><td>HellaSwag</td><td>78.55</td><td>78.55</td></tr>
+ <tr><td>MMLU</td><td>67.90</td><td>67.07</td></tr>
+ <tr><td>Winogrande</td><td>74.27</td><td>74.51</td></tr>
+ <tr><td>GSM8K</td><td>70.36</td><td>68.69</td></tr>
+ </table>
+
+ #### Korean
+
+ <table>
+ <tr><td><strong>Benchmark</strong></td><td><strong>Waktaverse Llama 3 8B</strong></td><td><strong>Llama 3 8B</strong></td></tr>
+ <tr><td>Ko-HellaSwag</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-MMLU</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-Arc</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-Truthful QA</td><td>0</td><td>0</td></tr>
+ <tr><td>Ko-CommonGen V2</td><td>0</td><td>0</td></tr>
+ </table>

  ## Technical Specifications
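The shot counts in the Metrics section above match the Hugging Face Open LLM Leaderboard configuration, which runs on EleutherAI's lm-evaluation-harness. As a minimal sketch of a comparable local run (the commit does not record the exact command; the repo id, dtype, batch size, and the v0.4-style task names below are assumptions):

```python
# Sketch: score the model on the six Open LLM Leaderboard tasks with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# One call per task, since each benchmark uses its own few-shot count.
import lm_eval

MODEL_ARGS = (
    "pretrained=PathFinderKR/Waktaverse-Llama-3-KO-8B-Instruct,"  # assumed repo id
    "dtype=bfloat16"
)

TASKS = [
    ("arc_challenge", 25),   # AI2 Reasoning Challenge
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),   # 0-shot, though the harness prepends 6 Q/A pairs
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,  # assumption; tune to available VRAM
    )
    print(task, results["results"][task])
```

The Korean suite (Ko-HellaSwag, Ko-MMLU, Ko-Arc, Ko-Truthful QA, Ko-CommonGen V2) is the Open Ko-LLM Leaderboard set, which runs on a Korean fork of the same harness; the zeros in the Korean table above read as placeholders for runs not yet reported.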
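One consistency check on the English table: the Metrics section lists six benchmarks, but the Results table shows five scored rows plus an Average. Assuming the Average follows the leaderboard convention of an unweighted mean over all six scores, the listed rows alone do not reproduce it:

```python
# Arithmetic check on the English results table (scores copied from above).
waktaverse = {
    "ARC": 60.32,
    "HellaSwag": 78.55,
    "MMLU": 67.90,
    "Winogrande": 74.27,
    "GSM8K": 70.36,
}

print(sum(waktaverse.values()) / len(waktaverse))  # 70.28, not the reported 66.77

# Under a six-benchmark mean, the reported 66.77 would imply an unlisted
# TruthfulQA score of about 66.77 * 6 - 351.40 = 49.22.
print(66.77 * 6 - sum(waktaverse.values()))
```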