chiliu committed b83fa45 (parent: c897e70): fix
README.md CHANGED
@@ -175,30 +175,30 @@ We evaluated OpenLLaMA on a wide range of tasks using [lm-evaluation-harness](ht
The original LLaMA model was trained for 1 trillion tokens and GPT-J was trained for 500 billion tokens. We present the results in the table below. OpenLLaMA exhibits comparable performance to the original LLaMA and GPT-J across a majority of tasks, and outperforms them in some tasks.

-| **Task/Metric** |
-| ---------------------- | -------- |
-| anli_r1/acc | **0.35**
-| anli_r2/acc | 0.33
-| anli_r3/acc | 0.35 | 0.
-| arc_challenge/acc | 0.35
-| arc_challenge/acc_norm | 0.37 | 0.
-| arc_easy/acc | 0.71
-| arc_easy/acc_norm | 0.65 | 0.
-| boolq/acc | **0.72**
-| hellaswag/acc | 0.49
-| hellaswag/acc_norm | 0.66 | 0.
-| openbookqa/acc | 0.26 | 0.
-| openbookqa/acc_norm | 0.40 | 0.
-| piqa/acc | 0.76
-| piqa/acc_norm | 0.76 | 0.
-| record/em | 0.88 | 0.
-| record/f1 | 0.88 | 0.
-| rte/acc | 0.55 | 0.
-| truthfulqa_mc/mc1 | **0.27**
-| truthfulqa_mc/mc2 | **0.37**
-| wic/acc | 0.49
-| winogrande/acc | 0.63
-| Average | 0.53
+| **Task/Metric** | finetuned-GPT 3B | OpenLLaMA 3B |
+| ---------------------- | -------- | ------------ |
+| anli_r1/acc | **0.35** | 0.33 |
+| anli_r2/acc | **0.33** | 0.32 |
+| anli_r3/acc | 0.35 | 0.35 |
+| arc_challenge/acc | **0.35** | 0.34 |
+| arc_challenge/acc_norm | 0.37 | 0.37 |
+| arc_easy/acc | **0.71** | 0.69 |
+| arc_easy/acc_norm | 0.65 | 0.65 |
+| boolq/acc | **0.72** | 0.66 |
+| hellaswag/acc | **0.49** | 0.43 |
+| hellaswag/acc_norm | 0.66 | 0.67 |
+| openbookqa/acc | 0.26 | 0.27 |
+| openbookqa/acc_norm | 0.40 | 0.40 |
+| piqa/acc | **0.76** | 0.75 |
+| piqa/acc_norm | 0.76 | 0.76 |
+| record/em | 0.88 | 0.88 |
+| record/f1 | 0.88 | 0.89 |
+| rte/acc | 0.55 | 0.58 |
+| truthfulqa_mc/mc1 | **0.27** | 0.22 |
+| truthfulqa_mc/mc2 | **0.37** | 0.35 |
+| wic/acc | **0.49** | 0.48 |
+| winogrande/acc | **0.63** | 0.62 |
+| Average | **0.53** | 0.52 |

We removed the tasks CB and WSC from our benchmark, as our model performs suspiciously well on these two tasks. We hypothesize that there could be benchmark data contamination in the training set.
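For readers who want to reproduce numbers in the format of the table above, the evaluation is based on EleutherAI's lm-evaluation-harness. The snippet below is a minimal sketch, not the authors' exact command: the `hf-causal` backend name, the `openlm-research/open_llama_3b` checkpoint, the task list, the zero-shot setting, and the batch size are illustrative assumptions, and task names and the `simple_evaluate` signature differ between harness versions.

```python
# Minimal sketch of an lm-evaluation-harness run (assumed v0.3-style API).
# The checkpoint, backend name, task list, and batch size are illustrative
# assumptions, not the exact configuration used for the table above.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # HuggingFace causal-LM backend
    model_args="pretrained=openlm-research/open_llama_3b",
    tasks=[
        "anli_r1", "anli_r2", "anli_r3",
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "record", "rte",
        "truthfulqa_mc", "wic", "winogrande",
    ],
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task to its metrics, e.g.
# results["results"]["boolq"]["acc"] or results["results"]["record"]["f1"].
# The "Average" row of the table is a plain mean over the reported
# per-task metric values (stderr entries excluded).
per_task = results["results"]
values = [v for metrics in per_task.values() for k, v in metrics.items()
          if not k.endswith("_stderr")]
print("average over reported metrics:", sum(values) / len(values))
```

Older harness releases expose the same functionality through a `main.py` command-line entry point, so the Python call above is only one way to produce per-task metrics in the format shown in the table.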