MoritzLaurer/deberta-v3-base-zeroshot-v1.1-all-33

@@ -57,6 +57,57 @@ print(output)
 ### Details on data and training
 The code for preparing the data and training & evaluating the model is fully open-source here: https://github.com/MoritzLaurer/zeroshot-classifier/tree/main
+## Metrics
+Balanced accuracy (in percent) is reported for every dataset.
+`deberta-v3-base-zeroshot-v1.1-all-33` was trained on all datasets, using at most 500 texts per class to avoid overfitting.
+Its metrics on these datasets are therefore not strictly zeroshot, because the model has seen some data for each task.
+The `deberta-v3-base-zeroshot-v1.1-heldout` column indicates zeroshot performance on the respective dataset.
+To calculate these zeroshot metrics, the pipeline was run 28 times, each time with one dataset held out from training to simulate a zeroshot setup.
+No held-out value is reported for the NLI datasets (`nan` in the table).
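The held-out procedure described above can be sketched as a leave-one-dataset-out loop. This is only an illustration under stated assumptions: `run_heldout_evaluation`, `train_fn`, `eval_fn`, and the toy data are hypothetical names that do not come from the linked repository, whose actual pipeline may be organised differently.

```python
def run_heldout_evaluation(datasets, train_fn, eval_fn):
    """Train once per dataset with that dataset excluded, then score on it.

    `train_fn` and `eval_fn` are hypothetical callables supplied by the caller;
    they stand in for the real training and evaluation code.
    """
    scores = {}
    for held_out_name, held_out_data in datasets.items():
        # Everything except the held-out dataset is used for training.
        train_sets = {
            name: data for name, data in datasets.items() if name != held_out_name
        }
        model = train_fn(train_sets)
        # The held-out dataset is only ever used for evaluation.
        scores[held_out_name] = eval_fn(model, held_out_data)
    return scores


# Toy stand-ins (made up for illustration): the "model" is just the set of
# dataset names it was trained on, and the "score" lists those names.
toy_datasets = {"spam": ["text a"], "imdb": ["text b"], "agnews": ["text c"]}


def toy_train(train_sets):
    return frozenset(train_sets)


def toy_eval(model, data):
    return sorted(model)


heldout_runs = run_heldout_evaluation(toy_datasets, toy_train, toy_eval)
for name, seen in heldout_runs.items():
    print(f"{name}: evaluated with a model trained on {seen}")
```

With 28 datasets, a loop of this shape trains 28 separate models, which matches the 28 pipeline runs mentioned above.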
+![figure_base_v1.1](https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/results/fig_base_v1.1.png)
+| dataset (n classes)        |   deberta-v3-base-mnli-fever-anli-ling-wanli-binary |   deberta-v3-base-zeroshot-v1.1-heldout |   deberta-v3-base-zeroshot-v1.1-all-33 |
+|:---------------------------|---------------------------:|----------------------------------------:|---------------------------------------:|
+| datasets mean (w/o nli)    |                       62   |                                    70.7 |                                   84   |
+| amazonpolarity (2)         |                       91.7 |                                    95.7 |                                   96   |
+| imdb (2)                   |                       87.3 |                                    93.6 |                                   94.5 |
+| appreviews (2)             |                       91.3 |                                    92.2 |                                   94.4 |
+| yelpreviews (2)            |                       95.1 |                                    97.4 |                                   98.3 |
+| rottentomatoes (2)         |                       83   |                                    88.7 |                                   90.8 |
+| emotiondair (6)            |                       46.5 |                                    42.6 |                                   74.5 |
+| emocontext (4)             |                       58.5 |                                    57.4 |                                   81.2 |
+| empathetic (32)            |                       31.3 |                                    37.3 |                                   52.7 |
+| financialphrasebank (3)    |                       78.3 |                                    68.9 |                                   91.2 |
+| banking77 (72)             |                       18.9 |                                    46   |                                   73.7 |
+| massive (59)               |                       44   |                                    56.6 |                                   78.9 |
+| wikitoxic_toxicaggreg (2)  |                       73.7 |                                    82.5 |                                   90.5 |
+| wikitoxic_obscene (2)      |                       77.3 |                                    91.6 |                                   92.6 |
+| wikitoxic_threat (2)       |                       83.5 |                                    95.2 |                                   96.7 |
+| wikitoxic_insult (2)       |                       79.6 |                                    91   |                                   91.6 |
+| wikitoxic_identityhate (2) |                       83.9 |                                    88   |                                   94.4 |
+| hateoffensive (3)          |                       55.2 |                                    66.1 |                                   86   |
+| hatexplain (3)             |                       44.1 |                                    57.6 |                                   76.9 |
+| biasframes_offensive (2)   |                       56.8 |                                    85.4 |                                   87   |
+| biasframes_sex (2)         |                       85.4 |                                    87   |                                   91.8 |
+| biasframes_intent (2)      |                       56.3 |                                    85.2 |                                   87.8 |
+| agnews (4)                 |                       77.3 |                                    80   |                                   90.5 |
+| yahootopics (10)           |                       53.6 |                                    57.7 |                                   72.8 |
+| trueteacher (2)            |                       51.4 |                                    49.5 |                                   82.4 |
+| spam (2)                   |                       51.8 |                                    50   |                                   97.2 |
+| wellformedquery (2)        |                       49.9 |                                    52.5 |                                   77.2 |
+| manifesto (56)             |                        5.8 |                                    18.9 |                                   39.1 |
+| capsotu (21)               |                       25.2 |                                    64   |                                   72.5 |
+| mnli_m (2)                 |                       92.4 |                                   nan   |                                   92.7 |
+| mnli_mm (2)                |                       92.4 |                                   nan   |                                   92.5 |
+| fevernli (2)               |                       89   |                                   nan   |                                   89.1 |
+| anli_r1 (2)                |                       79.4 |                                   nan   |                                   80   |
+| anli_r2 (2)                |                       68.4 |                                   nan   |                                   68.4 |
+| anli_r3 (2)                |                       66.2 |                                   nan   |                                   68   |
+| wanli (2)                  |                       81.6 |                                   nan   |                                   81.8 |
+| lingnli (2)                |                       88.4 |                                   nan   |                                   88.4 |
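For readers unfamiliar with the metric in the table above, here is a minimal sketch of balanced accuracy, assuming the standard definition (the unweighted mean of per-class recall). The function and the toy labels are illustrative and are not taken from the evaluation code in the linked repository. The function returns a fraction between 0 and 1, whereas the table reports percentages.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for true_label, predicted_label in zip(y_true, y_pred):
        total[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 8 "ham" texts and 2 "spam" texts.
y_true = ["ham"] * 8 + ["spam"] * 2
# A classifier that always predicts the majority class.
y_pred = ["ham"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The toy case shows why the metric is useful on imbalanced data: a constant prediction gets 80% plain accuracy here but only 50% balanced accuracy. Under this definition, a score near 50 on a two-class dataset is therefore no better than always predicting one class.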
 ## Limitations and bias
 The model can only do text classification tasks.