djstrong committed on
Commit acebd17
1 Parent(s): e333ea5
Files changed (3)
  1. README.md +2 -2
  2. src/about.py +7 -46
  3. src/leaderboard/read_evals.py +1 -2
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: Open PL LLM Leaderboard
- emoji: 🏆🇵🇱
+ title: Polish Medical Leaderboard
+ emoji: 🇵🇱🩺🏆
  colorFrom: gray
  colorTo: red
  sdk: gradio
src/about.py CHANGED
@@ -129,12 +129,7 @@ TITLE = """

  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = f"""
- The leaderboard evaluates language models on a set of Polish tasks. The tasks are designed to test the models' ability to understand and generate Polish text. The leaderboard is designed to be a benchmark for the Polish language model community, and to help researchers and practitioners understand the capabilities of different models.
- For now, models are tested without theirs templates.
-
- Almost every task has two versions: regex and multiple choice.
- * _g suffix means that a model needs to generate an answer (only suitable for instructions-based models)
- * _mc suffix means that a model is scored against every possible class (suitable also for base models)
+ The leaderboard evaluates language models on Polish Board Certification Examinations (Państwowy Egzamin Specjalizacyjny) from years 2018-2022.

  Average columns are normalized against scores by "Baseline (majority class)".

@@ -164,43 +159,13 @@ or join our [Discord SpeakLeash](https://discord.gg/FfYp4V6y3R)

  Tasks taken into account while calculating averages:
  * Average: {', '.join(all_tasks)}
- * Avg g: {', '.join(g_tasks)}
- * Avg mc: {', '.join(mc_tasks)}
- * Avg RAG: {', '.join(rag_tasks)}
-
- | Task | Dataset | Metric | Type |
- |---------------------------------|---------------------------------------|-----------|-----------------|
- | polemo2_in | allegro/klej-polemo2-in | accuracy | generate_until |
- | polemo2_in_mc | allegro/klej-polemo2-in | accuracy | multiple_choice |
- | polemo2_out | allegro/klej-polemo2-out | accuracy | generate_until |
- | polemo2_out_mc | allegro/klej-polemo2-out | accuracy | multiple_choice |
- | 8tags_mc | sdadas/8tags | accuracy | multiple_choice |
- | 8tags_g | sdadas/8tags | accuracy | generate_until |
- | belebele_mc | facebook/belebele | accuracy | multiple_choice |
- | belebele_g | facebook/belebele | accuracy | generate_until |
- | dyk_mc | allegro/klej-dyk | binary F1 | multiple_choice |
- | dyk_g | allegro/klej-dyk | binary F1 | generate_until |
- | ppc_mc | sdadas/ppc | accuracy | multiple_choice |
- | ppc_g | sdadas/ppc | accuracy | generate_until |
- | psc_mc | allegro/klej-psc | binary F1 | multiple_choice |
- | psc_g | allegro/klej-psc | binary F1 | generate_until |
- | cbd_mc | ptaszynski/PolishCyberbullyingDataset | macro F1 | multiple_choice |
- | cbd_g | ptaszynski/PolishCyberbullyingDataset | macro F1 | generate_until |
- | klej_ner_mc | allegro/klej-nkjp-ner | accuracy | multiple_choice |
- | klej_ner_g | allegro/klej-nkjp-ner | accuracy | generate_until |
- | polqa_reranking_mc | ipipan/polqa | accuracy | multiple_choice |
- | polqa_open_book_g | ipipan/polqa | levenshtein | generate_until |
- | polqa_closed_book_g | ipipan/polqa | levenshtein | generate_until |
- | poleval2018_task3_test_10k | enelpol/poleval2018_task3_test_10k | word perplexity | other |
- | polish_poquad_open_book | enelpol/poleval2018_task3_test_10k | levenshtein | generate_until |
- | polish_eq_bench_first_turn | speakleash/EQ-Bench-PL | eq_bench | generate_until |
- | polish_eq_bench | speakleash/EQ-Bench-PL | eq_bench | generate_until |
+

  ## Reproducibility
  To reproduce our results, you need to clone the repository:

  ```
- git clone https://github.com/speakleash/lm-evaluation-harness.git -b polish3
+ git clone https://github.com/speakleash/lm-evaluation-harness.git -b polish4
  cd lm-evaluation-harness
  pip install -e .
  ```
@@ -208,18 +173,14 @@ pip install -e .
  and run benchmark for 0-shot and 5-shot:

  ```
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_generate --num_fewshot 0 --output_path results/ --log_samples
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_mc --num_fewshot 0 --output_path results/ --log_samples
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_generate_few --num_fewshot 5 --output_path results/ --log_samples
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_mc --num_fewshot 5 --output_path results/ --log_samples
+ lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_pes --num_fewshot 0 --output_path results/ --log_samples
+ lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_pes --num_fewshot 5 --output_path results/ --log_samples
  ```

  With chat templates:
  ```
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_generate --num_fewshot 0 --output_path results/ --log_samples --apply_chat_template
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_mc --num_fewshot 0 --output_path results/ --log_samples --apply_chat_template
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_generate_few --num_fewshot 5 --output_path results/ --log_samples --apply_chat_template
- lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_mc --num_fewshot 5 --output_path results/ --log_samples --apply_chat_template
+ lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_pes --num_fewshot 0 --output_path results/ --log_samples --apply_chat_template
+ lm_eval --model hf --model_args pretrained=speakleash/Bielik-7B-Instruct-v0.1 --tasks polish_pes --num_fewshot 5 --output_path results/ --log_samples --apply_chat_template
  ```

  ## List of Polish models
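Both versions of INTRODUCTION_TEXT state that the Average column is normalized against the "Baseline (majority class)" score, but the formula itself is not part of this diff. As a rough illustration only (the function name and the linear-rescaling choice are assumptions, not the leaderboard's confirmed method), such a normalization could look like:

```python
def normalize_against_baseline(score: float, baseline: float, max_score: float = 100.0) -> float:
    """Illustrative sketch only: rescale a raw score so the majority-class
    baseline maps to 0 and a perfect score maps to 100. The leaderboard's
    actual normalization formula is not shown in this commit."""
    if max_score == baseline:
        return 0.0
    return (score - baseline) / (max_score - baseline) * 100.0

# Example: 75% accuracy on a task with a 50% majority-class baseline -> 50.0
print(normalize_against_baseline(75.0, 50.0))
```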
src/leaderboard/read_evals.py CHANGED
@@ -387,6 +387,7 @@ def get_raw_eval_results(results_path: str, requests_path: str, metadata) -> lis
  model_result_filepaths = []

  for root, _, files in os.walk(results_path):
+ if '_polish_pes_' not in root: continue
  # We should only have json files in model results
  if len(files) == 0 or any([not f.endswith(".json") for f in files]):
  continue
@@ -398,8 +399,6 @@ def get_raw_eval_results(results_path: str, requests_path: str, metadata) -> lis
  files = [files[-1]]

  for file in files:
- print(file)
- # if '_polish_pes_' not in file: continue
  model_result_filepaths.append(os.path.join(root, file))

  # print('PATHS:', model_result_filepaths)
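The read_evals.py change above restricts result collection to directories whose path contains '_polish_pes_'. A minimal standalone sketch of that collection loop follows; the function name and return type are illustrative, and the real get_raw_eval_results additionally keeps only the most recent result file per directory (the `files = [files[-1]]` context line above):

```python
import os

def collect_pes_result_files(results_path: str) -> list[str]:
    """Illustrative sketch of the file-collection loop after this commit:
    only directories whose path contains '_polish_pes_' are scanned, and only
    directories containing nothing but JSON result files are kept."""
    model_result_filepaths = []
    for root, _, files in os.walk(results_path):
        if '_polish_pes_' not in root:
            continue
        # We should only have json files in model results
        if len(files) == 0 or any(not f.endswith(".json") for f in files):
            continue
        for file in files:
            model_result_filepaths.append(os.path.join(root, file))
    return model_result_filepaths
```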