RicardoDominguez committed
Commit 3ae2781
Parent: 45c6daa

sc and songer

Files changed (1): src/about.py +8 -8
src/about.py CHANGED
````diff
@@ -13,8 +13,8 @@ class Task:
 class Tasks(Enum):
     # task_key in the json file, metric_key in the json file, name to display in the leaderboard
     task0 = Task("caselawqa", "exact_match,default", "CaselawQA")
-    task1 = Task("caselawqa_tiny", "exact_match,default", "CaselawQA Tiny")
-    task2 = Task("caselawqa_hard", "exact_match,default", "CaselawQA Hard")
+    task1 = Task("caselawqa_sc", "exact_match,default", "Supreme Court")
+    task2 = Task("caselawqa_songer", "exact_match,default", "Courts of Appeals")
 
 NUM_FEWSHOT = 0 # Change with your few shot
 # ---------------------------------------------------
@@ -34,14 +34,14 @@ From a substantive legal perspective, efficient solutions to such classification
 LLM_BENCHMARKS_TEXT = f"""
 ## Introduction
 
-CaselawQA is a benchmark comprising legal classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases.
+CaselawQA is a benchmark comprising legal classification tasks derived from the Supreme Court and Songer Court of Appeals legal databases.
 The majority of its 10,000 questions are multiple-choice, with 5,000 sourced from each database.
-The questions are randomly selected from the test sets of the [Lawma tasks](https://huggingface.co/datasets/ricdomolm/lawma-tasks).\
+The questions are randomly selected from the test sets of the [Lawma tasks](https://huggingface.co/datasets/ricdomolm/lawma-tasks).
+
+
 From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement.
 From a substantive legal perspective, efficient solutions to such classification problems have rich and important applications in legal research.
-CaselawQA also includes two additional subsets: CaselawQA Tiny and CaselawQA Hard.
-CaselawQA Tiny consists of 49 Lawma tasks with fewer than 150 training examples.
-CaselawQA Hard comprises tasks where [Lawma 70B](https://huggingface.co/ricdomolm/lawma-70b) achieves less than 70% accuracy.
+
 
 You can find more information in the [Lawma arXiv preprint](https://arxiv.org/abs/2407.16615) and [GitHub repository](https://github.com/socialfoundations/lawma).
 
@@ -50,7 +50,7 @@ You can find more information in the [Lawma arXiv preprint](https://arxiv.org/ab
 With evaluate CaselawQA using [this](https://github.com/socialfoundations/lm-evaluation-harness/tree/caselawqa) LM Eval Harness implementation:
 
 ```bash
-lm_eval --model hf --model_args "pretrained=<your_model>,dtype=bfloat16" --tasks caselawqa,caselawqa_tiny,caselawqa_hard --output_path=<output_path>
+lm_eval --model hf --model_args "pretrained=<your_model>,dtype=bfloat16" --tasks caselawqa --output_path=<output_path>
 """
 
 EVALUATION_QUEUE_TEXT = """
````
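For context on the first hunk: `src/about.py` appears to follow the standard Hugging Face leaderboard template, where each `Task` pairs a task key and metric key from the results JSON with a display column. A minimal sketch of the surrounding definitions, assuming the template's usual field names (`benchmark`, `metric`, and `col_name` are assumptions; only the `Task(...)` values come from this diff):

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str  # task_key in the json file
    metric: str     # metric_key in the json file
    col_name: str   # name to display in the leaderboard

class Tasks(Enum):
    # After this commit: one aggregate column plus one per source database,
    # matching the commit title "sc and songer".
    task0 = Task("caselawqa", "exact_match,default", "CaselawQA")
    task1 = Task("caselawqa_sc", "exact_match,default", "Supreme Court")
    task2 = Task("caselawqa_songer", "exact_match,default", "Courts of Appeals")
```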
 
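The updated `lm_eval` command passes only `--tasks caselawqa`; presumably `caselawqa` is a task group whose per-database subtasks also appear in the results file. A hypothetical sketch of how the leaderboard's `task_key`/`metric_key` pairs would index into that file, assuming the usual lm-evaluation-harness layout with a top-level `"results"` mapping keyed by task name (the `results.json` filename is an assumption; the harness writes a timestamped file under `--output_path`):

```python
import json

# Hypothetical path: lm_eval writes a timestamped results file under --output_path.
with open("results.json") as f:
    results = json.load(f)["results"]

# Each leaderboard column pairs a task_key with a metric_key, e.g.
# ("caselawqa_sc", "exact_match,default") -> the "Supreme Court" column.
for task_key in ("caselawqa", "caselawqa_sc", "caselawqa_songer"):
    score = results.get(task_key, {}).get("exact_match,default")
    if score is not None:
        print(f"{task_key}: {100 * score:.1f}%")
```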