alirezamsh committed
Commit f8a3dde • 1 Parent(s): 59c7641
Update README.md

README.md CHANGED
@@ -9,9 +9,9 @@ language:
 # Answer Overlap Module of QAFactEval Metric
 
 
-This is the span scorer module, used in [RQUGE paper]() to evaluate the generated questions of the question generation task.
-The model was originally used in [QAFactEval]() for computing the semantic similarity of the generated answer span, given the reference answer, context, and question in the question answering task.
-It outputs a 1-5 answer overlap score. The scorer is trained on their MOCHA dataset (initialized from [Jia et al. (2021)]()), consisting of 40k crowdsourced judgments on QA model outputs.
+This is the span scorer module, used in [RQUGE paper](https://aclanthology.org/2023.findings-acl.428/) to evaluate the generated questions of the question generation task.
+The model was originally used in [QAFactEval](https://aclanthology.org/2022.naacl-main.187/) for computing the semantic similarity of the generated answer span, given the reference answer, context, and question in the question answering task.
+It outputs a 1-5 answer overlap score. The scorer is trained on their MOCHA dataset (initialized from [Jia et al. (2021)](https://aclanthology.org/2020.emnlp-main.528/)), consisting of 40k crowdsourced judgments on QA model outputs.
 
 The input to the model is defined as:
 ```
@@ -64,12 +64,26 @@ print(outputs)
 abstract = "Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14{\%} average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.",
 }
 
-@
-title={RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
-author=
-
-
-
-
+@inproceedings{mohammadshahi-etal-2023-rquge,
+    title = "{RQUGE}: Reference-Free Metric for Evaluating Question Generation by Answering the Question",
+    author = "Mohammadshahi, Alireza and
+      Scialom, Thomas and
+      Yazdani, Majid and
+      Yanki, Pouya and
+      Fan, Angela and
+      Henderson, James and
+      Saeidi, Marzieh",
+    editor = "Rogers, Anna and
+      Boyd-Graber, Jordan and
+      Okazaki, Naoaki",
+    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
+    month = jul,
+    year = "2023",
+    address = "Toronto, Canada",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.findings-acl.428",
+    doi = "10.18653/v1/2023.findings-acl.428",
+    pages = "6845--6867",
+    abstract = "Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer modules, using pre-trained models from existing literature, thus it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on synthetic data generated by a question generation model and reranked by RQUGE.",
 }
 ```
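
The diff above only touches the description and the citation block; the README's own input template and usage snippet (the code around the `print(outputs)` referenced in the second hunk header) are outside the changed lines and are the authoritative reference. Purely as a hedged sketch of how a span scorer like this is commonly called through `transformers`, the example below assumes the checkpoint id `alirezamsh/quip-512-mocha` and a sequence-classification (regression) head, and uses a placeholder concatenation of context, question, reference answer, and candidate answer rather than the README's actual template:

```python
# Hypothetical sketch, not taken from this commit: the model id, the regression head,
# and the input formatting are assumptions; see the README's own snippet for the real usage.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "alirezamsh/quip-512-mocha"  # assumed checkpoint id for this span scorer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

context = "The Eiffel Tower, built in 1889, is one of the landmarks of Paris."
question = "Where is the Eiffel Tower located?"
reference_answer = "Paris"
candidate_answer = "in Paris, France"

# Placeholder formatting: context/question/reference in the first segment,
# candidate answer in the second; the real input template may differ.
inputs = tokenizer(
    f"{context} {question} {reference_answer}",
    candidate_answer,
    return_tensors="pt",
    truncation=True,
)

with torch.no_grad():
    outputs = model(**inputs).logits.squeeze(-1)  # expected to approximate a 1-5 overlap score

print(outputs)
```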