---
title: HaRiM+
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.9
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  HaRiM+ is a reference-less metric for summary quality evaluation that harnesses the power of a summarization model to estimate the quality of a summary-article pair. <br />
  Note that this metric is reference-free and does not require training. It is ready to use without a reference text to compare the generation against and without any model training for scoring.
---


# HaRiM+
HaRiM+: Evaluating Summary Quality with Hallucination Risk, accepted at AACL-22 ([paper](https://arxiv.org/abs/2211.12118)). <br />
<br />
HaRiM+ is a reference-less metric for the summarization task that harnesses the power of a summarization model to estimate the quality of a summary-article pair. <br />
Note that this metric is reference-free and does not require training. It is ready to use without a reference text to compare the generation against and without any model training for scoring.
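
To give a feel for how a score can be built from token likelihoods alone, here is a toy sketch in plain Python. It is not the released implementation: the risk term, the weight `lam`, and the probabilities are assumptions made for illustration, loosely following the paper's description of combining the summary log-likelihood with a hallucination-risk penalty. `p_s2s` stands for per-token probabilities given the article, and `p_lm` for per-token probabilities without the article.

```python
import math

def toy_harim_plus(p_s2s, p_lm, lam=7.0):
    """Toy score: mean log-likelihood minus lam * mean risk (illustrative only).

    p_s2s: per-token probabilities of the summary given the article (hypothetical).
    p_lm:  per-token probabilities of the summary without the article (hypothetical).
    lam:   weight of the risk term (an assumed value, not read from the library).
    """
    if not p_s2s or len(p_s2s) != len(p_lm):
        raise ValueError("p_s2s and p_lm must be non-empty and of equal length")
    n = len(p_s2s)
    # Average log-likelihood of the summary tokens given the article.
    log_likelihood = sum(math.log(p) for p in p_s2s) / n
    # Risk is high when the model is unsure (low p_s2s) and the article adds
    # little over the article-free guess (small p_s2s - p_lm).
    risk = sum((1 - ps) * (1 - (ps - pl)) for ps, pl in zip(p_s2s, p_lm)) / n
    return log_likelihood - lam * risk

# Hypothetical token probabilities, chosen by hand for the demonstration.
grounded = toy_harim_plus([0.9, 0.8, 0.85], [0.2, 0.1, 0.3])
ungrounded = toy_harim_plus([0.3, 0.2, 0.25], [0.3, 0.2, 0.25])
print(grounded > ungrounded)
```

In this toy setting, tokens that become much more likely once the article is visible are rewarded, while tokens the model would have guessed equally well without the article are penalized. The scores returned by the actual metric may be scaled or shifted differently.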

## Quick Start
### install
```bash
pip install evaluate
```
### example
You can clone this space and run <code>python test_harim_score.py [--pretrained_name CKPTNAME_FOR_S2SLM]</code>, or try the example below. <br />
(Running on CPU is possible, but it is expected to be too slow for practical use.)

```python
import evaluate
from pprint import pprint

art = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose 5-1 drubbing of Spain in the group stage in Brazil last year marked the end of the Iberian nation's six-year domination of the world game, scored two early goals at the Amsterdam Arena and held on against some determined Spain pressure in the second half for a 2-0 success. They became the first team to inflict two defeats on Del Bosque since he took over in 2008 but the gruff 64-year-old had used the match to try out several new faces and he fielded a largely experimental, second-string team. Stefan de Vrij (right) headed Holland in front against Spain at the Amsterdam Arena on Tuesday Gerard Pique (left) could do nothing to stop Davy Klaassen doubling the Dutch advantage Malaga forward Juanmi and Sevilla midfielder Vitolo became the 55th and 56th players to debut under Del Bosque, while the likes of goalkeeper David de Gea, defenders Raul Albiol, Juan Bernat and Dani Carvajal and midfielder Mario Suarez all started the game. 'The national team's state of health is good,' centre back Gerard Pique told reporters. 'We are in a process where players are coming into the team and gathering experience,' added the Barcelona defender. 'We are second in qualifying (for Euro 2016) and these friendly games are for experimenting. 'I am not that worried about this match because we lost friendlies in previous years and then ended up winning titles.' David de Gea was given a start by Vicente del Bosque but could not keep out De Vrij's header here Dani Carvajal (centre) was another squad player given a chance to impress against Holland Del Bosque will be confident he can find the right mix of players to secure Spain's berth at Euro 2016 in France next year, when they will be chasing an unprecedented third straight title. 
Slovakia are the surprise leaders in qualifying Group C thanks to a 2-1 win over Spain in Zilina in October and have a maximum 15 points from five of 10 matches. Spain are second on 12 points, three ahead of Ukraine, who they beat 1-0 in Seville on Friday. Del Bosque's side host Slovakia in September in a match that could decide who goes through to the finals as group winners. 'The team is in good shape,' forward Pedro told reporters. 'We have a very clear idea of our playing style and we are able to count on people who are gradually making a place for themselves in the team.'"""

summaries = [
  "holland beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
  "holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque will be hoping to find the right mix of players to the world cup . gerard pique could make the right mix of players to the tournament .",
  "del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
  "holland could not beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
]
# Pair every candidate summary with the same source article.
articles = [art] * len(summaries)

scorer = evaluate.load('NCSOFT/harim_plus')
# `references` takes the source articles, not gold summaries.
scores = scorer.compute(predictions=summaries, references=articles)  # optional: use_aggregator=False, bsz=32, return_details=False, tokenwise_score=False
pprint([round(s, 4) for s in scores])
# >>> [2.7096, 3.7338, 2.669, 2.4039, 2.3759]
```
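
A common follow-up is to order candidate summaries by their score. The helper below is a minimal sketch, assuming that a higher HaRiM+ score indicates a better summary. `rank_by_score` is a hypothetical helper, not part of `evaluate`, and the scores in the demonstration are made-up numbers rather than real metric outputs.

```python
def rank_by_score(summaries, scores):
    """Return (score, summary) pairs sorted from highest to lowest score."""
    if len(summaries) != len(scores):
        raise ValueError("summaries and scores must have the same length")
    return sorted(zip(scores, summaries), key=lambda pair: pair[0], reverse=True)

# Hypothetical scores for illustration only, not real HaRiM+ outputs.
demo_summaries = ["summary a", "summary b", "summary c"]
demo_scores = [1.5, 3.0, 2.25]

ranked = rank_by_score(demo_summaries, demo_scores)
for score, text in ranked:
    print(f"{score:.4f}  {text}")
```

With real data, pass the `summaries` list and the list returned by `scorer.compute(...)` instead of the demo values.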

## Powering HaRiM+ score with other summarization model checkpoints
HaRiM+ accepts any encoder-decoder checkpoint compatible with <code>transformers.AutoModelForSeq2SeqLM</code>. <br />
In principle, the HaRiM+ score is expected to work for machine translation as well. It does work there, but not better than BARTScore (Yuan et al.), whereas it excels at the summarization task.

```python
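import evaluate

# Replace the placeholder with a local path or a checkpoint name; a custom tokenizer can optionally be passed as well.
newharim = evaluate.load('NCSOFT/harim_plus', pretrained_name='local or ckpt name available')  # , tokenizer=custom_tokenizer)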
```
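
Before swapping in another checkpoint, it can help to confirm that it is an encoder-decoder model. The sketch below reads the `is_encoder_decoder` flag from a `transformers` configuration. The helper names are hypothetical, and the flag is only a quick sanity check under the assumption that the checkpoint ships a standard config; it does not guarantee that the checkpoint produces sensible HaRiM+ scores.

```python
from types import SimpleNamespace

def looks_like_seq2seq(config):
    """Return True if a config object reports an encoder-decoder architecture."""
    return bool(getattr(config, "is_encoder_decoder", False))

def checkpoint_is_seq2seq(pretrained_name):
    """Load the config of a checkpoint (local path or hub name) and check it."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoConfig
    return looks_like_seq2seq(AutoConfig.from_pretrained(pretrained_name))

# Offline demonstration with stand-in config objects (hypothetical values).
bart_like = SimpleNamespace(is_encoder_decoder=True)
decoder_only_like = SimpleNamespace(is_encoder_decoder=False)
print(looks_like_seq2seq(bart_like), looks_like_seq2seq(decoder_only_like))
```

Calling `checkpoint_is_seq2seq` on a real checkpoint name downloads or reads its configuration, so it needs `transformers` installed and, for hub names, network access.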

## Speed and Resource requirements
HaRiM+ requires a GPU for practical speed, but it only loads the encoder-decoder model of your choice (default: <code>facebook/bart-large-cnn</code>). Empirically, its resource requirements and speed are similar to those of BERTScore.

## Citation
Please cite as follows:
```
@inproceedings{son-etal-2022-harim,
    title = "{H}a{R}i{M}$^+$: Evaluating Summary Quality with Hallucination Risk",
    author = "Son, Seonil (Simon)  and
      Park, Junsoo  and
      Hwang, Jeong-in  and
      Lee, Junghwa  and
      Noh, Hyungjong  and
      Lee, Yeonsoo",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
    month = nov,
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-main.66",
    pages = "895--924",
    abstract = "One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.",
}
```