---
title: BLEURT
emoji: 🤗 
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning, starting from a pretrained BERT model (Devlin et al. 2018)
  and then employing another pre-training phase using synthetic data. Finally, it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune
  it for your specific application (the latter is expected to perform better).

  See the project's README at https://github.com/google-research/bleurt#readme for more information.
---

# Metric Card for BLEURT


## Metric Description
BLEURT is a learned evaluation metric for Natural Language Generation. It is built in multiple phases of transfer learning: it starts from a pretrained BERT model ([Devlin et al. 2018](https://arxiv.org/abs/1810.04805)), goes through another pre-training phase on synthetic data, and is finally trained on WMT human annotations.

It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better).
See the project's [README](https://github.com/google-research/bleurt#readme) for more information.

## Intended Uses
BLEURT is intended to be used for evaluating text produced by language models. 

## How to Use

This metric takes as input lists of predicted sentences and reference sentences:

```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
```

### Inputs
- **predictions** (`list` of `str`s): List of generated sentences to score.
- **references** (`list` of `str`s): List of references to compare to.
- **checkpoint** (`str`): BLEURT checkpoint. Will default to `BLEURT-tiny` if not specified. Other models that can be chosen are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`. 
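
Since the metric returns one score per prediction, each prediction is paired with the reference at the same index. The sketch below shows one way to check inputs before scoring. `check_bleurt_inputs` is a hypothetical helper, not part of `evaluate`, and the equal-length requirement is an assumption drawn from the one-score-per-prediction output:

```python
def check_bleurt_inputs(predictions, references):
    """Hypothetical helper: validate inputs and return (prediction, reference) pairs."""
    # Assumption: both lists must have the same length, because scores are
    # returned one per prediction.
    if len(predictions) != len(references):
        raise ValueError(
            f"Got {len(predictions)} predictions but {len(references)} references"
        )
    # Both inputs are documented as lists of strings.
    for name, items in (("predictions", predictions), ("references", references)):
        for index, item in enumerate(items):
            if not isinstance(item, str):
                raise TypeError(f"{name}[{index}] is not a str: {item!r}")
    return list(zip(predictions, references))


pairs = check_bleurt_inputs(
    ["hello there", "general kenobi"],
    ["hello there", "general kenobi"],
)
```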

### Output Values
- **scores** : a `list` of scores, one per prediction. 

Output Example:
```python
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

BLEURT's output is typically a number between 0 and approximately 1, but it is not strictly bounded: the example output above contains scores slightly greater than 1. The value indicates how similar the generated text is to the reference text, with higher values representing more similar texts.
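
Because the metric returns per-sentence scores rather than a single number, a corpus-level figure has to be computed by the caller. The sketch below averages the scores from the output example above and finds the lowest-scoring prediction. Averaging is an assumption about how you might want to aggregate, and the variable names are illustrative:

```python
from statistics import mean

# Output copied from the example above, so no model download is needed here.
results = {"scores": [1.0295498371124268, 1.0445425510406494]}
predictions = ["hello there", "general kenobi"]

scores = results["scores"]

# Corpus-level score: arithmetic mean of the per-sentence scores.
corpus_score = mean(scores)

# Index of the prediction that scored lowest against its reference.
worst_index = min(range(len(scores)), key=scores.__getitem__)
worst_prediction = predictions[worst_index]

print(f"mean score: {corpus_score:.4f}")
print(f"lowest-scoring prediction: {worst_prediction!r}")
```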

#### Values from Popular Papers

The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric correlates better with human judgment than similar metrics such as BLEU and BERTScore.

BLEURT is used to compare models across different tasks (e.g. [table-to-text generation](https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT)).

### Examples

Example with the default model:
```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

Example with the `"bleurt-base-128"` model checkpoint:
```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric", checkpoint="bleurt-base-128")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

## Limitations and Bias
The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, but this depends on the model and language pair selected.

Furthermore, BLEURT currently supports only English-language scoring, since it relies on models trained on English corpora. It may also reflect, to some extent, biases and correlations that were present in the model training data.

Finally, calculating the BLEURT metric involves downloading the BLEURT model that is used to compute the score, which can take a significant amount of time depending on the model chosen. Starting with the default model, `bleurt-tiny`, and testing out larger models if necessary can be a useful approach if memory or internet speed is an issue.
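
The "start small, then scale up" approach can be sketched as a loop over checkpoints. `score_checkpoints` is a hypothetical helper, not part of `evaluate`. The loader is passed in as an argument so the loop can be tried with a stand-in before downloading any models, and the `checkpoint=` keyword simply mirrors the call shown in the Examples section:

```python
def score_checkpoints(load_metric, checkpoints, predictions, references):
    """Hypothetical helper: return {checkpoint name: mean BLEURT score}.

    `load_metric` is expected to behave like `evaluate.load`.
    """
    summary = {}
    for name in checkpoints:
        # Each iteration may trigger a download of the named checkpoint.
        metric = load_metric("bleurt", module_type="metric", checkpoint=name)
        scores = metric.compute(predictions=predictions, references=references)["scores"]
        summary[name] = sum(scores) / len(scores)
    return summary


# Intended usage (downloads models, so it is left commented out):
# from evaluate import load
# summary = score_checkpoints(
#     load,
#     ["bleurt-tiny-128", "bleurt-base-128"],  # smallest first
#     predictions=["hello there", "general kenobi"],
#     references=["hello there", "general kenobi"],
# )
```

Listing the smaller checkpoints first means that a problem with the data or setup surfaces before the larger downloads begin.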


## Citation
```bibtex
@inproceedings{bleurt,
  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
  booktitle={ACL},
  year={2020},
  url={https://arxiv.org/abs/2004.04696}
}
```

## Further References
- The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/)