Spaces:
Running
Running
title: BLEURT | |
emoji: 🤗 | |
colorFrom: blue | |
colorTo: red | |
sdk: gradio | |
sdk_version: 3.19.1 | |
app_file: app.py | |
pinned: false | |
tags: | |
- evaluate | |
- metric | |
description: >- | |
BLEURT a learnt evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018) | |
and then employing another pre-training phrase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune | |
it for your specific application (the latter is expected to perform better). | |
See the project's README at https://github.com/google-research/bleurt#readme for more information. | |
# Metric Card for BLEURT | |
## Metric Description | |
BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model [Devlin et al. 2018](https://arxiv.org/abs/1810.04805), employing another pre-training phrase using synthetic data, and finally trained on WMT human annotations. | |
It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). | |
See the project's [README](https://github.com/google-research/bleurt#readme) for more information. | |
## Intended Uses | |
BLEURT is intended to be used for evaluating text produced by language models. | |
## How to Use | |
This metric takes as input lists of predicted sentences and reference sentences: | |
```python | |
>>> from evaluate import load | |
>>> predictions = ["hello there", "general kenobi"] | |
>>> references = ["hello there", "general kenobi"] | |
>>> bleurt = load("bleurt", module_type="metric") | |
>>> results = bleurt.compute(predictions=predictions, references=references) | |
``` | |
### Inputs | |
- **predictions** (`list` of `str`s): List of generated sentences to score. | |
- **references** (`list` of `str`s): List of references to compare to. | |
- **checkpoint** (`str`): BLEURT checkpoint. Will default to `BLEURT-tiny` if not specified. Other models that can be chosen are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`. | |
### Output Values | |
- **scores** : a `list` of scores, one per prediction. | |
Output Example: | |
```python | |
{'scores': [1.0295498371124268, 1.0445425510406494]} | |
``` | |
BLEURT's output is always a number between 0 and (approximately 1). This value indicates how similar the generated text is to the reference texts, with values closer to 1 representing more similar texts. | |
#### Values from Popular Papers | |
The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric is better correlated with human judgment compared to similar metrics such as BERT and BERTscore. | |
BLEURT is used to compare models across different asks (e.g. (Table to text generation)[https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT]). | |
### Examples | |
Example with the default model: | |
```python | |
>>> predictions = ["hello there", "general kenobi"] | |
>>> references = ["hello there", "general kenobi"] | |
>>> bleurt = load("bleurt", module_type="metric") | |
>>> results = bleurt.compute(predictions=predictions, references=references) | |
>>> print(results) | |
{'scores': [1.0295498371124268, 1.0445425510406494]} | |
``` | |
Example with the `"bleurt-base-128"` model checkpoint: | |
```python | |
>>> predictions = ["hello there", "general kenobi"] | |
>>> references = ["hello there", "general kenobi"] | |
>>> bleurt = load("bleurt", module_type="metric", checkpoint="bleurt-base-128") | |
>>> results = bleurt.compute(predictions=predictions, references=references) | |
>>> print(results) | |
{'scores': [1.0295498371124268, 1.0445425510406494]} | |
``` | |
## Limitations and Bias | |
The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, but this depends on the model and language pair selected. | |
Furthermore, currently BLEURT only supports English-language scoring, given that it leverages models trained on English corpora. It may also reflect, to a certain extent, biases and correlations that were present in the model training data. | |
Finally, calculating the BLEURT metric involves downloading the BLEURT model that is used to compute the score, which can take a significant amount of time depending on the model chosen. Starting with the default model, `bleurt-tiny`, and testing out larger models if necessary can be a useful approach if memory or internet speed is an issue. | |
## Citation | |
```bibtex | |
@inproceedings{bleurt, | |
title={BLEURT: Learning Robust Metrics for Text Generation}, | |
author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh}, | |
booktitle={ACL}, | |
year={2020}, | |
url={https://arxiv.org/abs/2004.04696} | |
} | |
``` | |
## Further References | |
- The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/) | |