---
license: cc-by-nc-4.0
language:
- en
datasets:
- google/trueteacher
- anli
- cnn_dailymail
tags:
- natural-language-inference
- news-articles-summarization
---
# **TrueTeacher**
This is a **Factual Consistency Evaluation** model, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://aclanthology.org/2023.emnlp-main.127.pdf).
## Model Details
The model is optimized for evaluating factual consistency in **summarization**.
It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1), which is based on **T5-11B** [(Raffel
et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) fine-tuned with a mixture of the following datasets:
- [TrueTeacher](https://huggingface.co/datasets/google/trueteacher) ([Gekhman et al., 2023](https://arxiv.org/pdf/2305.11171.pdf))
- [ANLI](https://huggingface.co/datasets/anli) ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
The TrueTeacher dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf)
which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).
Summaries were generated using summarization models trained on the **XSum** dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf).
The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".
To accommodate the input length of common summarization datasets, we recommend setting **max_length** to **2048**.
The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).
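For example, an input can be assembled and tokenized as follows (a minimal sketch; the document and summary strings are illustrative placeholders):
```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('google/t5_11b_trueteacher_and_anli')

grounding_document = 'The sun was shining over the city all day.'  # placeholder
hypothesis_summary = 'The sun was out.'                            # placeholder

# The model expects: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY"
input_text = f'premise: {grounding_document} hypothesis: {hypothesis_summary}'
input_ids = tokenizer(
    input_text,
    return_tensors='pt',
    truncation=True,
    max_length=2048).input_ids  # 2048 covers typical news-article lengths
```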
## Evaluation results
This model achieves the following ROC AUC results on the summarization subset of the [TRUE benchmark (Honovich et al., 2022)](https://arxiv.org/pdf/2204.04991.pdf):
| **MNBM** | **QAGS-X** | **FRANK** | **SummEval** | **QAGS-C** | **Average** |
|----------|-----------|-----------|--------------|-----------|-------------|
| 78.1 | 89.4 | 93.6 | 88.5 | 89.4 | 87.8 |
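These numbers can in principle be reproduced by scoring each (document, summary) pair as in the scoring example below and computing ROC AUC against the benchmark's binary labels; a minimal sketch with scikit-learn, where `labels` and `scores` are hypothetical placeholders:
```python
from sklearn.metrics import roc_auc_score

# Hypothetical gold labels (1 = factually consistent) and model scores,
# i.e. the probability assigned to the '1' token for each example.
labels = [1, 0, 1, 1, 0]
scores = [0.91, 0.12, 0.78, 0.66, 0.34]

print(f'ROC AUC: {roc_auc_score(labels, scores):.3f}')
```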
## Intended Use
This model is intended for research use (**non-commercial**) in English.
The recommended use case is evaluating factual consistency in summarization.
## Out-of-scope use
Any use case that violates the **cc-by-nc-4.0** license.
Usage in languages other than English.
## Usage examples
#### Classification
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'),
                             ('the cat is shiny', '0')]:
    input_ids = tokenizer(
        f'premise: {premise} hypothesis: {hypothesis}',
        return_tensors='pt',
        truncation=True,
        max_length=2048).input_ids
    # The model decodes the label as text: '1' (consistent) or '0' (inconsistent).
    outputs = model.generate(input_ids)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f'premise: {premise}')
    print(f'hypothesis: {hypothesis}')
    print(f'result: {result} (expected: {expected})\n')
```
#### Scoring
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
                             ('the cat is shiny', '<< 0.5')]:
    input_ids = tokenizer(
        f'premise: {premise} hypothesis: {hypothesis}',
        return_tensors='pt',
        truncation=True,
        max_length=2048).input_ids
    # Run a single decoder step (the pad token is T5's decoder start token)
    # and take the probability of the '1' token as a consistency score.
    decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits
    probs = torch.softmax(logits[0], dim=-1)
    one_token_id = tokenizer('1').input_ids[0]
    entailment_prob = probs[0, one_token_id].item()
    print(f'premise: {premise}')
    print(f'hypothesis: {hypothesis}')
    print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
```
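Note that T5-11B is a large checkpoint: its weights take roughly 45 GB in fp32, which may not fit on a single GPU. One possible workaround, sketched below, is loading the model in bfloat16 with `device_map='auto'`; this assumes the `accelerate` package is installed and is not part of the official instructions above:
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. fp32
    device_map='auto')           # shards weights across available devices

# Inputs must then be moved to the model's device before calling
# model.generate() or model() as in the examples above.
```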
## Citation
If you use this model for a research publication, please cite the TrueTeacher paper (using the BibTeX entry below), as well as the ANLI, CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.
```
@misc{gekhman2023trueteacher,
title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
year={2023},
eprint={2305.11171},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` |