---
license: cc-by-nc-4.0
language:
- en
datasets:
- google/trueteacher
- anli
- cnn_dailymail
tags:
- natural-language-inference
- news-articles-summarization
---

# **TrueTeacher**

This is a **Factual Consistency Evaluation** model, introduced in the [TrueTeacher paper (Gekhman et al, 2023)](https://arxiv.org/pdf/2305.11171.pdf).

## Model Details

The model is optimized for evaluating factual consistency in **summarization**.

It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1) which is based on a **T5-11B** [(Raffel
et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) fine-tuned with a mixture of the following datasets:
- TrueTeacher ([Gekhman et al., 2023](https://arxiv.org/pdf/2305.11171.pdf))
- ANLI ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))

The TrueTeacher dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf) 
which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al.,2022)](https://arxiv.org/pdf/2210.11416.pdf). 
Summaries were generated using summarization models which trained on the **XSum** dataset [(Narayan207et  al.,  2018)](https://aclanthology.org/D18-1206.pdf).

The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".

To accomodate the input length of common summarization datasets we recommend setting **max_length** to **2048**.

The model predicts a binary label ('1' - Factualy Consistent, '0' - Factualy Inconsistent).

## Evaluation results

This model achieves the following ROC AUC results on the summarization subset of the [TRUE benchmark (Honovich et al, 2022)](https://arxiv.org/pdf/2204.04991.pdf):

| **MNBM** | **QAGS-X** | **FRANK** | **SummEval** | **QAGS-C** | **Average** |
|----------|-----------|-----------|--------------|-----------|-------------|
| 78.1     | 89.4      | 93.6      | 88.5         | 89.4      | 87.8        |


## Intended Use

This model is intended for a research use (**non-commercial**) in English.

The reccomended use case is evaluating factual consistency in summarization.

## Out-of-scope use 
Any use cases which violate the **cc-by-nc-4.0** license.

Usage in languages other than English. 

## Usage examples

#### classification
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'), 
                             ('the cat is shiny', '0')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  outputs = model.generate(input_ids)
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'result: {result} (expected: {expected})\n')
```

#### scoring
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'), 
                             ('the cat is shiny', '<< 0.5')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
  outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
  logits = outputs.logits
  probs = torch.softmax(logits[0], dim=-1)
  one_token_id = tokenizer('1').input_ids[0]
  entailment_prob = probs[0, one_token_id].item()
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
```

## Citation

If you use this model for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the ANLI, CNN/DailyMail, XSum and T5 papers mentioned above.

```
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models}, 
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```