metadata

license: cc-by-nc-4.0
language:
  - en
tags:
  - summarization

TrueTeacher

This is a factual consistency evaluation model, introduced in the TrueTeacher paper (Gekhman et al., 2023).

Model Details

The model is optimized for evaluating factual consistency in summarization.

It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1). It is based on T5-11B (Raffel et al., 2020), fine-tuned on a mixture of the following datasets:

TrueTeacher (Gekhman et al., 2023)
ANLI (Nie et al., 2020)

The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".
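A small helper can make the input format harder to get wrong. The sketch below is illustrative only: `build_input` is a hypothetical name and is not part of the released code. It simply fills the two placeholders from the format string above.

```python
def build_input(grounding_document: str, hypothesis_summary: str) -> str:
    """Return the model input in the 'premise: ... hypothesis: ...' format.

    Hypothetical helper for illustration; not part of the released model code.
    """
    return f"premise: {grounding_document} hypothesis: {hypothesis_summary}"


print(build_input("the sun is shining", "the sun is out in the sky"))
# premise: the sun is shining hypothesis: the sun is out in the sky
```

The resulting string is what gets passed to the tokenizer in the usage examples.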

To accommodate the input length of common summarization datasets, we recommend setting max_length to 2048.

The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).
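Since the label arrives as decoded text, downstream code usually needs to turn it into a boolean. The following is a minimal sketch, assuming the decoded output is exactly '1' or '0' once surrounding whitespace is stripped. `parse_label` is a hypothetical name, and raising an error on any other output is a design choice made here, not documented behavior of the model.

```python
def parse_label(generated_text: str) -> bool:
    """Map the decoded model output to True (consistent) or False (inconsistent).

    Hypothetical helper; assumes the model emits '1' or '0' as described above.
    """
    text = generated_text.strip()
    if text == "1":
        return True
    if text == "0":
        return False
    # Anything else is treated as an error rather than silently guessed.
    raise ValueError(f"unexpected model output: {generated_text!r}")


print(parse_label("1"))  # True
print(parse_label("0"))  # False
```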

Evaluation results

This model achieves the following ROC AUC results on the summarization subset of the TRUE benchmark (Honovich et al., 2022):

MNBM	QAGS-X	FRANK	SummEval	QAGS-C	Average
78.1	89.4	93.6	88.5	89.4	87.8
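For readers who want to compute the same kind of metric on their own data, the sketch below shows one way to obtain ROC AUC from gold labels and model scores. It uses the pairwise definition (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half). The function name and the toy data are hypothetical, and this is not the evaluation code used in the paper. Multiply the result by 100 to compare with the scale used in the table.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative examples.

    labels: iterable of 0/1 gold labels (1 = factually consistent).
    scores: iterable of model scores, higher meaning more likely consistent.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a quick check; for larger evaluation sets a library implementation is a better fit.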

Intended Use

This model is intended for non-commercial research use in English.

The recommended use case is evaluating factual consistency in summarization.

Out-of-scope use

Any use case that violates the cc-by-nc-4.0 license.

Usage in languages other than English.

Usage examples

classification

from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'), 
                             ('the cat is shiny', '0')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  # Generate the label and decode it to the string '1' or '0'.
  outputs = model.generate(input_ids)
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'result: {result} (expected: {expected})\n')

scoring

from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'), 
                             ('the cat is shiny', '<< 0.5')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
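  # T5 starts decoding from the pad token, so a single pad token scores the first output position.
  decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
  # Inference only, so gradient tracking is not needed.
  with torch.no_grad():
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
  logits = outputs.logits
  # Distribution over the vocabulary for the first decoded token.
  probs = torch.softmax(logits[0], dim=-1)
  # Probability assigned to the token '1' (factually consistent).
  one_token_id = tokenizer('1').input_ids[0]
  entailment_prob = probs[0, one_token_id].item()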
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'score: {entailment_prob:.3f} (expected: {expected})\n')

Citation

If you use this model for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the ANLI and T5 papers mentioned above.

@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models}, 
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}