Forrest Bao committed 2c44150 (parent: 8c8021f): "polish the language for usage"

README.md

* HHEM-2.1-Open outperforms GPT-3.5-Turbo and even GPT-4.
* HHEM-2.1-Open can run on consumer-grade hardware, occupying less than 600MB of RAM at 32-bit precision and taking around 1.5 seconds for a 2k-token input on a modern x86 CPU.

> HHEM-2.1-Open introduces breaking changes to its usage. Please update your code according to the [new usage](#using-hhem-21-open) below. We are working on making it compatible with HuggingFace's Inference Endpoint. We apologize for the inconvenience.

HHEM-2.1-Open is a major upgrade to [HHEM-1.0-Open](https://huggingface.co/vectara/hallucination_evaluation_model/tree/hhem-1.0-open) created by [Vectara](https://vectara.com) in November 2023. The HHEM model series is designed for detecting hallucinations in LLMs. These models are particularly useful in the context of building retrieval-augmented generation (RAG) applications, where a set of facts is summarized by an LLM, and HHEM can be used to measure the extent to which this summary is factually consistent with the facts.

A common type of hallucination in RAG is **factual but hallucinated**.
For example, given the premise _"The capital of France is Berlin"_, the hypothesis _"The capital of France is Paris"_ is hallucinated -- although it is true according to world knowledge. This happens when LLMs do not generate content based on the textual data provided to them as part of the RAG retrieval process, but rather generate content based on their pre-trained knowledge.

Additionally, hallucination detection is "asymmetric", i.e., not commutative. For example, the hypothesis _"I visited Iowa"_ is considered hallucinated given the premise _"I visited the United States"_, but the reverse is consistent.
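
As a quick illustration of this asymmetry, you can score the same sentence pair in both directions. Below is a minimal sketch; it uses the loading and `predict()` calls covered in the usage section that follows, and the sentences are just the example from this paragraph:

```python
from transformers import AutoModelForSequenceClassification

# Sketch: score the example pair from above in both directions.
model = AutoModelForSequenceClassification.from_pretrained(
    'vectara/hallucination_evaluation_model', trust_remote_code=True)

print(model.predict([
    ("I visited the United States", "I visited Iowa"),   # unsupported detail in the hypothesis: expect a low score
    ("I visited Iowa", "I visited the United States"),   # hypothesis follows from the premise: expect a high score
]))
```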

## Using HHEM-2.1-Open

> HHEM-2.1 introduces some breaking changes from HHEM-1.0. Code that works with HHEM-1.0 (November 2023) will no longer work. While we are working on backward compatibility, please follow the new usage instructions below.

Here we provide several ways to use HHEM-2.1-Open in the `transformers` library.

> You may run into a warning message that "Token indices sequence length is longer than the specified maximum sequence length". Please ignore it; it is inherited from the foundation model, T5-base.

### Using with `AutoModel`

This is the most end-to-end and out-of-the-box way to use HHEM-2.1-Open. It takes a list of (premise, hypothesis) pairs as input and returns a score between 0 and 1 for each pair, where 0 means the hypothesis is not evidenced at all by the premise and 1 means the hypothesis is fully supported by the premise.

```python
from transformers import AutoModelForSequenceClassification

pairs = [ # Test data, List[Tuple[str, str]]
    ("The capital of France is Berlin.", "The capital of France is Paris."), # factual but hallucinated
    ('I am in California', 'I am in United States.'), # Consistent
    # ... (additional test pairs omitted)
    ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
]

# Step 1: Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    'vectara/hallucination_evaluation_model', trust_remote_code=True)

# Step 2: Use the model to predict
model.predict(pairs) # note the predict() method. Do not do model(pairs).
# tensor([0.0111, 0.6474, 0.1290, 0.8969, 0.1846, 0.0050, 0.0543])
```
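
If you want a binary consistent vs. hallucinated call rather than a raw score, one simple option is to threshold the scores. Below is a minimal sketch that reuses `model` and `pairs` from the block above; the 0.5 cutoff is an assumption made for illustration, not a threshold recommended by this card:

```python
# Sketch: turn HHEM scores into binary judgments with a cutoff.
# The 0.5 cutoff is an illustrative assumption, not an official recommendation.
scores = model.predict(pairs)  # 1-D tensor of scores in [0, 1], one per (premise, hypothesis) pair
judgments = ["consistent" if s >= 0.5 else "hallucinated" for s in scores.tolist()]
for (premise, hypothesis), label in zip(pairs, judgments):
    print(f"{label:12} premise={premise!r} hypothesis={hypothesis!r}")
```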

### Using with `pipeline`

In the popular `pipeline` class of the `transformers` library, you have to manually prepare the data using the prompt template with which we trained the model. HHEM-2.1-Open has two output neurons, corresponding to the labels `hallucinated` and `consistent`, respectively. In the example below, we ask `pipeline` to return the scores for both labels (by setting `top_k=None`, formerly `return_all_scores=True`) and then extract the score for the `consistent` label.

```python
from transformers import pipeline, AutoTokenizer

pairs = [ # Test data, List[Tuple[str, str]]
    ("The capital of France is Berlin.", "The capital of France is Paris."),
    ('I am in California', 'I am in United States.'),
    ('I am in United States', 'I am in California.'),
    # ... (additional test pairs omitted)
    ("Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg.")
]

# Prompt the pairs
prompt = "<pad> Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"
input_pairs = [prompt.format(text1=pair[0], text2=pair[1]) for pair in pairs]

classifier = pipeline(
    "text-classification",
    model='vectara/hallucination_evaluation_model',
    tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
    trust_remote_code=True
)
full_scores = classifier(input_pairs, top_k=None) # List[List[Dict[str, float]]]

# Optional: Extract the scores for the 'consistent' label
simple_scores = [score_dict['score'] for score_for_both_labels in full_scores for score_dict in score_for_both_labels if score_dict['label'] == 'consistent']

print(simple_scores)
# Expected output: [0.011061512865126133, 0.6473632454872131, 0.1290171593427658, 0.8969419002532959, 0.18462494015693665, 0.005031010136008263, 0.05432349815964699]
```

Of course, with `pipeline`, you can also get the most likely label, or the label with the highest score, by setting `top_k=1`.
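
For example, the sketch below reuses `classifier` and `input_pairs` from the block above; note that the exact nesting of the returned objects can vary slightly across `transformers` versions:

```python
# Keep only the highest-scoring label for each input.
top_scores = classifier(input_pairs, top_k=1)

# Each entry now holds just the winning label and its score,
# e.g. roughly [{'label': 'hallucinated', 'score': 0.99}] for the first (hallucinated) pair.
print(top_scores[0])
```
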
## HHEM-2.1-Open vs. HHEM-1.0