Clarification on the Anyscale accuracy score benchmark

#6
by pjoshi30 - opened

Hi Team,

We are working on publishing a comparison of different hallucination evaluation techniques (including our own). We have been able to reproduce your results on the TRUE and SummaC datasets. However, we were unable to reproduce the 86% accuracy result on the Anyscale Ranking test. I am looking for clarification on how you performed this test. My understanding is as follows:

  • Start with the sentence pairs dataset as used in the Anyscale notebook.
  • Use the article_sent column as the source of truth.
  • Check the correct_sent column against the article_sent column for consistency.
  • Repeat the same for the incorrect_sent column.
  • Calculate accuracy based on the number of right answers. The dataset has 373 rows, which translates to 746 items (since we use both the correct_sent and the incorrect_sent from each row).

Based on the method above, we are seeing an accuracy of 66.35% using the existing Vectara model. Let me know if I am missing something here or if you used a different dataset/methodology to calculate the accuracy.
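
For concreteness, here is a minimal sketch of the pointwise scoring described above. The CSV path is a placeholder for a local copy of the Anyscale sentence-pairs dataset, the model is loaded as a sentence-transformers CrossEncoder (an assumption on our side), and the 0.5 decision threshold is likewise an assumption rather than anything specified by the benchmark:

```python
import pandas as pd
from sentence_transformers import CrossEncoder

# Placeholder path to a local copy of the Anyscale sentence-pairs dataset,
# with the article_sent / correct_sent / incorrect_sent columns.
df = pd.read_csv("anyscale_sentence_pairs.csv")

# Assumption: the model is usable as a sentence-transformers CrossEncoder;
# higher score = more consistent with the article sentence.
model = CrossEncoder("vectara/hallucination_evaluation_model")

consistent_scores = model.predict(list(zip(df["article_sent"], df["correct_sent"])))
inconsistent_scores = model.predict(list(zip(df["article_sent"], df["incorrect_sent"])))

# Pointwise (absolute) accuracy over 2 * len(df) items: correct_sent should be
# judged consistent, incorrect_sent inconsistent. The 0.5 threshold is assumed.
threshold = 0.5
hits = (consistent_scores > threshold).sum() + (inconsistent_scores <= threshold).sum()
accuracy = hits / (2 * len(df))
print(f"Pointwise accuracy: {accuracy:.2%}")
```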

Thank you!

Vectara org
edited Mar 6

Hello Preetam,

I was able to replicate the results Simon achieved. The key is the following sentence in the prompt used by Anyscale:

Decide which of the following summary is more consistent with the article sentence.
Note that consistency means all information in the summary is supported by the article. 

Crucially, the LLMs are asked to make a relative comparison of consistency, not an absolute judgement. The former is an easier problem than the latter.
In other words, given two summaries of the same article sentence, summary_good and summary_bad, the accuracy is the percentage of cases where hhem(summary_good) > hhem(summary_bad).

While investigating the issue you raised, I also realized that the following rows of the dataset are invalid because the correct and incorrect sentences are identical: 44, 180, 328.
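
For reference, here is a minimal sketch of this ranking-style evaluation, under the same assumptions as in your sketch (a placeholder path to a local copy of the Anyscale sentence-pairs CSV, and the model loaded as a sentence-transformers CrossEncoder). No decision threshold is needed, since only the relative ordering of the two scores matters; the identical pairs mentioned above are dropped first:

```python
import pandas as pd
from sentence_transformers import CrossEncoder

# Placeholder path to a local copy of the Anyscale sentence-pairs dataset.
df = pd.read_csv("anyscale_sentence_pairs.csv")

# Drop rows where the "correct" and "incorrect" sentences are identical
# (the invalid rows mentioned above), since they cannot be ranked.
df = df[df["correct_sent"] != df["incorrect_sent"]]

# Assumption: the model is usable as a sentence-transformers CrossEncoder.
model = CrossEncoder("vectara/hallucination_evaluation_model")

good_scores = model.predict(list(zip(df["article_sent"], df["correct_sent"])))
bad_scores = model.predict(list(zip(df["article_sent"], df["incorrect_sent"])))

# Ranking accuracy: fraction of rows where the consistent summary scores
# strictly higher than the inconsistent one, i.e. hhem(good) > hhem(bad).
ranking_accuracy = (good_scores > bad_scores).mean()
print(f"Ranking accuracy: {ranking_accuracy:.2%}")
```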

Hi Amin,

Thank you for looking into this. This makes sense. It would be great to clarify in the model card that this accuracy reflects a relative comparison (more similar to a ranking task).

Also, I think it would be great to open-source the code for these benchmarks (either as a notebook or a library). We are happy to do that - let me know if you are interested in collaborating on this.

Thanks,
Preetam

Vectara org

Also, I think it would be great to open-source the code for these benchmarks (either as a notebook or a library). We are happy to do that - let me know if you are interested in collaborating on this.

That sounds like an interesting idea. Which benchmarks were you specifically referring to? The three mentioned in the model card, or the Hallucination Leaderboard?
For context, Simon, the research lead, sadly passed away unexpectedly last November. Without him, our team has been stretched thin, and we haven't been able to actively push forward all his work.

The three mentioned in the model card, or the Hallucination Leaderboard?

Yes, the three mentioned in the model card.

I did hear the really unfortunate news about Simon; it was very sad. I understand that you are short on resources. Our team can take a first pass at this - we can sync offline to figure out how to collaborate, if possible.

pjoshi30 changed discussion status to closed
Vectara org

That sounds good. You can reach me directly at [email protected]. I look forward to hearing from you there.
