🧪 RAG Evaluation with 🔥 Prometheus 2 + Haystack
Blog post: https://haystack.deepset.ai/blog/rag-evaluation-with-prometheus-2
Notebook: https://github.com/deepset-ai/haystack-cookbook/blob/main/notebooks/prometheus2_evaluation.ipynb
When evaluating LLM responses, proprietary models like GPT-4 are commonly used due to their strong performance.
However, relying on closed models presents challenges related to data privacy, transparency, controllability, and cost 💸.
On the other hand, open models typically do not correlate well with human judgments and lack flexibility.
🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:
🔹 two variants: prometheus-eval/prometheus-7b-v2.0 and prometheus-eval/prometheus-8x7b-v2.0
🔹 trained on open-source data
🔹 high correlation with human evaluations and proprietary models
🔹 highly flexible: capable of performing direct assessments and pairwise rankings, and allowing the definition of custom evaluation criteria.
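
To make "direct assessment with custom criteria" a bit more concrete, here is a minimal sketch in plain Python of how such an evaluation could be framed: build a prompt from a custom rubric, then pull a numeric score out of the model's text. Everything named here is an assumption for illustration: the rubric wording, the prompt layout, the helper names `build_direct_assessment_prompt` and `parse_score`, and the `[RESULT]` score marker. It is not the official Prometheus 2 template, so check the model card and the notebook for the exact format.

```python
import re

# Hypothetical rubric: the criterion and score descriptions are illustrative only.
RUBRIC = {
    "criteria": "Is the answer fully supported by the retrieved context?",
    "scores": {
        1: "The answer contradicts or ignores the context.",
        3: "The answer is partly supported by the context.",
        5: "Every claim in the answer is supported by the context.",
    },
}


def build_direct_assessment_prompt(question, context, answer, rubric):
    """Assemble a single-response (direct assessment) evaluation prompt.

    The layout below is an assumed, simplified template, not the official one.
    """
    score_lines = "\n".join(
        f"Score {score}: {description}"
        for score, description in sorted(rubric["scores"].items())
    )
    return (
        "Evaluate the response strictly against the score rubric.\n"
        "Write feedback, then end with '[RESULT]' followed by an integer score.\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Response to evaluate: {answer}\n\n"
        f"Criteria: {rubric['criteria']}\n"
        f"{score_lines}\n\n"
        "Feedback:"
    )


def parse_score(generated_text, low=1, high=5):
    """Extract the integer after '[RESULT]'; return None if missing or out of range."""
    match = re.search(r"\[RESULT\]\s*(\d+)", generated_text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


prompt = build_direct_assessment_prompt(
    question="Who wrote Dune?",
    context="Dune is a 1965 novel by Frank Herbert.",
    answer="Frank Herbert wrote Dune.",
    rubric=RUBRIC,
)
# In a real run, `prompt` would be sent to one of the Prometheus 2 models;
# the string below is a made-up stand-in for the model's reply.
example_reply = "The response matches the context exactly. [RESULT] 5"
score = parse_score(example_reply)
```

Returning `None` for a missing or out-of-range score keeps malformed generations from silently skewing an average, which matters when the judge is itself an LLM.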
See my experiments with RAG evaluation in the links above.