To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
Abstract
Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, "to generate or to retrieve" is the modern equivalent of Hamlet's dilemma. This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MedGENIE sets a new state-of-the-art (SOTA) in the open-book setting of each testbed, even allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706× fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy.
Community
Our work introduces MedGENIE, a novel generate-then-read framework for medical multiple-choice question answering. Unlike traditional retrieve-then-read methods, MedGENIE grounds generalist LLMs and SLMs on multi-view contexts generated by a medical LLM. To foster accessibility and match prevalent hardware configurations, we assume a low-resource infrastructure with 24GB VRAM.
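The sketch below illustrates the generate-then-read idea: a medical LLM is sampled several times to produce artificial background passages ("multi-view" contexts), and a small reader answers the multiple-choice question grounded on them. The model names (epfl-llm/meditron-7b, microsoft/Phi-3-mini-4k-instruct), prompt wording, and decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal generate-then-read sketch. Model names, prompts, and decoding
# parameters are illustrative assumptions, not MedGENIE's exact setup.
from transformers import pipeline

# Hypothetical choices: a medical LLM as context generator, a small generalist reader.
generator = pipeline("text-generation", model="epfl-llm/meditron-7b", device_map="auto")
reader = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct", device_map="auto")

question = "A deficiency of which vitamin causes scurvy?"
options = {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"}

# 1) Generate several artificial contexts ("multi-view") by sampling the medical LLM.
ctx_prompt = f"Write a short background passage with the medical knowledge needed to answer:\n{question}\n"
contexts = [
    out["generated_text"].strip()
    for out in generator(ctx_prompt, do_sample=True, temperature=0.9,
                         max_new_tokens=150, num_return_sequences=3,
                         return_full_text=False)
]

# 2) Ground the reader on the generated contexts instead of retrieved passages.
opts = "\n".join(f"{k}. {v}" for k, v in options.items())
read_prompt = (
    "Context:\n" + "\n\n".join(contexts)
    + f"\n\nQuestion: {question}\nOptions:\n{opts}\nAnswer with the option letter only."
)
answer = reader(read_prompt, do_sample=False, max_new_tokens=4,
                return_full_text=False)[0]["generated_text"]
print(answer.strip())
```

In practice, fine-tuning the reader on such generated contexts (as done for Flan-T5-base in the results below) is what allows a very small model to compete with much larger closed-book baselines.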
Results:
- We evaluate MedGENIE on three standard ODQA benchmarks designed to quantify professional medical competencies: MedQA-USMLE, MedMCQA, and MMLU-Medical.
- MedGENIE greatly improves non-grounded LLaMA-3-Instruct, Phi-3-mini, LLaMA-2-chat, and Zephyr-β by up to +11.7 points on average. MedGENIE-Phi-3-mini surpasses the strongest fine-tuned alternative, MEDITRON, by +12.7 on MedQA.
- By fine-tuning the reader, MedGENIE allows Flan-T5-base to outcompete closed-book zero-shot 175B LLMs and supervised 10B baselines on MedQA, using up to 706× fewer parameters.
- Our research demonstrates a clear inclination of cutting-edge rerankers (BGE-large) towards favouring generated contexts over retrieved ones.
- When treated as knowledge sources or incorporated into human-curated ones, artificial passages notably enhance the effectiveness of retrieve-then-read workflows (up to about 6 extra points).
- RAGAS evaluation (gpt-4-turbo-2024-04-09) shows that the generated contexts are significantly better than the retrieved ones: up to +39.3 in context precision, +27.2 in context recall, and +35.9 in faithfulness.