Model Card for MedGENIE-fid-flan-t5-base-medqa

MedGENIE is a collection of language models that use generated contexts, rather than retrieved ones, to answer multiple-choice open-domain questions in the medical field. Specifically, MedGENIE-fid-flan-t5-base-medqa is a fusion-in-decoder (FID) model based on flan-t5-base, trained on the MedQA-USMLE dataset and grounded on artificial contexts generated by PMC-LLaMA-13B. This model achieves new state-of-the-art (SOTA) performance on the corresponding test set.
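To make the fusion-in-decoder setup concrete, the sketch below pairs one question with several generated contexts, producing one input string per context. It is only an illustration: the helper name `build_fid_inputs`, the "question:" and "context:" prefixes, and the sample question are assumptions, not the template used by the authors.

```python
from typing import Dict, List


def build_fid_inputs(
    question: str,
    options: Dict[str, str],
    contexts: List[str],
    n_context: int = 5,
) -> List[str]:
    """Pair the question (with its answer options) with each generated context.

    A fusion-in-decoder model encodes every returned string independently;
    the decoder then attends over all encoded passages at once.
    The prefixes below are an assumed template, not the authors' exact one.
    """
    option_text = " ".join(f"{label}. {text}" for label, text in options.items())
    query = f"question: {question} {option_text}"
    # Keep at most n_context passages, mirroring the n_context hyperparameter.
    return [f"{query} context: {ctx}" for ctx in contexts[:n_context]]


# Made-up question and placeholder contexts, for illustration only.
passages = build_fid_inputs(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    contexts=[f"Generated context {i}." for i in range(1, 8)],
)
print(len(passages))  # 5: only the first n_context contexts are kept
```

Because each passage is encoded separately, the encoder cost grows linearly with the number of contexts rather than quadratically with their combined length.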

Model description

Language(s) (NLP): English
License: MIT
Finetuned from model: google/flan-t5-base
Repository: https://github.com/disi-unibo-nlp/medgenie
Paper: To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering

Performance

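At the time of release (February 2024), MedGENIE-fid-flan-t5-base-medqa is a new lightweight SOTA model on the MedQA-USMLE benchmark. In the Ground column, G denotes generated contexts, R retrieved contexts, and ∅ no grounding; rows are sorted by descending accuracy: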

Model	Ground (Source)	Learning	Params	Accuracy (↓)
MedGENIE-FID-Flan-T5	G (PMC-LLaMA)	Fine-tuned	250M	53.1
Codex (Liévin et al.)	∅	0-shot	175B	52.5
Codex (Liévin et al.)	R (Wikipedia)	0-shot	175B	52.5
GPT-3.5-Turbo (Yang et al.)	R (Wikipedia)	k-shot	--	52.3
MEDITRON (Chen et al.)	∅	Fine-tuned	7B	52.0
BioMistral DARE (Labrak et al.)	∅	Fine-tuned	7B	51.1
BioMistral (Labrak et al.)	∅	Fine-tuned	7B	50.6
Zephyr-β	R (MedWiki)	2-shot	7B	50.4
BioMedGPT (Luo et al.)	∅	k-shot	10B	50.4
BioMedLM (Singhal et al.)	∅	Fine-tuned	2.7B	50.3
PMC-LLaMA (awq 4 bit)	∅	Fine-tuned	13B	50.2
LLaMA-2 (Chen et al.)	∅	Fine-tuned	7B	49.6
Zephyr-β	∅	2-shot	7B	49.6
Zephyr-β (Chen et al.)	∅	3-shot	7B	49.2
PMC-LLaMA (Chen et al.)	∅	Fine-tuned	7B	49.2
DRAGON (Yasunaga et al.)	R (UMLS)	Fine-tuned	360M	47.5
InstructGPT (Liévin et al.)	R (Wikipedia)	0-shot	175B	47.3
BioMistral DARE (Labrak et al.)	∅	3-shot	7B	47.0
Flan-PaLM (Singhal et al.)	∅	5-shot	62B	46.1
InstructGPT (Liévin et al.)	∅	0-shot	175B	46.0
VOD (Liévin et al. 2023)	R (MedWiki)	Fine-tuned	220M	45.8
Vicuna 1.3 (Liévin et al.)	∅	0-shot	33B	45.2
BioLinkBERT (Singhal et al.)	∅	Fine-tuned	340M	45.1
Mistral-Instruct	R (MedWiki)	2-shot	7B	45.1
BioMistral (Labrak et al.)	∅	3-shot	7B	44.4
Galactica	∅	0-shot	120B	44.4
LLaMA-2 (Liévin et al.)	∅	0-shot	70B	43.4
BioReader (Frisoni et al.)	R (PubMed-RCT)	Fine-tuned	230M	43.0
Guanaco (Liévin et al.)	∅	0-shot	33B	42.9
LLaMA-2-chat (Liévin et al.)	∅	0-shot	70B	42.3
Vicuna 1.5 (Liévin et al.)	∅	0-shot	65B	41.6
Mistral-Instruct (Chen et al.)	∅	3-shot	7B	41.1
PaLM (Singhal et al.)	∅	5-shot	62B	40.9
Guanaco (Liévin et al.)	∅	0-shot	65B	40.8
Falcon-Instruct (Liévin et al.)	∅	0-shot	40B	39.0
Vicuna 1.3 (Liévin et al.)	∅	0-shot	13B	38.7
GreaseLM (Zhang et al.)	R (UMLS)	Fine-tuned	359M	38.5
PubMedBERT (Singhal et al.)	∅	Fine-tuned	110M	38.1
QA-GNN (Yasunaga et al.)	R (UMLS)	Fine-tuned	360M	38.0
LLaMA-2 (Yang et al.)	R (Wikipedia)	k-shot	13B	37.6
LLaMA-2-chat	R (MedWiki)	2-shot	7B	37.2
LLaMA-2-chat	∅	2-shot	7B	37.2
BioBERT (Lee et al.)	∅	Fine-tuned	110M	36.7
MPT-Instruct (Liévin et al.)	∅	0-shot	30B	35.1
GPT-Neo (Singhal et al.)	∅	Fine-tuned	2.5B	33.3
LLaMA-2-chat (Liévin et al.)	∅	0-shot	13B	32.2
LLaMA-2 (Liévin et al.)	∅	0-shot	13B	31.1
GPT-NeoX (Liévin et al.)	∅	0-shot	20B	26.9

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
n_context: 5
per_gpu_batch_size: 1
accumulation_steps: 4
total_steps: 40,712
eval_freq: 10,178
optimizer: AdamW
scheduler: linear
weight_decay: 0.01
warmup_ratio: 0.1
text_maxlength: 1024
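The arithmetic implied by these values can be sketched as follows. The GPU count, the rounding of the warmup length, and the decay-to-zero shape of the linear schedule are assumptions made for illustration; the card does not state them.

```python
# Values copied from the hyperparameter list above.
per_gpu_batch_size = 1
accumulation_steps = 4
total_steps = 40_712
eval_freq = 10_178
warmup_ratio = 0.1
learning_rate = 5e-5

# Assumption: a single GPU; multiply by the actual GPU count otherwise.
n_gpus = 1
examples_per_update = per_gpu_batch_size * accumulation_steps * n_gpus

# Assumption: the warmup length is rounded down to a whole step.
warmup_steps = int(warmup_ratio * total_steps)

# eval_freq divides total_steps evenly, so evaluation runs a whole number of times.
n_evaluations = total_steps // eval_freq


def linear_lr(step: int) -> float:
    """Linear warmup to the peak rate, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = total_steps - step
    return learning_rate * max(0.0, remaining / (total_steps - warmup_steps))


print(examples_per_update)  # 4
print(warmup_steps)         # 4071
print(n_evaluations)        # 4
```

Under these assumptions the learning rate peaks at 5e-05 after 4,071 steps and the model is evaluated four times over the run.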

Bias, Risk and Limitation

Our model is trained on artificially generated contextual documents, which might inadvertently magnify inherent biases and depart from clinical and societal norms. This could lead to the spread of convincing medical misinformation. To mitigate this risk, we recommend a cautious approach: domain experts should manually review any output before real-world use. This ethical safeguard is crucial to prevent the dissemination of potentially erroneous or misleading information, particularly within clinical and scientific circles.

Citation

If you find MedGENIE-fid-flan-t5-base-medqa useful in your work, please cite it with:

@misc{frisoni2024generate,
      title={To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering}, 
      author={Giacomo Frisoni and Alessio Cocchieri and Alex Presepi and Gianluca Moro and Zaiqiao Meng},
      year={2024},
      eprint={2403.01924},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
