DiLBERT (Disease Language BERT)
The objective of this model was to obtain a specialized disease-related language model, trained from scratch.
We created a pre-training corpus starting from ICD-11 entities and enriched it with documents from PubMed and Wikipedia related to the same entities.
Fine-tuning results show that, on various classification tasks, DiLBERT achieves accuracy comparable to or higher than other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet).
Model released with the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP".
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain-specific BERT model on a small corpus, with performance comparable to or better than state-of-the-art models.
This approach may also simplify the development of models for languages other than English, because of the small amount of data needed for training.
Composition of the pretraining corpus
Source | Documents | Words |
---|---|---|
ICD-11 descriptions | 34,676 | 1.0 million |
PubMed Title and Abstracts | 852,550 | 184.6 million |
Wikipedia pages | 37,074 | 6.1 million |
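To put the table in perspective, the short sketch below sums the rows and computes each source's share of the total word count. It uses only the rounded figures from the table above, so the totals are approximate; the variable and function names are illustrative.

```python
# Figures copied from the corpus composition table (word counts in millions, rounded).
corpus = {
    "ICD-11 descriptions": {"documents": 34_676, "words_millions": 1.0},
    "PubMed Title and Abstracts": {"documents": 852_550, "words_millions": 184.6},
    "Wikipedia pages": {"documents": 37_074, "words_millions": 6.1},
}

total_documents = sum(src["documents"] for src in corpus.values())
total_words_millions = sum(src["words_millions"] for src in corpus.values())


def word_share(source: str) -> float:
    """Fraction of the corpus word count contributed by one source."""
    return corpus[source]["words_millions"] / total_words_millions


for name, src in corpus.items():
    print(f"{name}: {src['documents']:,} documents, {word_share(name):.1%} of words")

print(f"Total: {total_documents:,} documents, about {total_words_millions:.1f} million words")
```

Based on these rounded figures, the corpus holds roughly 192 million words, and PubMed titles and abstracts account for about 96% of them.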
Main repository
For more details, see the main repository: https://github.com/KevinRoitero/dilbert
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")
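Since the checkpoint is loaded as a masked language model, one quick way to try it is masked-token prediction. The following is a minimal sketch using the `fill-mask` pipeline from `transformers`; the helper names `mask_word` and `predict_masked` are hypothetical and not part of the DiLBERT repository, and the mask token is read from the tokenizer rather than assumed.

```python
def mask_word(sentence: str, word: str, mask_token: str) -> str:
    """Replace the first occurrence of `word` with the given mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def predict_masked(sentence: str, word: str,
                   model_id: str = "beatrice-portelli/DiLBERT",
                   top_k: int = 5):
    """Return (token, score) pairs the model proposes for the masked word."""
    # Imported lazily so mask_word can be used without transformers installed.
    from transformers import pipeline

    # Downloads the model and tokenizer from the Hugging Face Hub on first use.
    fill_mask = pipeline("fill-mask", model=model_id)
    masked = mask_word(sentence, word, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=top_k)]
```

For example, `predict_masked("The patient was diagnosed with diabetes mellitus.", "diabetes")` masks the word "diabetes" and returns the five highest-scoring replacement tokens. The sentence is an illustration only and is not taken from the paper.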
How to cite
@article{roitero2021dilbert,
title={{DiLBERT}: Cheap Embeddings for Disease Related Medical NLP},
author={Roitero, Kevin and Portelli, Beatrice and Popescu, Mihai Horia and Della Mea, Vincenzo},
journal={IEEE Access},
volume={},
pages={},
year={2021},
publisher={IEEE},
note = {In Press}
}