AcroBERT can do end-to-end acronym linking (see the Demo here). Given a sentence, our framework first recognizes acronyms using MadDog and then disambiguates them using AcroBERT:
from inference.acrobert import acronym_linker
# input sentence with acronyms, the maximum length is 400 sub-tokens
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
# mode is one of 'acrobert' or 'pop'
# 'acrobert' is more accurate; 'pop' is faster but less accurate.
results = acronym_linker(sentence, mode='acrobert')
print(results)
## expected output: [('NCBI', 'National Center for Biotechnology Information')]
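
The text above does not spell out how the `pop` mode works. Assuming it is a popularity baseline that picks the most frequent long form for each acronym and ignores the surrounding context, the idea can be sketched as follows. The toy dictionary, its counts, the naive capital-letter recognizer, and the name `link_by_popularity` are all hypothetical and are not taken from the GLADIS code.

```python
import re

# Hypothetical toy dictionary: acronym -> {long form: frequency count}.
# The counts are made up for illustration only.
TOY_DICTIONARY = {
    "NCBI": {"National Center for Biotechnology Information": 120},
    "AD": {"Alzheimer's disease": 50, "acronym disambiguation": 8},
}


def link_by_popularity(sentence, dictionary=TOY_DICTIONARY):
    """Link each known acronym to its most frequent long form, ignoring context."""
    results = []
    # Naive recognizer (an assumption): any standalone run of 2+ capital letters.
    for acronym in re.findall(r"\b[A-Z]{2,}\b", sentence):
        candidates = dictionary.get(acronym)
        if candidates:
            # Pick the long form with the highest count.
            best = max(candidates, key=candidates.get)
            results.append((acronym, best))
    return results


print(link_by_popularity(
    "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
))
# Context is ignored, so "AD" resolves to the most frequent entry
# even in a sentence about acronym disambiguation.
print(link_by_popularity("We evaluate AD systems on scientific text."))
```

A baseline of this kind needs only a dictionary lookup per acronym, which would explain why it is fast, and it cannot use context to choose between competing long forms, which would explain the lower accuracy.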
GitHub: https://github.com/tigerchen52/GLADIS
Model: https://zenodo.org/record/7568937#.Y9vtrXaZMuU
In addition to AcroBERT, we constructed a new benchmark named GLADIS to accelerate research on acronym disambiguation. It contains the following data:
Component | Source | Description |
---|---|---|
Acronym Dictionary | Pile (MIT license), Wikidata, UMLS | 1.6 million acronyms and 6.4 million long forms |
Three Datasets | WikilinksNED Unseen, SciAD (CC BY-NC-SA 4.0), MedMentions (CC0 1.0) | three acronym disambiguation (AD) datasets covering the general, scientific, and biomedical domains |
A Pre-training Corpus | Pile (MIT license) | 160 million sentences with acronyms |
usage
- git clone https://github.com/tigerchen52/GLADIS.git
- download the acronym dictionary and the AcroBERT model, and put them in the input/ directory
- use the function inference.acrobert.acronym_linker() to do end-to-end acronym linking.
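
To link several sentences in one go, the linker can be wrapped in a small helper. The sketch below assumes only the call shape shown earlier, `acronym_linker(sentence, mode=...)` returning a list of `(acronym, long_form)` tuples. The names `link_sentences` and `fake_linker` are hypothetical, and the stub stands in for the real function so that the example runs without the model files.

```python
def link_sentences(sentences, linker, mode="acrobert"):
    """Apply `linker` to each sentence and return (sentence, links) pairs."""
    return [(sentence, linker(sentence, mode=mode)) for sentence in sentences]


def fake_linker(sentence, mode="acrobert"):
    """Hypothetical stand-in for inference.acrobert.acronym_linker."""
    known = {"NCBI": "National Center for Biotechnology Information"}
    return [(acronym, long_form) for acronym, long_form in known.items()
            if acronym in sentence]


sentences = [
    "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI.",
    "A sentence without any acronym.",
]
for sentence, links in link_sentences(sentences, fake_linker):
    print(sentence, "->", links)
```

Once the repository and the files in input/ are in place, the real function can be passed instead of the stub, for example `link_sentences(sentences, acronym_linker, mode='pop')`.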
citation
@inproceedings{chen2023gladis,
title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
year={2023}
}