Lihuchen/AcroBERT · Hugging Face

AcroBERT can do end-to-end acronym linking (see the Demo here). Given a sentence, our framework first recognizes acronyms using MadDog and then disambiguates them using AcroBERT:

from inference.acrobert import acronym_linker

# input sentence with acronyms, the maximum length is 400 sub-tokens
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."

# mode is one of 'acrobert' or 'pop'
# 'acrobert' is more accurate; 'pop' is faster but less accurate.
results = acronym_linker(sentence, mode='acrobert')
print(results)

## expected output: [('NCBI', 'National Center for Biotechnology Information')]
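
The result is a list of (acronym, long form) tuples, as in the expected output above. As a minimal sketch of how that list might be consumed, the helper below writes each long form in parentheses after the first occurrence of its acronym. The function `expand_acronyms` is hypothetical and not part of the GLADIS repository; the `links` value is copied from the expected output so the snippet runs without the model.

```python
import re


def expand_acronyms(sentence, links):
    """Append each long form in parentheses after the first occurrence of its acronym.

    `links` is a list of (acronym, long_form) tuples, the shape shown in the
    expected output of acronym_linker. This helper is illustrative only.
    """
    expanded = sentence
    for acronym, long_form in links:
        # Match the acronym as a whole word so it is not replaced inside a longer token.
        pattern = r"\b" + re.escape(acronym) + r"\b"
        replacement = f"{acronym} ({long_form})"
        # A callable replacement avoids backslash-escape handling in the long form.
        expanded = re.sub(pattern, lambda match: replacement, expanded, count=1)
    return expanded


sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."
links = [("NCBI", "National Center for Biotechnology Information")]
print(expand_acronyms(sentence, links))
```

Only the first occurrence is expanded, which follows the common writing convention of defining an acronym once.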

GitHub: https://github.com/tigerchen52/GLADIS

Model: https://zenodo.org/record/7568937#.Y9vtrXaZMuU

In addition to AcroBERT, we constructed GLADIS, a new benchmark for accelerating research on acronym disambiguation. It contains the following data:

	Source	Description
Acronym Dictionary	Pile (MIT license), Wikidata, UMLS	1.6 million acronyms and 6.4 million long forms
Three Datasets	WikilinksNED Unseen, SciAD (CC BY-NC-SA 4.0), MedMentions (CC0 1.0)	Three acronym disambiguation (AD) datasets covering the general, scientific, and biomedical domains
A Pre-training Corpus	Pile (MIT license)	160 million sentences with acronyms

usage

git clone https://github.com/tigerchen52/GLADIS.git
Download the acronym dictionary and AcroBERT, and put them in the input/ directory.
Call inference.acrobert.acronym_linker() to perform end-to-end acronym linking.
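
Once the steps above are done, linking several sentences is a matter of calling the linker in a loop. The sketch below is an assumption-laden illustration, not code from the repository: `link_sentences` and `fake_linker` are hypothetical names, and the linker is passed in as a parameter so the snippet runs without the downloaded model. Inside the cloned repository, `fake_linker` would be replaced by `acronym_linker` imported from `inference.acrobert`.

```python
def link_sentences(sentences, linker, mode="acrobert"):
    """Run an acronym linker over several sentences.

    `linker` is any callable of the form linker(sentence, mode=...) returning
    a list of (acronym, long_form) tuples. Results are kept per sentence,
    because the same acronym can resolve to different long forms in
    different contexts.
    """
    if mode not in ("acrobert", "pop"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [(sentence, linker(sentence, mode=mode)) for sentence in sentences]


# Stand-in linker (hypothetical) so the sketch runs without the model files.
def fake_linker(sentence, mode="acrobert"):
    known = {"NCBI": "National Center for Biotechnology Information"}
    return [(acronym, long_form) for acronym, long_form in known.items() if acronym in sentence]


sentences = [
    "The assembly is tagged as a RefSeq genome by NCBI.",
    "This sentence has no acronyms.",
]
for sentence, links in link_sentences(sentences, fake_linker):
    print(sentence, "->", links)
```

Each input sentence should still respect the 400 sub-token limit noted in the example; this sketch does not check it.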

citation

@inproceedings{chen2023gladis,
  title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2023}
}