Danish medical word embeddings

MeDa-We is a set of Danish medical word embeddings trained on a Danish medical corpus of 123M tokens. The embeddings are 300-dimensional and were trained using FastText.
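Word embeddings like these are typically compared with cosine similarity. The following is a minimal sketch of a nearest-neighbour lookup in plain Python. The helper names (`cosine_similarity`, `most_similar`) and the 3-dimensional toy vectors are made up for illustration; they are not values from MeDa-We, and how you load the real 300-dimensional vectors depends on the file format you download.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional toy vectors for illustration only.
# These are NOT taken from MeDa-We, whose vectors are 300-dimensional.
toy_vectors = {
    "hjerte": [0.9, 0.1, 0.0],
    "puls": [0.8, 0.2, 0.1],
    "knogle": [0.0, 0.1, 0.9],
}

print(most_similar("hjerte", toy_vectors, topn=2))
```

With real embeddings, `toy_vectors` would be replaced by a mapping from each word to its 300-dimensional vector.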

The embeddings were trained for 10 epochs using a window size of 5 and 10 negative samples.
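For readers who want to train comparable embeddings on their own corpus, the sketch below shows how the settings above might map onto gensim's `FastText` class. This is an assumption for illustration: this README does not say which FastText implementation was used, and every setting not listed above (for example skip-gram vs. CBOW, minimum word count, subword n-gram range) is left at the library default, which may differ from the original setup. The function name `train_fasttext` is hypothetical.

```python
# Hyperparameters stated in this README. Anything not listed here is left
# at the gensim default and may not match the original training run.
HYPERPARAMS = {
    "vector_size": 300,  # embedding dimensionality
    "window": 5,         # context window size
    "negative": 10,      # negative samples per positive example
    "epochs": 10,        # passes over the corpus
}


def train_fasttext(sentences, **overrides):
    """Train a FastText model on an iterable of tokenized sentences.

    `sentences` is an iterable of token lists, e.g. [["patienten", "har", "feber"], ...].
    Keyword arguments in `overrides` replace or extend HYPERPARAMS.
    """
    # Imported lazily so the settings above can be inspected without gensim.
    from gensim.models import FastText  # assumption: gensim >= 4.0

    params = {**HYPERPARAMS, **overrides}
    return FastText(sentences=sentences, **params)
```

A call such as `model = train_fasttext(corpus)` returns a trained model whose vectors are available under `model.wv`. On a very small test corpus you would likely need to pass `min_count=1`, since gensim ignores words seen fewer than 5 times by default.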

The development of the corpus and word embeddings is described further in our paper (see Citing below).

We also trained a transformer model, MeDa-BERT, on the same corpus; it can be found here.

Citing

@inproceedings{pedersen-etal-2023-meda,
    title = "{M}e{D}a-{BERT}: A medical {D}anish pretrained transformer model",
    author = "Pedersen, Jannik  and
      Laursen, Martin  and
      Vinholt, Pernille  and
      Savarimuthu, Thiusius Rajeeth",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.31",
    pages = "301--307",
}