
SapBERT-DE is a model for German biomedical entity linking. It was obtained by fine-tuning the multilingual entity linking model cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR on UMLS-Wikidata, a German biomedical entity linking knowledge base.

Usage

import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel

# use a GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("permediq/SapBERT-DE", use_fast=True)
model = AutoModel.from_pretrained("permediq/SapBERT-DE").to(device)
model.eval()  # make sure dropout is disabled for inference

# entity descriptions to embed
entity_descriptions = ["Cerebellum", "Zerebellum", "Kleinhirn", "Anaesthesie"]

bs = 32  # batch size
all_embs = []
for i in tqdm(range(0, len(entity_descriptions), bs)):
    toks = tokenizer(entity_descriptions[i:i+bs],
                     padding="max_length",
                     max_length=40,  # model trained with 40 max_length
                     truncation=True,
                     return_tensors="pt")
    # move the tokenized batch to the same device as the model
    toks = {k: v.to(device) for k, v in toks.items()}
    with torch.no_grad():  # inference only, no gradients needed
        # the first-token vector of the last hidden state is the embedding
        cls_rep = model(**toks)[0][:, 0, :]
    all_embs.append(cls_rep.cpu())

# stack the per-batch embeddings into one (num_entities, hidden_size) tensor
all_embs = torch.cat(all_embs)

def cos_sim(a, b):
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

# cosine similarity of first entity with all the entities
print(cos_sim(all_embs[0].unsqueeze(0), all_embs))

# >>> tensor([[1.0000, 0.9337, 0.6206, 0.2086]])
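
The similarity scores above can be turned into an actual linking step by ranking knowledge-base entries for each mention. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: the function name link_mentions and the identifiers "ID_A" and "ID_B" are hypothetical, the 3-dimensional vectors are toy stand-ins for real model embeddings, and the search is a brute-force NumPy comparison rather than an approximate nearest-neighbour index.

```python
import numpy as np

def link_mentions(mention_embs, entity_embs, entity_ids, top_k=1):
    """Return, for each mention, its top_k (entity_id, cosine similarity) pairs."""
    # L2-normalise the rows so that a dot product equals cosine similarity
    m = mention_embs / np.linalg.norm(mention_embs, axis=1, keepdims=True)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    sims = m @ e.T  # shape: (num_mentions, num_entities)
    # indices of the top_k most similar entities per mention, best first
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return [[(entity_ids[j], float(sims[i, j])) for j in row]
            for i, row in enumerate(top)]

# Toy vectors stand in for embeddings produced by the model (assumption)
entity_ids = ["ID_A", "ID_B"]  # hypothetical knowledge-base identifiers
entity_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mention_embs = np.array([[0.9, 0.1, 0.0]])

print(link_mentions(mention_embs, entity_embs, entity_ids, top_k=2))
```

With real data, the embeddings computed in the usage snippet could be passed in as all_embs.numpy(), with mention embeddings produced the same way.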

BibTeX

@inproceedings{mustafa-etal-2024-leveraging,
    title = "Leveraging {W}ikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for {G}erman",
    author = "Mustafa, Faizan E  and
      Dima, Corina  and
      Ochoa, Juan  and
      Staab, Steffen",
    booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.clinicalnlp-1.17",
    pages = "202--207",
}

Model size: 278M params (Safetensors, tensor type F32)