initial commit
- .gitattributes +1 -0
- README.md +28 -13
- app.py +98 -0
- assests/screenshot.png +0 -0
- dataloader.py +18 -0
- requirements.txt +5 -0
- segmentation.py +90 -0
- utils.py +79 -0
- vectors.kv +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+vectors.kv filter=lfs diff=lfs merge=lfs -text

README.md
CHANGED
@@ -1,13 +1,28 @@
# SNOMED-Entity-Linking
A [Gradio](https://www.gradio.app/) app for entity linking on [SNOMED CT](https://www.snomed.org/five-step-briefing), a knowledge graph of clinical healthcare terminology.

![](assests/screenshot.png)

## Motivation
Much of the world's healthcare data is stored in free-text documents, usually clinical notes taken by doctors. This unstructured data can be challenging to analyze and extract meaningful insights from.
However, by applying a standardized terminology like SNOMED CT, we can improve the interpretability of these notes for patients and individuals outside the organization of origin.
Moreover, healthcare organizations can convert this free-text data into a structured format that can be readily analyzed by computers, in turn stimulating the development of new medicines, treatment pathways, and better patient outcomes.

Here, we use entity linking to analyze clinical notes, identifying and labeling the portions of each note that correspond to specific medical concepts.

## Methodology
The pipeline involves two models: one for segmentation and another for disambiguation (classification of the segmented mentions).
The segmentation model is a [CANINE-s](https://huggingface.co/google/canine-s) character-level transformer, fine-tuned to optimise a weighted combination of BCE, Dice, and Focal losses with weights of 1, 1, and 0.1 respectively. This objective is optimised using Adam with a learning rate of 1e-5.
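
As a rough illustration of this objective (not the exact training code: the Dice and Focal formulations below are common choices and are assumptions on our part), the combined loss could be sketched as:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, targets, w_bce=1.0, w_dice=1.0, w_focal=0.1, gamma=2.0):
    """Weighted BCE + Dice + Focal loss over per-character logits (sketch)."""
    probs = torch.sigmoid(logits)

    # Binary cross-entropy on the raw logits
    bce = F.binary_cross_entropy_with_logits(logits, targets)

    # Soft Dice loss: 1 - Dice coefficient, with a small smoothing term
    intersection = (probs * targets).sum()
    dice = 1 - (2 * intersection + 1e-6) / (probs.sum() + targets.sum() + 1e-6)

    # Focal loss: down-weight characters the model already classifies well
    pt = probs * targets + (1 - probs) * (1 - targets)
    focal = ((1 - pt) ** gamma *
             F.binary_cross_entropy_with_logits(logits, targets, reduction="none")).mean()

    return w_bce * bce + w_dice * dice + w_focal * focal
```
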
The classification model uses [BioBERT](https://huggingface.co/dmis-lab/biosyn-biobert-bc5cdr-disease). It is trained similarly, using Adam with a learning rate of 2e-5 and the [MultipleNegativesRankingLoss](https://arxiv.org/pdf/1705.00652) from the [SentenceTransformers](https://sbert.net/) library.
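
A minimal sketch of this training setup with the SentenceTransformers API (the mention/concept pairs, batch size, and epoch count are illustrative assumptions; only the loss and learning rate come from the description above):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (mention, concept-name) positive pairs; in-batch negatives are implicit
pairs = [("chest pain", "pain in chest"),
         ("htn", "hypertensive disorder")]
train_examples = [InputExample(texts=[mention, concept]) for mention, concept in pairs]

model = SentenceTransformer("dmis-lab/biosyn-biobert-bc5cdr-disease")
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1,
          optimizer_params={"lr": 2e-5})
```

At inference time, mentions are embedded with this model and matched to pre-computed concept vectors (stored in `vectors.kv` and queried via gensim's `most_similar` in `app.py`).
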

## Dataset
The dataset used to train the models is the one from the [SNOMED CT Entity Linking Challenge](https://physionet.org/content/snomed-ct-entity-challenge/1.0.0/), a subset of [MIMIC-IV-Note](https://physionet.org/content/mimic-iv-note/2.2/) comprising roughly 75,000 entity annotations across about 300 discharge notes.
For the sake of simplicity, we only include entities with more than 10 mentions.
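
As a sketch of that filtering step (the file name and column name are assumptions about the challenge data layout):

```python
import pandas as pd

annotations = pd.read_csv("train_annotations.csv")  # hypothetical annotations file

# Keep only SNOMED concepts that are mentioned more than 10 times
counts = annotations["concept_id"].value_counts()
frequent_concepts = counts[counts > 10].index
annotations = annotations[annotations["concept_id"].isin(frequent_concepts)]
```
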

## References
- Hardman, W., Banks, M., Davidson, R., Truran, D., Ayuningtyas, N. W., Ngo, H., Johnson, A., & Pollard, T. (2023). SNOMED CT Entity Linking Challenge (version 1.0.0). PhysioNet. https://doi.org/10.13026/s48e-sp45
- Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101(23), pp. e215–e220.
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
- Henderson, M., Al-Rfou, R., Strope, B., Sung, Y., Lukács, L., Guo, R., Kumar, S., Miklos, B., & Kurzweil, R. (2017). Efficient Natural Language Response Suggestion for Smart Reply. arXiv, abs/1705.00652.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Conference on Empirical Methods in Natural Language Processing.

app.py
ADDED
@@ -0,0 +1,98 @@
import torch
import pandas as pd
import configparser
import gradio as gr
from gensim.models import KeyedVectors
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForTokenClassification, AutoTokenizer

from segmentation import segment
from utils import clean_entity


class Linker:
    def __init__(self, config: dict[str, object],
                 context_window_width: int = -1):
        self._vectors = None
        self._emb_model = None
        if context_window_width <= 0:
            context_window_width = config['context_window_width']
        self.context_window_width = context_window_width
        self.config = config

    def add_context(self, row: pd.Series) -> str:
        window_start = max(0, row.start - self.context_window_width)
        window_end = min(row.end + self.context_window_width, len(row.text))
        return clean_entity(row.text[window_start:window_end])

    def _load_embeddings(self):
        self._vectors = KeyedVectors.load(self.config['keyed_vectors_file'])

    def _load_model(self):
        self._emb_model = SentenceTransformer(self.config['embedding_model'])

    @property
    def embeddings(self):
        if self._vectors is None:
            self._load_embeddings()
        return self._vectors

    @property
    def embedding_model(self):
        if self._emb_model is None:
            self._load_model()
        return self._emb_model

    def link(self, df: pd.DataFrame) -> list:
        # Embed each detected mention and retrieve its nearest concept vector.
        mention_emb = self.embedding_model.encode(df.mention.str.lower().values)

        concepts = [self.embeddings.most_similar(m, topn=1)[0][0]
                    for m in mention_emb]
        return concepts


def highlight_text(spans: pd.DataFrame, text: str) -> list[tuple[str, object]]:
    # Label each character of the input with the concept of the span covering it.
    token_concepts = [None for _ in text]

    for row in spans.itertuples():
        for k in range(row.start, row.end):
            token_concepts[k] = row.concept

    return list(zip(list(text), token_concepts))


def entity_link(query: str) -> list[tuple[str, object]]:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Move the segmentation model to the same device as the input batches.
    seg_model = AutoModelForTokenClassification.from_pretrained(
        config['segmentation_model']
    ).to(device)
    seg_tokenizer = AutoTokenizer.from_pretrained(
        config['segmentation_tokenizer']
    )
    thresh = float(config['thresh'])
    query_df = pd.DataFrame({'note_id': [0], 'text': [query]})

    seg = segment(query_df, seg_model, seg_tokenizer, device, thresh)
    linked_concepts = []
    if len(seg) > 0:
        seg = seg.sort_values('start')
        linked_concepts = linker.link(seg)
    seg['concept'] = linked_concepts

    return highlight_text(seg, query)


config_parser = configparser.ConfigParser()
config_parser.read('config.ini')
config = config_parser['DEFAULT']
linker = Linker(config)

demo = gr.Interface(
    fn=entity_link,
    inputs=["text"],
    outputs=gr.HighlightedText(
        label="linking",
        combine_adjacent=True,
    ),
    theme=gr.themes.Base()
)
assests/screenshot.png
ADDED
dataloader.py
ADDED
@@ -0,0 +1,18 @@
import torch
from torch.utils.data import DataLoader


class TestDataset(torch.utils.data.Dataset):
    def __init__(self, encodings: list[dict[str, list]]):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val) for key, val in self.encodings[idx].items()}
        return item

    def __len__(self):
        return len(self.encodings)


def create_dataloader(dat: list[dict[str, list]], batch_size: int) -> DataLoader:
    return DataLoader(TestDataset(dat), batch_size=batch_size, shuffle=False)
requirements.txt
ADDED
@@ -0,0 +1,5 @@
torch==2.2.1
pandas==2.2.0
sentence_transformers==2.6.1
transformers==4.39.1
numpy==1.26.4
segmentation.py
ADDED
@@ -0,0 +1,90 @@
import numpy as np
import torch.nn.functional as F
import pandas as pd

from dataloader import create_dataloader
from utils import *


def predict_segmentation(inp, model, device, batch_size=8):
    test_loader = create_dataloader(inp, batch_size)

    predictions = []
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        p = F.sigmoid(model(**batch).logits).detach().cpu().numpy()
        predictions.append(p)

    return np.concatenate(predictions, axis=0)


def create_data(text, tokenizer, seq_len=512):
    # Tokenize the full note, then split into fixed-length, zero-padded chunks.
    tokens = tokenizer(text, add_special_tokens=False)
    _token_batches = {k: [pad_seq(x, seq_len) for x in batch_list(v, seq_len)]
                      for (k, v) in tokens.items()}
    n_batches = len(_token_batches['input_ids'])
    return [{k: v[i] for k, v in _token_batches.items()}
            for i in range(n_batches)]


def segment_tokens(notes, model, tokenizer, device, batch_size=8):
    predictions = {}
    for note in notes.itertuples():
        note_id = note.note_id
        raw_text = note.text.lower()

        inp = create_data(raw_text, tokenizer)
        pred_probs = predict_segmentation(inp, model, device, batch_size=batch_size)
        pred_probs = np.squeeze(pred_probs, -1)
        pred_probs = np.concatenate(pred_probs)

        predictions[note_id] = pred_probs

    return predictions


def segment(notes, model, tokenizer, device, thresh, batch_size=8):
    predictions = []

    predictions_prob_map = segment_tokens(notes, model, tokenizer, device, batch_size)

    for note in notes.itertuples():

        note_id = note.note_id
        raw_text = note.text

        decoded_text = tokenizer.decode(tokenizer.encode(raw_text, add_special_tokens=False))

        pred_probs = predictions_prob_map[note_id]

        # Align character-level probabilities back onto the original text.
        _, pred_probs = align_decoded(raw_text, decoded_text, pred_probs)
        pred_probs = np.array(pred_probs, 'float32')
        pred = (pred_probs > thresh).astype('uint8')

        spans = get_sequential_spans(pred)

        note_predictions = {'note_id': [], 'start': [], 'end': [], 'mention': [], 'score': []}
        for (start, end) in spans:
            note_predictions['note_id'].append(note_id)
            note_predictions['score'].append(pred_probs[start:end].mean())
            note_predictions['start'].append(start)
            note_predictions['end'].append(end)
            note_predictions['mention'].append(raw_text[start:end])

        note_predictions = pd.DataFrame(note_predictions)
        note_predictions = note_predictions.sort_values('score', ascending=False)

        # remove overlapping spans, keeping higher-scoring spans first
        seen_spans = set()
        unseen = []
        for span in note_predictions[['start', 'end']].values:
            span = tuple(span)
            s = False
            if not is_overlap(seen_spans, span):
                seen_spans.add(span)
                s = True
            unseen.append(s)
        note_predictions = note_predictions[unseen]

        predictions.append(note_predictions)
    predictions = pd.concat(predictions).reset_index(drop=True)
    return predictions
utils.py
ADDED
@@ -0,0 +1,79 @@
import numpy as np


def is_overlap(existing_spans, new_span):
    for span in existing_spans:
        # Check if either end of the new span is within an existing span
        if (span[0] <= new_span[0] <= span[1]) or \
                (span[0] <= new_span[1] <= span[1]):
            return True
        # Check if the new span entirely covers an existing span
        if new_span[0] <= span[0] and new_span[1] >= span[1]:
            return True
    return False


def get_sequential_spans(a):
    # Collect (start, end) index pairs for each contiguous run of truthy values.
    spans = []

    prev = False
    start = 0

    for i, x in enumerate(a):
        if not prev and x:
            start = i
        elif prev and not x:
            spans.append((start, i))

        prev = x

    # Close a span that runs to the end of the sequence.
    if x:
        spans.append((start, i + 1))

    return spans


def batch_list(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]


def pad_seq(seq, max_len):
    n = len(seq)
    if n >= max_len:
        return seq
    else:
        return np.pad(seq, (0, max_len - n))


def align_decoded(x, d, y):
    # Walk the decoded text d and the original text x in parallel, duplicating
    # labels y where the tokenizer dropped a space before punctuation.
    clean_text = ""
    clean_label = []
    j = 0
    for i in range(len(d)):
        found = False
        for delim in [',', '.', '?', "'"]:
            if (x[j:j + 2] == f" {delim}") and (d[i] == f"{delim}"):
                found = True
                clean_text += f' {delim}'
                clean_label += [y[j], y[j]]
                j += 1

        if not found:
            clean_text += x[j]
            clean_label += [y[j]]
        j += 1

    if (clean_text != x) and (x[-1:] == "\n"):
        clean_text += "\n"
        clean_label += [0, 0]

    return clean_text, clean_label


def clean_entity(t):
    t = t.lower()
    t = t.replace(' \n', " ")
    t = t.replace('\n', " ")
    return t
vectors.kv
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:55c8f6f379646d6ddb06d4f33d615e09f3354ce229271113e2ce57ae6164c673
size 4914710