---
license: mit
language:
- fr
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
tags:
- medical
- biomedical
- medkit-lib
widget:
- text: >-
    La radiographie et la tomodensitométrie ont montré des micronodules diffus
  example_title: example 1
- text: >-
    Elle souffre d'asthme mais n'a pas besoin d'Allegra
  example_title: example 2
---

# DrBERT-CASM2

## Model description

**DrBERT-CASM2** is a French Named Entity Recognition model fine-tuned from
[DrBERT](https://huggingface.co/Dr-BERT/DrBERT-4GB-CP-PubMedBERT), a pretrained French model for the biomedical and clinical domains.
It was trained with the medkit Trainer to detect the following entity types: **problem**, **treatment** and **test**.

- **Fine-tuned using** medkit [GitHub Repo](https://github.com/TeamHeka/medkit)
- **Developed by** @camila-ud, medkit, HeKA Research team
- **Dataset source**

  Annotated version from @aneuraz called 'corpusCasM2: A corpus of annotated clinical texts'
  - The annotation was performed collaboratively by master's students from Université Paris Cité.
  - The corpus contains documents from CAS:

```
Natalia Grabar, Vincent Claveau, and Clément Dalloux. 2018. CAS: French Corpus with Clinical Cases.
In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis,
pages 122–128, Brussels, Belgium. Association for Computational Linguistics.
```

# Intended uses & limitations

## Limitations and bias

This model was trained for **development and test phases**.
It is limited by its training dataset and should be used with caution.
Results are not guaranteed, and the model should only be used in data exploration stages.
It may help detect entities in the early stages of analyzing medical documents in French.

The maximum sequence length was reduced to **128 tokens** to shorten training time.
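
Because of the 128-token limit, longer texts likely need to be split into shorter pieces before they are passed to the model. The following is a minimal sketch of one way to do that, not the method used to build the corpus. It assumes a naive punctuation-based sentence splitter and uses whitespace-separated words as a rough proxy for tokens; the helper name `split_into_chunks` and the `max_words` budget are hypothetical, and the real tokenizer produces sub-word tokens, so actual token counts will be higher than word counts.

```python
import re


def split_into_chunks(text: str, max_words: int = 64) -> list[str]:
    """Group sentences into chunks that aim to stay under `max_words` words.

    Hypothetical helper: words only approximate model tokens, so the default
    budget is kept well below the 128-token limit mentioned above.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk when this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget is kept whole, not cut.
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be wrapped in its own document and processed separately. For a stricter guarantee, the word count could be replaced by the length of the model tokenizer's output.
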

# How to use

## Install medkit

First, install medkit with the following command:

```
pip install 'medkit-lib[optional]'
```

See the [documentation](https://medkit.readthedocs.io/en/latest/user_guide/install.html) for more information and examples.

## Using the model

```python
from medkit.core.text import TextDocument
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# Load the fine-tuned model from the Hugging Face Hub
matcher = HFEntityMatcher(model="camila-ud/DrBERT-CASM2")

test_doc = TextDocument("Elle souffre d'asthme mais n'a pas besoin d'Allegra")
# Run the matcher on the raw text segment of the document
detected_entities = matcher.run([test_doc.raw_segment])

# Show the detected entities as 'label':text pairs
msg = "|".join(f"'{entity.label}':{entity.text}" for entity in detected_entities)
print(f'Text: "{test_doc.text}"\n{msg}')
```

```
Text: "Elle souffre d'asthme mais n'a pas besoin d'Allegra"
'problem':asthme|'treatment':Allegra
```
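
Since the model card metadata declares `library_name: transformers` and `pipeline_tag: token-classification`, the checkpoint can presumably also be loaded with the generic `transformers` pipeline, without medkit. The sketch below is an assumption-laden illustration rather than the documented way to use the model: the helper names `detect_entities` and `format_entities` are hypothetical, and the exact label strings returned depend on the model configuration.

```python
def format_entities(entities):
    """Format aggregated pipeline results as 'label':text pairs joined by '|'."""
    return "|".join(f"'{e['entity_group']}':{e['word']}" for e in entities)


def detect_entities(text, model_name="camila-ud/DrBERT-CASM2"):
    """Run a token-classification pipeline on one text (downloads the model)."""
    # Imported lazily so that format_entities works without transformers installed.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge sub-word tokens into entity spans
    )
    return ner(text)


# Usage (requires network access on the first call):
# entities = detect_entities("Elle souffre d'asthme mais n'a pas besoin d'Allegra")
# print(format_entities(entities))
```

With an aggregation strategy set, each result is a dictionary containing `entity_group`, `score`, `word`, `start` and `end`, so the character offsets can be used to recover the exact source text if `word` has been normalized by the tokenizer.
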

# Training data

This model was fine-tuned on **CASM2**, an internal corpus of clinical cases (in French) annotated by master's students.
The corpus contains more than 5000 medkit documents (~ sentences) with entities to detect.

**Number of documents (~ sentences) by split**

| Split      | # medkit docs |
| ---------- | ------------- |
| Train      | 5824          |
| Validation | 1457          |
| Test       | 1821          |

**Number of examples per entity type**

| Split      | treatment | test | problem |
| ---------- | --------- | ---- | ------- |
| Train      | 3258      | 3990 | 6808    |
| Validation | 842       | 1007 | 1745    |
| Test       | 994       | 1289 | 2113    |
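
To get a feel for the split ratio and the annotation density, the counts from the two tables can be combined. The sketch below only performs arithmetic on the numbers reported above; the variable names are illustrative and nothing here comes from the training code.

```python
# Counts copied from the two tables above.
docs = {"Train": 5824, "Validation": 1457, "Test": 1821}
entities = {
    "Train": {"treatment": 3258, "test": 3990, "problem": 6808},
    "Validation": {"treatment": 842, "test": 1007, "problem": 1745},
    "Test": {"treatment": 994, "test": 1289, "problem": 2113},
}

total_docs = sum(docs.values())

for split, n_docs in docs.items():
    n_entities = sum(entities[split].values())
    share = n_docs / total_docs  # fraction of all documents in this split
    density = n_entities / n_docs  # average number of entities per document
    print(f"{split}: {share:.1%} of documents, {density:.2f} entities per document")
```
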

## Training procedure

This model was fine-tuned on CPU using the medkit trainer; training took about 3 hours.

# Model performance

Model performance computed on the CASM2 test dataset (using the medkit seqeval evaluator):

| Entity    | precision | recall | f1     |
| --------- | --------- | ------ | ------ |
| treatment | 0.7492    | 0.7666 | 0.7578 |
| test      | 0.7449    | 0.8240 | 0.7824 |
| problem   | 0.6884    | 0.7304 | 0.7088 |
| Overall   | 0.7188    | 0.7660 | 0.7416 |
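
The f1 column is the harmonic mean of precision and recall. As a quick sanity check, the short sketch below recomputes it from the rounded precision and recall values in the table; the function name `f1_score` is illustrative, and small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the table above.
reported = {
    "treatment": (0.7492, 0.7666, 0.7578),
    "test": (0.7449, 0.8240, 0.7824),
    "problem": (0.6884, 0.7304, 0.7088),
    "Overall": (0.7188, 0.7660, 0.7416),
}

for entity, (p, r, f1) in reported.items():
    print(f"{entity}: recomputed f1={f1_score(p, r):.4f}, reported f1={f1:.4f}")
```
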

## How to evaluate using medkit

```python
from medkit.text.metrics.ner import SeqEvalEvaluator
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# `test_documents` is assumed to be a list of annotated medkit documents

# load the matcher and get predicted entities by document
matcher = HFEntityMatcher(model="camila-ud/DrBERT-CASM2")
predicted_entities = [matcher.run([doc.raw_segment]) for doc in test_documents]

evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)
```

You can use the tokenizer from HF to evaluate by tokens instead of characters:

```python
from transformers import AutoTokenizer

# reuses SeqEvalEvaluator, test_documents and predicted_entities from the block above
tokenizer_drbert = AutoTokenizer.from_pretrained("camila-ud/DrBERT-CASM2", use_fast=True)

evaluator = SeqEvalEvaluator(tokenizer=tokenizer_drbert, tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)
```
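
The `tagging_scheme="iob2"` argument refers to the IOB2 convention, in which the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and everything else `O`. The sketch below illustrates the idea independently of medkit's implementation: it assumes whitespace tokenization and treats any token that overlaps an entity span as part of that entity. The helper names `whitespace_tokens` and `to_iob2` are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def to_iob2(tokens, entities):
    """Convert entity spans to IOB2 tags.

    tokens: list of (start, end) character offsets.
    entities: list of (start, end, label) character spans.
    """
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        first = True
        for i, (tok_start, tok_end) in enumerate(tokens):
            # A token is part of the entity if the two spans overlap.
            if tok_start < ent_end and tok_end > ent_start:
                tags[i] = f"{'B' if first else 'I'}-{label}"
                first = False
    return tags


text = "Elle souffre d'asthme mais n'a pas besoin d'Allegra"
start_problem = text.index("asthme")
start_treatment = text.index("Allegra")
entities = [
    (start_problem, start_problem + len("asthme"), "problem"),
    (start_treatment, start_treatment + len("Allegra"), "treatment"),
]
print(to_iob2(whitespace_tokens(text), entities))
```

Evaluating with the model tokenizer instead of characters changes what counts as a token, and therefore which positions receive `B-` and `I-` tags, but the tagging convention itself stays the same.
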

# Citation

```
@online{medkit-lib,
  author={HeKA Research Team},
  title={medkit, A Python library for a learning health system},
  url={https://pypi.org/project/medkit-lib/},
  urldate = {2023-07-24},
}
```

```
HeKA Research Team, “medkit, a Python library for a learning health system.” https://pypi.org/project/medkit-lib/ (accessed Jul. 24, 2023).
```