# ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining
ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.
We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: NER (COVID-19 and ViMQ), Acronym Disambiguation, and Summarization.
We introduce two Vietnamese datasets: the acronym dataset (acrDrAid) and the FAQ summarization dataset in the healthcare domain. Our acrDrAid dataset is annotated with 135 sets of keywords.
The general approaches and experimental results of ViHealthBERT can be found in our LREC-2022 Poster [paper]() (updated soon):
@article{vihealthbert,
title = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
author = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
journal = {13th Edition of the Language Resources and Evaluation Conference},
year = {2022}
}
### Installation
- Python 3.6+, and PyTorch >= 1.6
- Install `transformers`:
`pip install transformers==4.2.0`
### Pre-trained models
Model | #params | Arch. | Tokenizer
---|---|---|---
`demdecuong/vihealthbert-base-word` | 135M | base | Word-level
`demdecuong/vihealthbert-base-syllable` | 135M | base | Syllable-level
### Example usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")
# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = vihealthbert(input_ids)  # features[0] holds the last-layer hidden states
```
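If a single vector per sentence is needed, one common option is to average the token features while ignoring padding. This is not described in the paper summary above, so treat it as an assumption rather than the authors' method; the helper name `mean_pool` is hypothetical, and the tensors below are synthetic stand-ins for real model output.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Synthetic example: 2 sentences, 3 positions each, hidden size 2.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                       [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])  # last position of the second sentence is padding
sentence_vectors = mean_pool(hidden, mask)
```

With the real model, the same helper could be applied as `mean_pool(vihealthbert(**enc)[0], enc["attention_mask"])`, where `enc = tokenizer(line, return_tensors="pt")`.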
### Example usage for raw text
ViHealthBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process its pre-training data.
We therefore highly recommend using the same word segmenter for ViHealthBERT downstream applications.
#### Installation
```
# Install the vncorenlp python wrapper
pip3 install vncorenlp
# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter)
mkdir -p vncorenlp/models/wordsegmenter
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
mv VnCoreNLP-1.1.1.jar vncorenlp/
mv vi-vocab vncorenlp/models/wordsegmenter/
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/
```
`VnCoreNLP-1.1.1.jar` (27MB) and folder `models/` must be placed in the same working folder.
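A quick way to confirm the layout before starting the segmenter is to check for the three downloaded files. The sketch below assumes the `vncorenlp/` folder created by the commands above; the function name `missing_vncorenlp_files` is hypothetical and not part of any library.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the download commands above.
REQUIRED_FILES = [
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
]


def missing_vncorenlp_files(root: str) -> List[str]:
    """Return the required files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_vncorenlp_files("vncorenlp")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("VnCoreNLP layout looks complete.")
```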
#### Example usage
```python
# See more details at: https://github.com/vncorenlp/VnCoreNLP
# Load rdrsegmenter from VnCoreNLP
from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m')
# Input
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
# To perform word (and sentence) segmentation
sentences = rdrsegmenter.tokenize(text)
for sentence in sentences:
    print(" ".join(sentence))
```
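To feed raw text to the word-level model, the segmenter output can be joined back into space-separated strings and passed to the tokenizer one sentence at a time. The following is a minimal sketch under that assumption; `segment_and_encode` is a hypothetical helper, and it takes the segmenter and encoder as callables so it can be tried without loading VnCoreNLP or the model.

```python
from typing import Callable, List


def segment_and_encode(
    text: str,
    segment: Callable[[str], List[List[str]]],
    encode: Callable[[str], List[int]],
) -> List[List[int]]:
    """Word-segment raw text, then encode each sentence separately.

    `segment` is expected to behave like `rdrsegmenter.tokenize`
    (one list of tokens per sentence) and `encode` like
    `tokenizer.encode` (one list of ids per input string).
    """
    sentences = segment(text)
    # Join tokens with spaces, as in the print loop above, before encoding.
    return [encode(" ".join(tokens)) for tokens in sentences]
```

With the objects from the earlier examples, this would be called as `segment_and_encode(text, rdrsegmenter.tokenize, tokenizer.encode)`.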