	# <a name="introduction"></a> ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

	ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.

	We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: NER (COVID-19 & ViMQ), Acronym Disambiguation, and Summarization.

	We introduce two Vietnamese datasets: the acronym dataset (acrDrAid) and the FAQ summarization dataset in the healthcare domain. Our acrDrAid dataset is annotated with 135 sets of keywords.
	The general approaches and experimental results of ViHealthBERT can be found in our LREC-2022 Poster [paper]() (updated soon):

	@article{vihealthbert,
	title = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
	author = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
	journal = {13th Edition of its Language Resources and Evaluation Conference},
	year = {2022}
	}

	### Installation <a name="install2"></a>
	- Python 3.6+, and PyTorch >= 1.6
	- Install `transformers`:
	`pip install transformers==4.2.0`

	### Pre-trained models <a name="models2"></a>

	Model | #params | Arch. | Tokenizer
	---|---|---|---
	`demdecuong/vihealthbert-base-word` | 135M | base | Word-level
	`demdecuong/vihealthbert-base-syllable` | 135M | base | Syllable-level
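
According to the table, the two checkpoints share the same size and architecture and differ in tokenizer level, so the choice can be kept in one place. A minimal sketch, assuming two hypothetical helpers, `model_id_for` and `load_vihealthbert`, that are not part of ViHealthBERT or `transformers`:

```python
# Checkpoint names copied from the table above, keyed by tokenizer level.
MODEL_IDS = {
    "word": "demdecuong/vihealthbert-base-word",
    "syllable": "demdecuong/vihealthbert-base-syllable",
}


def model_id_for(tokenizer_level):
    """Return the checkpoint name for "word" or "syllable" (hypothetical helper)."""
    try:
        return MODEL_IDS[tokenizer_level]
    except KeyError:
        raise ValueError(
            "tokenizer_level must be one of: " + ", ".join(sorted(MODEL_IDS))
        ) from None


def load_vihealthbert(tokenizer_level="word"):
    """Load the model and its tokenizer; downloads the checkpoint on first use."""
    # Imported lazily so the lookup above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    model_id = model_id_for(tokenizer_level)
    model = AutoModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_vihealthbert("word")` should return the same objects as calling `AutoModel.from_pretrained` and `AutoTokenizer.from_pretrained` directly with the word-level checkpoint name.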

	### Example usage <a name="usage1"></a>

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
	tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")

	# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
	line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

	input_ids = torch.tensor([tokenizer.encode(line)])
	with torch.no_grad():
	    features = vihealthbert(input_ids)  # features[0] holds the last-layer hidden states
	```
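
In the word-segmented input line used above, the syllables of a multi-syllable word are joined with underscores (for example `sinh_viên`), and words are separated by spaces. A minimal sketch in plain Python of how such a line breaks down; `split_words`, `multi_syllable_words` and `desegment` are hypothetical helper names used only for illustration:

```python
def split_words(segmented_line):
    """Split a word-segmented line into its space-separated words."""
    return segmented_line.split()


def multi_syllable_words(segmented_line):
    """Return the words whose syllables are joined with underscores."""
    return [word for word in split_words(segmented_line) if "_" in word]


def desegment(segmented_line):
    """Replace underscores with spaces to recover syllable-level text.

    Note: this also rewrites any literal underscore that was in the raw text.
    """
    return segmented_line.replace("_", " ")


line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
print(split_words(line))           # words as the word-level model expects them
print(multi_syllable_words(line))  # words made of more than one syllable
print(desegment(line))             # the same sentence without word segmentation
```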

	### Example usage for raw text <a name="usage2"></a>
	ViHealthBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process its pre-training data, so we highly recommend using the same word segmenter for ViHealthBERT downstream applications.

	#### Installation
	```bash
	# Install the vncorenlp python wrapper
	pip3 install vncorenlp

	# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter)
	mkdir -p vncorenlp/models/wordsegmenter
	wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
	wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
	wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
	mv VnCoreNLP-1.1.1.jar vncorenlp/
	mv vi-vocab vncorenlp/models/wordsegmenter/
	mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/
	```

	`VnCoreNLP-1.1.1.jar` (27MB) and folder `models/` must be placed in the same working folder.

	#### Example usage
	```python
	# See more details at: https://github.com/vncorenlp/VnCoreNLP

	# Load rdrsegmenter from VnCoreNLP
	from vncorenlp import VnCoreNLP
	rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m')

	# Input
	text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

	# To perform word (and sentence) segmentation
	sentences = rdrsegmenter.tokenize(text)
	for sentence in sentences:
	    print(" ".join(sentence))
	```
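
`tokenize` returns one list of tokens per sentence, so each sentence has to be joined back into a single word-segmented string before it is passed to the ViHealthBERT tokenizer. A minimal sketch, assuming a hypothetical helper `to_segmented_lines`; the `sample` value is hand-written in the same list-of-token-lists shape so that the snippet runs without the VnCoreNLP jar, and it is not actual segmenter output:

```python
def to_segmented_lines(sentences):
    """Join each tokenized sentence into one space-separated string.

    `sentences` is a list of token lists; empty sentences are skipped.
    """
    return [" ".join(tokens) for tokens in sentences if tokens]


# Hand-written stand-in for the result of rdrsegmenter.tokenize(text).
sample = [
    ["Bà", "Lan", "làm_việc", "tại", "Hà_Nội", "."],
    ["Ông", "Chúc", "cũng", "làm_việc", "tại", "đây", "."],
]

for line in to_segmented_lines(sample):
    print(line)
```

Each resulting line can then be encoded with `tokenizer.encode(line)`, in the same way as the word-segmented example.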