	---
	language: dv
	---

	# dv-labse

	This is an experiment in cross-lingual transfer learning that inserts Dhivehi word and
	word-piece tokens into Google's LaBSE model.

	- Original model weights: https://huggingface.co/setu4993/LaBSE
	- Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

	On the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets),
	this model currently outperforms dv-wave and dv-muril (a similar transfer learning model):

	- mBERT: 52%
	- dv-wave (ELECTRA): 89%
	- dv-muril: 90.7%
	- dv-labse: 91.3-91.5% (may continue training)

	## Training

	- Start with LaBSE (similar to mBERT), which has no Thaana vocabulary
	- Using PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Sinhalese or English translations
	- Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
	- Continue BERT pretraining on Dhivehi text
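
The vocabulary steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the notebook: `extend_vocab` is a hypothetical helper, the vocabulary and vectors are toy values, the word pairs are illustrative rather than taken from PanLex, and initializing unmatched tokens with small random vectors is an assumption (the README does not say how they are initialized).

```python
import random


def extend_vocab(vocab, embeddings, aligned, extra_tokens, seed=0, scale=0.02):
    """Append new tokens to a vocabulary and its embedding table.

    vocab:        list of existing tokens; position i matches embeddings[i]
    embeddings:   list of equal-length float lists, one per token
    aligned:      dict of new token -> existing token whose vector is copied
    extra_tokens: new tokens that have no dictionary match
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    new_vocab = list(vocab)
    new_embeddings = [list(row) for row in embeddings]
    index = {tok: i for i, tok in enumerate(new_vocab)}

    def random_vector():
        # Assumption: unmatched tokens start from small Gaussian noise.
        return [rng.gauss(0.0, scale) for _ in range(dim)]

    def add(tok, vector):
        index[tok] = len(new_vocab)
        new_vocab.append(tok)
        new_embeddings.append(vector)

    # Dictionary-matched words start from a copy of their translation's vector.
    for new_tok, source_tok in aligned.items():
        if new_tok in index:
            continue  # already in the vocabulary
        if source_tok in index:
            add(new_tok, list(new_embeddings[index[source_tok]]))
        else:
            add(new_tok, random_vector())  # translation not in vocab

    # Remaining words and word-pieces are appended at the end of the vocabulary.
    for tok in extra_tokens:
        if tok not in index:
            add(tok, random_vector())

    return new_vocab, new_embeddings


# Toy data: a 4-token vocabulary with 2-dimensional embeddings.
vocab = ["[PAD]", "[UNK]", "water", "book"]
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.5, -0.2], [0.3, 0.9]]
aligned = {"ފެން": "water", "ފޮތް": "book"}  # illustrative pairs
extra_tokens = ["##ން"]  # illustrative word-piece

new_vocab, new_embeddings = extend_vocab(vocab, embeddings, aligned, extra_tokens)
print(len(new_vocab))  # 7
```

With a real checkpoint, the same idea would apply to the model's input embedding matrix after the tokenizer vocabulary grows, for example by resizing it with `resize_token_embeddings` in Hugging Face Transformers and then overwriting the new rows. Continued pretraining then adjusts all of these vectors.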

	Colab notebook:
	https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi