metadata

language: dv

dv-labse

This is an experiment in cross-lingual transfer learning: it inserts Dhivehi word and word-piece tokens into Google's LaBSE model.

The model currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on the Maldivian News Classification task: https://github.com/Sofwath/DhivehiDatasets

Training

Start with LaBSE (similar to mBERT), which has no Thaana vocabulary
Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
Continue BERT pretraining on Dhivehi text
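
The vocabulary steps above (attaching dictionary-matched words to existing embeddings, then adding the remaining tokens) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used to build dv-labse. It assumes that "attach" means a new Dhivehi token starts from a copy of its translation's embedding row, and that unmatched tokens get small random vectors. The function name `extend_vocab`, the toy vocabulary, and the word pairs are all hypothetical and are not taken from PanLex or dv-wave.

```python
import random


def extend_vocab(vocab, embeddings, aligned, extra_tokens, seed=0):
    """Append new tokens to a vocabulary and its embedding table.

    vocab:        dict mapping existing token -> row index
    embeddings:   list of rows (lists of floats), one per existing token
    aligned:      dict mapping new word -> existing token whose row is copied
    extra_tokens: new tokens without a dictionary match
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]

    # Dictionary-matched words: start from the translation's embedding
    # (assumed interpretation of "attach" in the steps above).
    for word, source in aligned.items():
        if word in new_vocab or source not in vocab:
            continue
        new_vocab[word] = len(new_embeddings)
        new_embeddings.append(list(embeddings[vocab[source]]))

    # Remaining words and word-pieces: small random initialisation
    # (assumption), to be learned during continued pretraining.
    for token in extra_tokens:
        if token in new_vocab:
            continue
        new_vocab[token] = len(new_embeddings)
        new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])

    return new_vocab, new_embeddings


# Toy data for illustration only (hypothetical pairs and 2-d vectors).
vocab = {"[PAD]": 0, "water": 1, "book": 2}
embeddings = [[0.0, 0.0], [0.5, -0.1], [0.2, 0.9]]
aligned = {"ފެން": "water", "ފޮތް": "book"}
extra_tokens = ["##ން"]

new_vocab, new_embeddings = extend_vocab(vocab, embeddings, aligned, extra_tokens)
```

In a real setup the same idea would be applied to the model's input embedding matrix after the tokenizer's vocab.txt is extended, and the copied rows would only be a starting point that continued pretraining on Dhivehi text then adjusts.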