MizBERT: A Masked Language Model for Mizo Text Understanding

Overview

MizBERT is a masked language model (MLM) pre-trained on a corpus of Mizo text data. It is based on the BERT (Bidirectional Encoder Representations from Transformers) architecture and leverages the MLM objective to effectively learn contextual representations of words in the Mizo language.

Key Features

  • Mizo-Specific: MizBERT is specifically tailored to the Mizo language, capturing its unique linguistic nuances and vocabulary.
  • MLM Objective: The MLM objective trains MizBERT to predict masked words based on the surrounding context, fostering a deep understanding of Mizo semantics.
  • Contextual Embeddings: MizBERT generates contextualized word embeddings that encode the meaning of a word in relation to its surrounding text (see the embedding sketch after this list).
  • Transfer Learning: MizBERT's pre-trained weights can be fine-tuned for various downstream tasks in Mizo NLP, such as text classification, question answering, and sentiment analysis.
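
As a quick illustration of the contextual-embeddings point above, the sketch below loads the MizBERT encoder with AutoModel (without the masked-LM head) and reads per-token vectors from last_hidden_state. The example sentence is the Mizo proverb used later in this card; the variable names and the mean-pooling step are illustrative choices, not part of the released model.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("robzchhangte/mizbert")
model = AutoModel.from_pretrained("robzchhangte/mizbert")  # encoder only, no MLM head

# Encode a Mizo sentence and run it through the encoder
inputs = tokenizer("Miten kan thiltih atangin min teh thin", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state       # shape: (1, seq_len, hidden_size)
sentence_embedding = token_embeddings.mean(dim=1)  # simple mean pooling over tokens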

Potential Applications

  • Mizo NLP Research: MizBERT can serve as a valuable foundation for further research in Mizo natural language processing.
  • Mizo Machine Translation: Fine-tuned MizBERT models can be used to develop machine translation systems between Mizo and other languages.
  • Mizo Text Classification: MizBERT can be adapted for tasks like sentiment analysis, topic modeling, and spam detection in Mizo text (see the fine-tuning sketch after this list).
  • Mizo Question Answering: Fine-tuned MizBERT models can power systems that answer questions posed in Mizo.
  • Mizo Chatbots: MizBERT can be integrated into chatbots so they understand and respond to Mizo more effectively.
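
As a sketch of how the text-classification use case above could be set up: MizBERT can be loaded with a randomly initialized classification head and fine-tuned with the standard Transformers Trainer. The tiny in-memory dataset, the number of labels, and the output directory below are placeholders for your own labeled Mizo data, not anything shipped with the model.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("robzchhangte/mizbert")
# num_labels=2 is a placeholder for a binary task such as sentiment analysis
model = AutoModelForSequenceClassification.from_pretrained("robzchhangte/mizbert", num_labels=2)

# Toy dataset only to make the sketch runnable; replace with your own Mizo examples and labels
train_ds = Dataset.from_dict({
    "text": ["Miten kan thiltih atangin min teh thin", "Miten kan thiltih atangin min teh thin"],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mizbert-classifier", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()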

Getting Started

To use MizBERT in your Mizo NLP projects, first install the Hugging Face Transformers library:

pip install transformers

Then load the tokenizer and model from the Hugging Face Hub as you would any other pre-trained model in the library:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("robzchhangte/mizbert")
model = AutoModelForMaskedLM.from_pretrained("robzchhangte/mizbert")
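
If you prefer to work with the raw model outputs rather than the fill-mask pipeline shown in the next section, a minimal sketch (the sentence and variable names are just for illustration) looks like this:

import torch

# Tokenize a Mizo sentence containing one [MASK] token
inputs = tokenizer("Miten kan thiltih [MASK] min teh thin", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Locate the [MASK] position and decode its highest-scoring prediction
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))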

Predicting the Masked Token

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="robzchhangte/mizbert")

sentence = "Miten kan thiltih [MASK] min teh thin"  # Expected token: "atangin". In English: "A tree is known by its fruit."
predictions = fill_mask(sentence)

for prediction in predictions:
    print(prediction["sequence"].replace("[CLS]", "").replace("[SEP]", "").strip(), "| Score:", prediction["score"])

Citation

@article{10.1145/3666003,
  author = {Lalramhluna, Robert and Dash, Sandeep and Pakray, Dr. Partha},
  title = {MizBERT: A Mizo BERT Model},
  year = {2024},
  issue_date = {July 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {23},
  number = {7},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3666003},
  doi = {10.1145/3666003},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = {jun},
  articleno = {99},
  numpages = {14},
  keywords = {Mizo, BERT, pre-trained language model},
  abstract = {This research investigates the utilization of pre-trained BERT transformers within the context of the Mizo language. BERT, an abbreviation for Bidirectional Encoder Representations from Transformers, symbolizes Google’s forefront neural network approach to Natural Language Processing (NLP), renowned for its remarkable performance across various NLP tasks. However, its efficacy in handling low-resource languages such as Mizo remains largely unexplored. In this study, we introduce MizBERT, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, MizBERT has been tailored to accommodate the nuances of the Mizo language. Evaluation of MizBERT’s capabilities is conducted using two primary metrics: masked language modeling and perplexity, yielding scores of 76.12% and 3.2565, respectively. Additionally, its performance in a text classification task is examined. Results indicate that MizBERT outperforms both the Multilingual BERT model and the Support Vector Machine algorithm, achieving an accuracy of 98.92%. This underscores MizBERT’s proficiency in understanding and processing the intricacies inherent in the Mizo language.}
}