Model Card for HyeBERT

HyeBERT is a pre-trained language model for Armenian, trained with a masked language modeling objective. The architecture is based on BERT, but the model was trained exclusively on the Armenian subset of OSCAR, a cleaned and de-duplicated subset of the Common Crawl dataset. The "Hye" in HyeBERT comes from the Armenian word for "Armenian".

Disclaimer: this model is not specifically trained for either the Western or Eastern dialect, though the data likely contain more examples of Eastern Armenian.

Model Description

HyeBERT shares the same architecture as BERT: it is a stacked transformer model trained on a large corpus of Armenian without any human annotations. However, it was trained using only the masked language modeling task (replacing 15% of tokens with [MASK] and predicting them from the other tokens in the text), without the next-sentence prediction task, which makes it more akin to RoBERTa. Unlike RoBERTa, however, it tokenizes using WordPiece rather than BPE.

Intended Uses

Direct Use

As a masked language model (MLM), this model can be used to predict a masked word in a sentence. It can also be used for text generation, though generation is better handled by an autoregressive model such as GPT.

Downstream Use

The ideal use of this model is fine-tuning on a specific classification task for Armenian.

Bias, Risks, and Limitations

As mentioned earlier, this model was not trained exclusively on either Western or Eastern Armenian, which may lead to inconsistencies in its internal representation of the language's syntax and lexicon. In addition, many of the training texts include content from other languages (mostly English and Russian), which may affect the model's performance.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
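In the absence of an official snippet, the following is a minimal sketch of masked-word prediction with the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `aking11/hyebert` is an assumption and may need to be replaced with the actual repository id or a local path; the helper names `load_fill_mask` and `top_predictions` are hypothetical.

```python
def load_fill_mask(model_id="aking11/hyebert"):
    """Build a fill-mask pipeline. `model_id` is an assumed Hub identifier."""
    # Imported lazily so that top_predictions() can be used (and tested)
    # without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text, k=5):
    """Return (token, score) pairs for a text containing exactly one mask token."""
    # With a single mask, the pipeline returns a list of dicts that include
    # the keys "token_str" and "score".
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `top_predictions(load_fill_mask(), "Երևանը [MASK] մայրաքաղաքն է։")` would list the model's candidates for the masked word in "Yerevan is the capital of [MASK]." The first call downloads the model weights, so it requires network access.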

Training Details

Training Data

This model was trained on the Armenian subset of the OSCAR corpus, which is a cleaned version of the Common Crawl. The training data consist of roughly XXX documents, with roughly YYY tokens in total. 2% of the total dataset was held out and used for validation.
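A 2% hold-out of this kind can be sketched as follows. This is an illustration only: the card does not say how the validation documents were selected, so the random shuffle, the seed, and the function name `train_val_split` are all assumptions.

```python
import random


def train_val_split(documents, val_fraction=0.02, seed=0):
    """Shuffle documents and hold out `val_fraction` of them for validation."""
    docs = list(documents)
    # A fixed seed keeps the split reproducible across runs (assumed, not
    # stated in the card).
    random.Random(seed).shuffle(docs)
    n_val = max(1, round(len(docs) * val_fraction))
    return docs[n_val:], docs[:n_val]  # (train, validation)


train_docs, val_docs = train_val_split(f"doc-{i}" for i in range(1000))
print(len(train_docs), len(val_docs))
```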

Training Procedure

The model was trained by masking 15% of tokens and predicting the identity of those masked tokens from the unmasked tokens in each training example. Training ran for 3 epochs, and the masked positions in a given text were re-sampled for each epoch, i.e., the masks moved around from epoch to epoch.
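The per-epoch re-masking described above can be sketched in plain Python. This is a simplified illustration, not the training code: it masks exactly 15% of positions and always substitutes `[MASK]`, whereas BERT-style pipelines often sample each position independently and sometimes keep or randomly replace the selected token. The card does not say which variant was used, and the function name and seeding scheme are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, epoch, mask_prob=0.15, base_seed=0):
    """Return (masked_tokens, labels); labels are None at unmasked positions."""
    if not tokens:
        return [], []
    # Seeding with the epoch number makes the mask positions differ between
    # epochs while staying reproducible within one epoch.
    rng = random.Random(f"{base_seed}-{epoch}")
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


tokens = [f"tok{i}" for i in range(20)]
for epoch in range(3):
    masked, _ = mask_tokens(tokens, epoch)
    print(epoch, [i for i, t in enumerate(masked) if t == MASK_TOKEN])
```

The loss is then computed only at positions where the label is not `None`.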

Preprocessing

No major preprocessing was applied. Texts of fewer than 5 characters were removed, and texts were limited to 512 tokens.
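These two steps can be sketched as a small filter. Several details are assumptions: whitespace splitting stands in for the WordPiece tokenizer, longer texts are truncated rather than dropped or split (the card only says they were "limited"), and the name `preprocess` is hypothetical.

```python
def preprocess(texts, tokenize=str.split, min_chars=5, max_tokens=512):
    """Drop very short texts and cap the rest at `max_tokens` tokens."""
    for text in texts:
        if len(text) < min_chars:
            continue  # texts of fewer than 5 characters are removed
        # Truncation is an assumed reading of "limited to 512 tokens".
        yield tokenize(text)[:max_tokens]


sample = ["abcd", "Բարև աշխարհ", "բառ " * 600]
for tokens in preprocess(sample):
    print(len(tokens))
```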

Training Hyperparameters

Optimizer: AdamW
Learning rate: 1e-4
Num. attention heads: 12
Num. hidden layers: 6
Vocab. size: 30,000
Embedding size: 768
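For readers who want to rebuild the architecture, the values above might map onto a `transformers` `BertConfig` roughly as follows. The mapping of "Embedding size" to `hidden_size` is an assumption, any parameter not listed in this card would fall back to the library defaults, and the names `HYEBERT_CONFIG`, `head_dim`, and `build_bert_config` are hypothetical.

```python
# Architecture values taken from the list above; the key names follow
# BertConfig's keyword arguments.
HYEBERT_CONFIG = {
    "vocab_size": 30_000,
    "hidden_size": 768,          # assumed to correspond to "Embedding size"
    "num_hidden_layers": 6,
    "num_attention_heads": 12,
}


def head_dim(config):
    """Per-head dimension; the hidden size must divide evenly across heads."""
    hidden, heads = config["hidden_size"], config["num_attention_heads"]
    if hidden % heads:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


def build_bert_config(config=HYEBERT_CONFIG):
    """Create a BertConfig; unlisted parameters keep the library defaults."""
    from transformers import BertConfig  # lazy import; needs transformers

    return BertConfig(**config)


print(head_dim(HYEBERT_CONFIG))
```

With 6 hidden layers, the model is half as deep as BERT-base, which uses 12 layers with the same hidden size and head count.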

Evaluation

At the end of each epoch, the loss was computed on a held-out validation set roughly 2% the size of the total data.

0 evaluating....
    val_loss: 0.47787963975066194

1 evaluating....
    val_loss: 0.47497553823474115

2 evaluating....
    val_loss: 0.4765327044259816
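To make these numbers easier to interpret, the losses can be converted to perplexities and compared across epochs. This sketch assumes the logged value is a mean cross-entropy in nats over masked tokens, which the card does not state; a masked-token perplexity of this kind is also not comparable to the perplexity of an autoregressive model.

```python
import math

# Validation losses copied from the log above, indexed by epoch.
val_losses = [0.47787963975066194, 0.47497553823474115, 0.4765327044259816]


def perplexity(loss):
    """exp(loss), assuming the loss is a mean cross-entropy in nats."""
    return math.exp(loss)


best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)

for epoch, loss in enumerate(val_losses):
    marker = " (lowest)" if epoch == best_epoch else ""
    print(f"epoch {epoch}: val_loss={loss:.4f} perplexity={perplexity(loss):.3f}{marker}")
```

The lowest validation loss occurs at epoch 1, and the three values differ by less than 0.003.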

Model Card Authors

Adam King

Model Card Contact

[email protected]

aking11/hyebert