---
language: hy
tags:
- bert
- armenian
- mlm
- llm
license: mit
datasets:
- oscar
---

# Model Card for HyeBERT

Pre-trained language model for Armenian, trained with a masked language modeling strategy. The architecture is based on [BERT](https://arxiv.org/abs/1810.04805), but the model is trained exclusively on the Armenian-language subset of OSCAR, a cleaned and de-duplicated subset of the Common Crawl dataset (hence the `Hye` in HyeBERT).

Disclaimer: this model is not specifically trained for either the Western or Eastern dialect, though the data likely contain more examples of Eastern Armenian.

### Model Description

HyeBERT shares the same architecture as BERT: it is a stacked transformer model trained on a large corpus of Armenian without any human annotations. However, it was trained using only the masked language modeling task (replacing 15% of tokens with `[MASK]` and predicting them from the other tokens in the text) and not next-sentence prediction, making it more akin to [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). Unlike RoBERTa, however, it tokenizes using WordPiece rather than BPE.

## Intended Uses

### Direct Use

As an MLM, this model can be used to predict masked words in a sentence or to generate text, though generation would be better served by an autoregressive model like GPT.

### Downstream Use

The ideal use of this model is fine-tuning on a specific classification task for Armenian.

## Bias, Risks, and Limitations

As mentioned earlier, this model is not trained exclusively on Western or Eastern Armenian, which may lead to problems in its internal representation of the language's syntax and lexicon. In addition, many of the training texts include content from other languages (mostly English and Russian), which may affect the performance of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
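Until an official snippet is provided, a minimal sketch using the `transformers` fill-mask pipeline would look like the following; the model id is a placeholder assumption, so substitute the actual Hub repository name before running:

```python
from transformers import pipeline

# "path/to/HyeBERT" is a placeholder, not the real repository id.
fill_mask = pipeline("fill-mask", model="path/to/HyeBERT")

# Predict the token hidden behind [MASK] in an Armenian sentence.
for prediction in fill_mask("Երևանը Հայաստանի [MASK] է։"):
    print(prediction["token_str"], prediction["score"])
```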

## Training Details

### Training Data

This model was trained on the Armenian subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus, which is a cleaned version of the Common Crawl. The training data consisted of roughly XXX documents, with roughly YYY tokens in total. 2% of the total dataset was held out and used for validation.

### Training Procedure

The model was trained by masking 15% of tokens and predicting the identity of those masked tokens from the unmasked items in a training datum. The model was trained for 3 epochs, and the identity of the masked tokens for a given text was reassigned each epoch, i.e., the masks moved around from epoch to epoch.
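The per-epoch masking step can be sketched in plain Python. This is a simplified illustration, not the actual training code; the `[MASK]` token id and the 15% rate below follow the description above, and the seed stands in for the per-epoch reshuffling:

```python
import random

MASK_ID = 103      # conventional BERT [MASK] token id (an assumption)
MASK_PROB = 0.15   # 15% of tokens are masked, as described above

def mask_tokens(token_ids, seed=None):
    """Return (masked_ids, labels): labels hold the original id at
    masked positions and -100 (ignored by the loss) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < MASK_PROB:
            masked.append(MASK_ID)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(-100)  # not scored
    return masked, labels

# Re-running with a different seed reassigns the masks, i.e. the
# masks "move around" from epoch to epoch.
epoch0 = mask_tokens(list(range(1000)), seed=0)
epoch1 = mask_tokens(list(range(1000)), seed=1)
```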

#### Preprocessing

No major preprocessing. Texts of fewer than 5 characters were removed, and texts were limited to 512 tokens.
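These two filters can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: `tokenize` here splits on whitespace purely as a stand-in for the WordPiece tokenizer:

```python
def preprocess(texts, max_tokens=512, min_chars=5):
    """Drop very short texts and truncate the rest to max_tokens."""
    def tokenize(text):
        # Stand-in for WordPiece tokenization.
        return text.split()

    kept = []
    for text in texts:
        if len(text) < min_chars:   # texts under 5 characters are removed
            continue
        kept.append(tokenize(text)[:max_tokens])  # limit to 512 tokens
    return kept
```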

#### Training Hyperparameters

- Optimizer: AdamW
- Learning rate: `1e4`
- Num. attention heads: 12
- Num. hidden layers: 6
- Vocab. size: 30,000
- Embedding size: 768
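Assuming the standard `transformers` `BertConfig` was used, the architecture hyperparameters above would map to a configuration like this (a sketch, not the released config; all unlisted values are left at the BERT defaults):

```python
from transformers import BertConfig

# Values taken from the hyperparameter list above.
config = BertConfig(
    vocab_size=30_000,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
)
```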

## Evaluation

At the end of each epoch, the loss was computed on a held-out validation set, roughly 2% of the size of the total data.
```
0 evaluating....
val_loss: 0.47787963975066194

1 evaluating....
val_loss: 0.47497553823474115

2 evaluating....
val_loss: 0.4765327044259816
```

## Model Card Authors

Adam King

## Model Card Contact