---
language: hy
tags:
- exbert
- armenian
- mlm
- llm
license: mit
datasets:
- oscar
---

# Model Card for HyeBERT

HyeBERT is a language model pre-trained on Armenian with a masked language modeling (MLM) objective. The architecture is based on [BERT](https://arxiv.org/abs/1810.04805), but the model was trained exclusively on Armenian (hence the `Hye` in HyeBERT), using the Armenian subset of OSCAR, a cleaned and de-duplicated subset of the Common Crawl dataset.

Disclaimer: this model is not specifically trained for either the Western or Eastern dialect, though the data likely contain more examples of Eastern Armenian.

### Model Description

HyeBERT shares the same architecture as BERT; it is a stacked transformer model trained on a large corpus of Armenian without any human annotations. However, it was trained using only the masked language modeling task (replacing 15% of tokens with `[MASK]` and trying to predict them from the other tokens in the text), without the next-sentence prediction task, making it more akin to [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). Unlike RoBERTa, however, it tokenizes using WordPiece rather than BPE.


## Intended Uses

### Direct Use

As an MLM, this model can be used to predict masked words in a sentence. It can also be used for text generation, though generation is better done with an autoregressive model like GPT.

### Downstream Use

The ideal use of this model is fine-tuning on a specific classification task for Armenian.

## Bias, Risks, and Limitations

As mentioned earlier, this model is not trained exclusively on Western or Eastern Armenian, which may lead to problems in its internal representation of the language's syntax and lexicon. In addition, many of the training texts include content from other languages (mostly English and Russian), which may affect the performance of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

This model was trained on the Armenian subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus, which is a cleaned version of the Common Crawl. The training data consisted of roughly XXX documents, with roughly YYY tokens in total. 2% of the total dataset was held out and used for validation.
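The 2% hold-out can be illustrated with a short sketch. This is not the original training code: the function name `train_val_split`, the shuffling step, and the fixed seed are assumptions made for illustration, and the card does not say how the held-out documents were actually selected.

```python
import random


def train_val_split(documents, val_fraction=0.02, seed=0):
    """Shuffle the documents and hold out a fraction for validation.

    Hypothetical helper: random selection with a fixed seed is an
    assumption, not something the model card specifies.
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    # Keep at least one validation document, even for tiny corpora.
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy corpus standing in for the Armenian OSCAR documents.
docs = [f"doc-{i}" for i in range(1000)]
train_docs, val_docs = train_val_split(docs)
print(len(train_docs), len(val_docs))
```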

### Training Procedure 

The model was trained by masking 15% of tokens and predicting the identity of those masked tokens from the unmasked items in a training datum. The model was trained for 3 epochs, and the masked positions in a given text were reassigned for each epoch, i.e., the masks moved around each epoch.
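The per-epoch re-masking described above can be sketched in plain Python. This is a minimal illustration, not the original training code: `mask_tokens` is a hypothetical helper, it selects exactly 15% of positions (rounded) rather than sampling each token independently, and it always substitutes `[MASK]`. Whether training also used BERT's random-token and keep-original replacements is not stated in this card.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels), where labels maps position -> original token.

    Hypothetical helper for illustration only.
    """
    if rng is None:
        rng = random.Random()
    if not tokens:
        return [], {}
    # Mask roughly 15% of positions, and at least one.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


# The same text is re-masked on every epoch, so the masked positions
# generally differ from one epoch to the next.
tokens = [f"tok{i}" for i in range(20)]
rng = random.Random(0)
for epoch in range(3):
    masked, labels = mask_tokens(tokens, rng=rng)
    print(epoch, sorted(labels))
```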

#### Preprocessing 

No major preprocessing was applied. Texts of fewer than 5 characters were removed, and texts were limited to 512 tokens.
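A minimal sketch of these two filtering steps follows. The `preprocess` function is hypothetical, the whitespace tokenizer is only a stand-in for the model's WordPiece tokenizer, and treating the 512-token limit as truncation (rather than splitting long texts into chunks) is an assumption.

```python
def preprocess(texts, tokenize, min_chars=5, max_tokens=512):
    """Drop very short texts and truncate the rest to max_tokens tokens.

    Hypothetical helper; truncation (not chunking) is an assumption.
    """
    kept = []
    for text in texts:
        if len(text) < min_chars:
            continue  # fewer than 5 characters: removed
        kept.append(tokenize(text)[:max_tokens])
    return kept


# str.split stands in for the real WordPiece tokenizer.
texts = ["բարև", "Բարև աշխարհ", "ա " * 600]
processed = preprocess(texts, tokenize=str.split)
print([len(tokens) for tokens in processed])
```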


#### Training Hyperparameters

- Optimizer: AdamW
- Learning rate: `1e-4`
- Num. attention heads: 12
- Num. hidden layers: 6
- Vocab. size: 30,000
- Embedding size: 768
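
The architecture values above can be collected into a plain dictionary as a quick sanity check. The key names follow the argument names of `BertConfig` in the `transformers` library, and mapping "Embedding size" to `hidden_size` is an assumption. The sketch only verifies that the hidden size divides evenly across the attention heads.

```python
# Architecture values from the list above; key names mirror BertConfig
# arguments. Mapping "Embedding size" to hidden_size is an assumption.
hyebert_config = {
    "vocab_size": 30_000,
    "hidden_size": 768,
    "num_hidden_layers": 6,
    "num_attention_heads": 12,
}

# Each attention head operates on an equal slice of the hidden state,
# so the hidden size must be divisible by the number of heads.
head_dim, remainder = divmod(
    hyebert_config["hidden_size"], hyebert_config["num_attention_heads"]
)
if remainder != 0:
    raise ValueError("hidden_size must be divisible by num_attention_heads")
print(head_dim)
```

With `transformers` installed, a dictionary like this could presumably be passed as `BertConfig(**hyebert_config)`, leaving all other settings at the library defaults.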

## Evaluation

At the end of each epoch, the loss was computed on a held-out validation set, roughly 2% of the total data.
```
0 evaluating....
	val_loss: 0.47787963975066194

1 evaluating....
	val_loss: 0.47497553823474115

2 evaluating....
	val_loss: 0.4765327044259816
```
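
The logged values can be compared with a few lines of Python. The sketch picks the epoch with the lowest validation loss and converts each loss to a perplexity-style number via `exp(loss)`. That conversion assumes the logged value is a mean cross-entropy in nats over the predicted tokens, which this card does not confirm, so the derived numbers are indicative only.

```python
import math

# Validation losses copied from the log above (epochs are 0-indexed).
val_losses = [0.47787963975066194, 0.47497553823474115, 0.4765327044259816]

# Epoch with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)

# exp(loss) is a perplexity only if the loss is a mean cross-entropy
# in nats; that is an assumption here.
perplexities = [math.exp(loss) for loss in val_losses]

for epoch, (loss, ppl) in enumerate(zip(val_losses, perplexities)):
    marker = " <- lowest" if epoch == best_epoch else ""
    print(f"epoch {epoch}: val_loss={loss:.4f} exp(loss)={ppl:.3f}{marker}")
```

The differences between epochs are small (under 0.003), so the loss had largely plateaued after the first epoch.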

## Model Card Authors

Adam King

## Model Card Contact

[email protected]