---
language: hy
tags:
- bert
- armenian
- mlm
- llm
license: mit
datasets:
- oscar
---

# Model Card for HyeBERT

Pre-trained language model for Armenian, trained with a masked language modeling strategy. The architecture is based on [BERT](https://arxiv.org/abs/1810.04805), but the model is trained exclusively on the Armenian-language subset of OSCAR, a cleaned and de-duplicated subset of the Common Crawl dataset (hence the `Hye` in HyeBERT).

Disclaimer: this model is not specifically trained for either the Western or Eastern dialect, though the data likely contain more examples of Eastern Armenian.

### Model Description

HyeBERT shares the same architecture as BERT: it is a stacked transformer model trained on a large corpus of Armenian without any human annotations. However, it was trained using only the masked language modeling task (replacing 15% of tokens with `[MASK]` and predicting them from the other tokens in the text) and not next-sentence prediction, making it more akin to [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf). Unlike RoBERTa, however, it tokenizes using WordPiece rather than BPE.

## Intended Uses

### Direct Use

As an MLM, this model can be used to predict masked words in a sentence or to generate text, though generation would be better served by an autoregressive model like GPT.

### Downstream Use

The ideal use of this model is fine-tuning on a specific classification task for Armenian.

## Bias, Risks, and Limitations

As mentioned earlier, this model is not trained exclusively on Western or Eastern Armenian, which may lead to problems in its internal representation of the language's syntax and lexicon. In addition, many of the training texts include content from other languages (mostly English and Russian), which may affect the performance of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
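Until an official snippet is provided, a minimal sketch using the `transformers` fill-mask pipeline would look like the following; the model id is a placeholder assumption, so substitute the actual Hub repository name before running:

```python
from transformers import pipeline

# "path/to/HyeBERT" is a placeholder, not the real repository id.
fill_mask = pipeline("fill-mask", model="path/to/HyeBERT")

# Predict the token hidden behind [MASK] in an Armenian sentence.
for prediction in fill_mask("Երևանը Հայաստանի [MASK] է։"):
    print(prediction["token_str"], prediction["score"])
```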

## Training Details

### Training Data

This model was trained on the Armenian subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus, which is a cleaned version of the Common Crawl. The training data consisted of roughly XXX documents, with roughly YYY tokens in total. 2% of the total dataset was held out and used for validation.

### Training Procedure

The model was trained by masking 15% of tokens and predicting the identity of those masked tokens from the unmasked items in a training datum. The model was trained for 3 epochs, and the identity of the masked tokens for a given text was reassigned each epoch, i.e., the masks moved around from epoch to epoch.
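The per-epoch masking step can be sketched in plain Python. This is a simplified illustration, not the actual training code; the `[MASK]` token id and the 15% rate below follow the description above, and the seed stands in for the per-epoch reshuffling:

```python
import random

MASK_ID = 103      # conventional BERT [MASK] token id (an assumption)
MASK_PROB = 0.15   # 15% of tokens are masked, as described above

def mask_tokens(token_ids, seed=None):
    """Return (masked_ids, labels): labels hold the original id at
    masked positions and -100 (ignored by the loss) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < MASK_PROB:
            masked.append(MASK_ID)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(-100)  # not scored
    return masked, labels

# Re-running with a different seed reassigns the masks, i.e. the
# masks "move around" from epoch to epoch.
epoch0 = mask_tokens(list(range(1000)), seed=0)
epoch1 = mask_tokens(list(range(1000)), seed=1)
```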

#### Preprocessing

No major preprocessing. Texts of fewer than 5 characters were removed, and texts were limited to 512 tokens.
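These two filters can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: `tokenize` here splits on whitespace purely as a stand-in for the WordPiece tokenizer:

```python
def preprocess(texts, max_tokens=512, min_chars=5):
    """Drop very short texts and truncate the rest to max_tokens."""
    def tokenize(text):
        # Stand-in for WordPiece tokenization.
        return text.split()

    kept = []
    for text in texts:
        if len(text) < min_chars:   # texts under 5 characters are removed
            continue
        kept.append(tokenize(text)[:max_tokens])  # limit to 512 tokens
    return kept
```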

#### Training Hyperparameters

- Optimizer: AdamW
- Learning rate: `1e4`
- Num. attention heads: 12
- Num. hidden layers: 6
- Vocab. size: 30,000
- Embedding size: 768
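Assuming the standard `transformers` `BertConfig` was used, the architecture hyperparameters above would map to a configuration like this (a sketch, not the released config; all unlisted values are left at the BERT defaults):

```python
from transformers import BertConfig

# Values taken from the hyperparameter list above.
config = BertConfig(
    vocab_size=30_000,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
)
```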

## Evaluation

At the end of each epoch, the loss was computed on a held-out validation set, roughly 2% of the size of the total data.
```
0 evaluating....
val_loss: 0.47787963975066194

1 evaluating....
val_loss: 0.47497553823474115

2 evaluating....
val_loss: 0.4765327044259816
```

## Model Card Authors

Adam King

## Model Card Contact