Init commit (#1)


- Init commit (41aee3ead7ff81fdb5767b1d58106e2a9e9d0cc4)

Co-authored-by: Daniels Buls <[email protected]>

Files changed (8)

.gitignore +2 -0
README.md +95 -4
config.json +50 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +57 -0
vocab.txt +0 -0

.gitignore ADDED

	@@ -0,0 +1,2 @@


+# Ignore .idea folder
+.idea/

README.md CHANGED

@@ -1,8 +1,99 @@
 ---
-license: apache-2.0
+license: mit
 datasets:
-- AiLab-IMCS-UL/lv_go_emotions
+- SkyWater21/lv_go_emotions
 language:
 - lv
-pipeline_tag: text-classification
----
+---
+Fine-tuned [LVBERT](https://huggingface.co/AiLab-IMCS-UL/lvbert) for multi-label emotion classification.
+The model was trained on the [lv_go_emotions](https://huggingface.co/datasets/SkyWater21/lv_go_emotions) dataset, a Latvian translation of the [GoEmotions](https://huggingface.co/datasets/go_emotions) dataset. Google Translate was used to generate the machine translation.
+The original 27 emotions were mapped to 6 base emotions following Ekman's theory; the neutral label was kept as a seventh class.
+Labels predicted by the classifier:
+```yaml
+0: anger
+1: disgust
+2: fear
+3: joy
+4: sadness
+5: surprise
+6: neutral
+```
+Mapping from the 27 GoEmotions emotions (plus neutral) to the 6 Ekman base emotions:
+|GoEmotions|Ekman|
+|---|---|
+| admiration | joy|
+| amusement | joy|
+| anger | anger|
+| annoyance | anger|
+| approval | joy|
+| caring | joy|
+| confusion | surprise|
+| curiosity | surprise|
+| desire | joy|
+| disappointment | sadness|
+| disapproval | anger|
+| disgust | disgust|
+| embarrassment | sadness|
+| excitement | joy|
+| fear | fear|
+| gratitude | joy|
+| grief | sadness|
+| joy | joy|
+| love | joy|
+| nervousness | fear|
+| optimism | joy|
+| pride | joy|
+| realization | surprise|
+| relief | joy|
+| remorse | sadness|
+| sadness | sadness|
+| surprise | surprise|
+| neutral | neutral|
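+The random number generators are seeded with 42:
+```python
+import random
+import numpy as np
+import torch
+
+def set_seed(seed=42):
+    """Seed Python, NumPy, and PyTorch (CPU and all CUDA devices)."""
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+```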
+Training parameters:
+```yaml
+max_length: null
+batch_size: 32
+shuffle: True
+num_workers: 2
+pin_memory: False
+drop_last: False
+optimizer: adam
+lr: 0.00001
+weight_decay: 0
+problem_type: multi_label_classification
+num_epochs: 3
+```
+Evaluation results on the test split of [lv_go_emotions](https://huggingface.co/datasets/SkyWater21/lv_go_emotions/viewer/simplified_ekman):
+|              |Precision|Recall|F1-Score|AUC-ROC|Support|
+|--------------|---------|------|--------|-------|-------|
+|anger         |     0.57|  0.40|    0.47|   0.85|   726|
+|disgust       |     0.64|  0.28|    0.39|   0.93|   123|
+|fear          |     0.63|  0.54|    0.58|   0.95|    98|
+|joy           |     0.80|  0.79|    0.79|   0.91|  2104|
+|sadness       |     0.70|  0.44|    0.54|   0.90|   379|
+|surprise      |     0.63|  0.44|    0.52|   0.89|   677|
+|neutral       |     0.65|  0.62|    0.64|   0.83|  1787|
+|micro avg     |     0.70|  0.61|    0.66|   0.93|  5894|
+|macro avg     |     0.66|  0.50|    0.56|   0.89|  5894|
+|weighted avg  |     0.69|  0.61|    0.65|   0.88|  5894|
+|samples avg   |     0.65|  0.63|    0.63|    nan|  5894|
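
The GoEmotions-to-Ekman table in the README can be read as a plain lookup followed by a multi-hot encoding over the seven output labels. The sketch below copies the mapping from that table; the helper name `to_ekman_multi_hot` is hypothetical, and this is an illustration of the mapping, not necessarily how the dataset was actually built.

```python
# Mapping copied from the README table (27 emotions plus neutral).
GOEMOTIONS_TO_EKMAN = {
    "admiration": "joy", "amusement": "joy", "anger": "anger",
    "annoyance": "anger", "approval": "joy", "caring": "joy",
    "confusion": "surprise", "curiosity": "surprise", "desire": "joy",
    "disappointment": "sadness", "disapproval": "anger", "disgust": "disgust",
    "embarrassment": "sadness", "excitement": "joy", "fear": "fear",
    "gratitude": "joy", "grief": "sadness", "joy": "joy", "love": "joy",
    "nervousness": "fear", "optimism": "joy", "pride": "joy",
    "realization": "surprise", "relief": "joy", "remorse": "sadness",
    "sadness": "sadness", "surprise": "surprise", "neutral": "neutral",
}

# Output order follows the label ids listed in the README (0..6).
EKMAN_LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def to_ekman_multi_hot(goemotions_labels):
    """Collapse fine-grained labels into a 7-dimensional multi-hot vector.

    Hypothetical helper: several fine-grained labels may collapse into the
    same base emotion, so a set is used to deduplicate them.
    """
    targets = {GOEMOTIONS_TO_EKMAN[label] for label in goemotions_labels}
    return [1.0 if name in targets else 0.0 for name in EKMAN_LABELS]


# "annoyance" maps to anger (id 0) and "curiosity" to surprise (id 5).
print(to_ekman_multi_hot(["annoyance", "curiosity"]))
```

Because the task is multi-label, an example tagged with both `annoyance` and `curiosity` keeps two active targets after the mapping instead of being forced into a single class.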

config.json ADDED

	@@ -0,0 +1,50 @@

+{
+  "_name_or_path": "AiLab-IMCS-UL/lvbert",
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "anger",
+    "1": "disgust",
+    "2": "fear",
+    "3": "joy",
+    "4": "sadness",
+    "5": "surprise",
+    "6": "neutral"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "anger": 0,
+    "disgust": 1,
+    "fear": 2,
+    "joy": 3,
+    "neutral": 6,
+    "sadness": 4,
+    "surprise": 5
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "problem_type": "multi_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.39.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 32004
+}
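
Since `problem_type` is `multi_label_classification`, each of the seven logits is scored independently with a sigmoid rather than jointly with a softmax. The following sketch decodes raw logits into labels using the `id2label` table from this config; the function names and the 0.5 threshold are assumptions for illustration, as the commit does not state which threshold was used for the reported metrics.

```python
import math

# Copied from "id2label" in config.json.
ID2LABEL = {
    0: "anger", 1: "disgust", 2: "fear", 3: "joy",
    4: "sadness", 5: "surprise", 6: "neutral",
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_multi_label(logits, threshold=0.5):
    """Return every label whose independent probability reaches the threshold.

    The threshold value is an assumption, not taken from the commit.
    """
    probs = [sigmoid(z) for z in logits]
    return [ID2LABEL[i] for i, p in enumerate(probs) if p >= threshold]


# Hypothetical logits for one input sentence, in label-id order.
print(decode_multi_label([2.0, -3.0, -1.5, 0.1, -4.0, -2.0, -0.2]))
```

With this decoding an input can receive zero, one, or several labels, which is consistent with the README reporting micro, macro, and samples averages.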

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fdf83f57d45707e774742a21d24d695e6222f14e18d41b5748a8c0f28a9e1d3
+size 442526732
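
The three lines above are a Git LFS pointer: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. A downloaded `model.safetensors` can be checked against the pointer roughly as follows; the function names are hypothetical and the local file path is left to the caller.

```python
import hashlib

# Pointer contents as committed (diff "+" markers removed).
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:3fdf83f57d45707e774742a21d24d695e6222f14e18d41b5748a8c0f28a9e1d3
size 442526732
"""


def parse_lfs_pointer(text):
    """Split a pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file once, comparing both its size and its digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer` with the path of a local copy of the weights and the parsed `pointer` returns `True` only if both the byte count and the hash agree.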

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
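
The `model_max_length` value in this file equals `int(1e30)`, which appears to be the placeholder the tokenizer library writes when no limit was configured, while `config.json` caps `max_position_embeddings` at 512. The sketch below derives a usable sequence limit from the two values; the helper names are hypothetical, and clamping to 512 is a suggestion for inference rather than something this commit configures.

```python
# Values copied from tokenizer_config.json and config.json in this commit.
MODEL_MAX_LENGTH = 1000000000000000019884624838656
MAX_POSITION_EMBEDDINGS = 512


def effective_max_length(model_max_length, max_position_embeddings):
    """Pick the tighter of the tokenizer limit and the model's position limit."""
    return min(model_max_length, max_position_embeddings)


def truncate_ids(token_ids, limit):
    """Keep at most `limit` token ids, dropping the tail (hypothetical helper)."""
    return token_ids[:limit]


limit = effective_max_length(MODEL_MAX_LENGTH, MAX_POSITION_EMBEDDINGS)
print(limit)
print(len(truncate_ids(list(range(600)), limit)))
```

Without some explicit limit of this kind, an input longer than 512 tokens would exceed the model's position embeddings, so passing a maximum length at tokenization time is likely the safer default.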

vocab.txt ADDED

The diff for this file is too large to render. See raw diff