YagiASAFAS committed on
Commit
c04024e
1 Parent(s): 0a0e3a5

Add tokenizer files

Files changed (5)
  1. README.md +27 -73
  2. config.json +36 -36
  3. pytorch_model.bin +1 -1
  4. tokenizer.json +0 -0
  5. training_args.bin +1 -1
README.md CHANGED
@@ -5,58 +5,33 @@ tags:
  metrics:
  - accuracy
  model-index:
- - name: malaysia-news-classification-bert-english
  results: []
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
- # malaysia-news-classification-bert-english
 
- This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the tnwei/ms-newspapers dataset.
  It achieves the following results on the evaluation set:
- - Loss: 1.1168
- - Accuracy: 0.8948
 
  ## Model description
 
- This BERT-based classification model categorizes English-language news articles from Malaysia into 18 distinct labels, allowing for a nuanced view of local news themes. It builds on the pre-trained BERT (Bidirectional Encoder Representations from Transformers) architecture, known for its effectiveness in natural language processing tasks thanks to its deep modeling of language context.
 
  ## Intended uses & limitations
 
- The model is designed to categorize English-language news articles from Malaysia into 18 predefined categories, reflecting Malaysia's particular cultural and contextual nuances in the news.
- While the model is optimized for Malaysian news content, it has several limitations:
-
- - Cultural and Contextual Specificity: The model is trained to interpret and categorize news within Malaysia's cultural and contextual framework. Its accuracy and relevance drop significantly on news content from other countries or in other languages.
- - Generalizability: Training on Malaysian-specific news limits generalizability to broader, international contexts. The model may misread cultural nuances, idiomatic expressions, or context-specific references that do not appear in Malaysian news.
- - Dynamic News Landscape: The model may require periodic retraining to stay relevant as news topics and the cultural and societal context evolve; both the model and its training data may need updates.
- - Bias and Sensitivity: Like any data-driven model, it risks inheriting biases from the training dataset. Continuous monitoring is needed to ensure it does not perpetuate cultural bias or insensitivity.
 
  ## Training and evaluation data
 
- The model was trained and evaluated on a dataset of 18,003 English-language news articles from Malaysia, categorized into 18 distinct classes. The dataset was split into two subsets:
-
- - Training Data: 80% of the dataset (approximately 14,402 articles) was used for training, giving the model ample examples covering a wide range of topics and linguistic nuances.
- - Validation Data: The remaining 20% (approximately 3,601 articles) was used for validation, to gauge the model's performance and generalization on unseen data and help mitigate overfitting and bias.
-
- ### Evaluation Strategy
- Evaluations were run at regular step intervals to monitor the model's performance and adjust parameters if necessary. Frequent evaluation helps identify the best configuration during training and limits extended periods of overfitting.
-
- ### Performance Metrics
- The model's performance was assessed using loss and accuracy:
-
- - Training Loss: Decreased consistently from 0.6359 in the first epoch to effectively zero by the last epoch, indicating good learning progress.
- - Validation Loss: Increased over the epochs, suggesting limited generalization despite the decreasing training loss.
- - Accuracy: Rose to 89.4751% by the end of training, reflecting the model's ability to correctly classify a high proportion of the validation set.
 
  ## Training procedure
- Training ran for 16 epochs with the following settings:
-
- - Batch Size: An instantaneous batch size of 8 per device, with no parallel or distributed training, for a total train batch size of 8.
- - Optimization Steps: The model completed 28,816 optimization steps in total, consistent with the batch size and data volume.
- - Optimizer: AdamW, a variant of Adam that handles weight decay more effectively. A FutureWarning recommended switching to PyTorch's native AdamW implementation in future runs.
- - Gradient Accumulation: Gradients were accumulated over multiple steps to improve training stability with small batch sizes.
 
  ### Training hyperparameters
 
@@ -70,52 +45,31 @@ The following hyperparameters were used during training:
  - num_epochs: 16
  - mixed_precision_training: Native AMP
 
- ## Label Mappings
- This model can predict the following labels:
- - `0`: Election
- - `1`: Political Issue
- - `2`: Corruption
- - `3`: Democracy
- - `4`: Economic Growth
- - `5`: Economic Disparity
- - `6`: Economic Subsidy
- - `7`: Ethnic Discrimination
- - `8`: Ethnic Relation
- - `9`: Ethnic Culture
- - `10`: Religious Issue
- - `11`: Business and Finance
- - `12`: Sport
- - `13`: Food
- - `14`: Entertainment
- - `15`: Environmental Issue
- - `16`: Domestic News
- - `17`: World News
-
  ### Training results
 
- | Training Loss | Epoch | Step | Validation Loss | Accuracy |
- |:-------------:|:-----:|:-----:|:---------------:|:--------:|
- | 0.6359 | 1.0 | 1801 | 0.5633 | 0.8453 |
- | 0.4081 | 2.0 | 3602 | 0.6275 | 0.8686 |
- | 0.2978 | 3.0 | 5403 | 0.6597 | 0.8792 |
- | 0.1796 | 4.0 | 7204 | 0.7368 | 0.8750 |
- | 0.1567 | 5.0 | 9005 | 0.9289 | 0.8636 |
- | 0.0998 | 6.0 | 10806 | 0.9482 | 0.8792 |
- | 0.0798 | 7.0 | 12607 | 0.9669 | 0.8781 |
- | 0.0619 | 8.0 | 14408 | 0.9942 | 0.8859 |
- | 0.0668 | 9.0 | 16209 | 1.0687 | 0.8781 |
- | 0.0298 | 10.0 | 18010 | 1.1998 | 0.8711 |
- | 0.0206 | 11.0 | 19811 | 1.2359 | 0.8775 |
- | 0.0174 | 12.0 | 21612 | 1.1635 | 0.8759 |
- | 0.013 | 13.0 | 23413 | 1.1226 | 0.8825 |
- | 0.0051 | 14.0 | 25214 | 1.1463 | 0.8848 |
- | 0.0073 | 15.0 | 27015 | 1.1652 | 0.8889 |
- | 0.0 | 16.0 | 28816 | 1.1168 | 0.8948 |
 
  ### Framework versions
 
  - Transformers 4.18.0
  - Pytorch 2.2.1+cu121
- - Datasets 2.18.0
  - Tokenizers 0.12.1
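As a sanity check, the numbers in the removed card are mutually consistent: the 80/20 split, the per-epoch step count in the results table (1801, 3602, …), and the 28,816 total optimization steps all follow from one another. A quick sketch, assuming the effective batch size of 8 stated above:

```python
import math

total_articles = 18_003
train_articles = round(total_articles * 0.80)   # 14,402 ("approximately 14,402" above)
val_articles = total_articles - train_articles  # 3,601

effective_batch_size = 8                        # total train batch size from the card
steps_per_epoch = math.ceil(train_articles / effective_batch_size)  # 1,801
total_steps = steps_per_epoch * 16              # 28,816 steps over 16 epochs

print(steps_per_epoch, total_steps)
```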
 
  metrics:
  - accuracy
  model-index:
+ - name: malaysia-news-classification-bert-english-skewness-fixed
  results: []
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
+ # malaysia-news-classification-bert-english-skewness-fixed
 
+ This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on an unknown dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 1.2051
+ - Accuracy: 0.8436
 
  ## Model description
 
+ More information needed
 
  ## Intended uses & limitations
 
+ More information needed
 
  ## Training and evaluation data
 
+ More information needed
 
  ## Training procedure
 
  ### Training hyperparameters
 
  - num_epochs: 16
  - mixed_precision_training: Native AMP
 
  ### Training results
 
+ | Training Loss | Epoch | Step | Validation Loss | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:--------:|
+ | No log | 1.0 | 358 | 0.9357 | 0.7486 |
+ | 1.3554 | 2.0 | 716 | 0.9041 | 0.7807 |
+ | 0.4851 | 3.0 | 1074 | 0.7842 | 0.8282 |
+ | 0.4851 | 4.0 | 1432 | 0.9478 | 0.8226 |
+ | 0.2558 | 5.0 | 1790 | 1.0765 | 0.8282 |
+ | 0.1084 | 6.0 | 2148 | 1.1310 | 0.8380 |
+ | 0.0625 | 7.0 | 2506 | 1.0999 | 0.8464 |
+ | 0.0625 | 8.0 | 2864 | 1.1391 | 0.8408 |
+ | 0.0301 | 9.0 | 3222 | 1.1036 | 0.8506 |
+ | 0.0171 | 10.0 | 3580 | 1.0765 | 0.8534 |
+ | 0.0171 | 11.0 | 3938 | 1.1291 | 0.8506 |
+ | 0.0129 | 12.0 | 4296 | 1.1360 | 0.8520 |
+ | 0.0035 | 13.0 | 4654 | 1.1619 | 0.8450 |
+ | 0.0039 | 14.0 | 5012 | 1.1727 | 0.8534 |
+ | 0.0039 | 15.0 | 5370 | 1.2079 | 0.8408 |
+ | 0.0031 | 16.0 | 5728 | 1.2051 | 0.8436 |
 
  ### Framework versions
 
  - Transformers 4.18.0
  - Pytorch 2.2.1+cu121
+ - Datasets 2.19.0
  - Tokenizers 0.12.1
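The `No log` entry and the duplicated loss values in the new table are consistent with the Trainer logging training loss only every `logging_steps` optimizer steps (500 is the library default; that this run used it is an assumption). With 5728 total steps over 16 epochs, each epoch spans only 358 steps, so some epochs end before a fresh loss is logged:

```python
steps_per_epoch = 5728 // 16   # 358, matching the Step column above
logging_steps = 500            # transformers Trainer default (assumed for this run)

last_logged = 0  # optimizer steps completed at the most recent log
for epoch in range(1, 17):
    end_step = epoch * steps_per_epoch
    if end_step - last_logged >= logging_steps:
        # at least one new loss was logged during this epoch
        last_logged = (end_step // logging_steps) * logging_steps
        status = "fresh training loss"
    else:
        status = "No log / repeated loss"
    print(f"epoch {epoch:2d}: {status}")
```

Under this assumption the epochs without a fresh value come out as 1, 4, 8, 11, and 15, which are exactly the rows above showing `No log` or a loss repeated from the previous epoch.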
config.json CHANGED
@@ -10,46 +10,46 @@
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
- "0": "Election",
- "1": "Political Issue",
- "2": "Corruption",
- "3": "Democracy",
- "4": "Economic Growth",
- "5": "Economic Disparity",
- "6": "Economic Subsidy",
- "7": "Ethnic Discrimination",
- "8": "Ethnic Relation",
- "9": "Ethnic Culture",
- "10": "Religious Issue",
- "11": "Business and Finance",
- "12": "Sport",
- "13": "Food",
- "14": "Entertainment",
- "15": "Environmental Issues",
- "16": "Domestic News",
- "17": "World News"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
- "Election": 0,
- "Political Issue": 1,
- "Corruption": 2,
- "Democracy": 3,
- "Economic Growth": 4,
- "Economic Disparity": 5,
- "Economic Subsidy": 6,
- "Ethnic Discrimination": 7,
- "Ethnic Relation": 8,
- "Ethnic Culture": 9,
- "Religious Issue": 10,
- "Business and Finance": 11,
- "Sport": 12,
- "Food": 13,
- "Entertainment": 14,
- "Environmental Issues": 15,
- "Domestic News": 16,
- "World News": 17
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
 
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1",
+ "2": "LABEL_2",
+ "3": "LABEL_3",
+ "4": "LABEL_4",
+ "5": "LABEL_5",
+ "6": "LABEL_6",
+ "7": "LABEL_7",
+ "8": "LABEL_8",
+ "9": "LABEL_9",
+ "10": "LABEL_10",
+ "11": "LABEL_11",
+ "12": "LABEL_12",
+ "13": "LABEL_13",
+ "14": "LABEL_14",
+ "15": "LABEL_15",
+ "16": "LABEL_16",
+ "17": "LABEL_17"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1,
+ "LABEL_10": 10,
+ "LABEL_11": 11,
+ "LABEL_12": 12,
+ "LABEL_13": 13,
+ "LABEL_14": 14,
+ "LABEL_15": 15,
+ "LABEL_16": 16,
+ "LABEL_17": 17,
+ "LABEL_2": 2,
+ "LABEL_3": 3,
+ "LABEL_4": 4,
+ "LABEL_5": 5,
+ "LABEL_6": 6,
+ "LABEL_7": 7,
+ "LABEL_8": 8,
+ "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
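This commit replaces the human-readable label names with generic `LABEL_n` entries, so predictions from the updated checkpoint surface as e.g. `LABEL_12` rather than `Sport`. The mapping removed above can be rebuilt in a few lines (and, if desired, passed back in via the `id2label`/`label2id` keyword arguments that `from_pretrained` in `transformers` accepts). A minimal sketch:

```python
# Label names removed from config.json in this commit, in id order (0..17).
LABELS = [
    "Election", "Political Issue", "Corruption", "Democracy",
    "Economic Growth", "Economic Disparity", "Economic Subsidy",
    "Ethnic Discrimination", "Ethnic Relation", "Ethnic Culture",
    "Religious Issue", "Business and Finance", "Sport", "Food",
    "Entertainment", "Environmental Issues", "Domestic News", "World News",
]
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in id2label.items()}

# Translate a generic prediction back to its category:
pred = "LABEL_12"
print(id2label[int(pred.removeprefix("LABEL_"))])  # prints "Sport"
```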
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3ac18c0ad98d0038c47a7fdff3c1c233c67d38d0accfe9ff59cf4506568ea5db
  size 438057586
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:116462010c7257bebb94dcae1576e4d27d367feae196641b5b0d56bf006d7f67
  size 438057586
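Both versions of `pytorch_model.bin` shown here are Git LFS pointer files: the `oid sha256:` line is simply the SHA-256 digest of the real weight file, so a downloaded copy can be verified locally. A sketch (the file path in the comment is hypothetical):

```python
import hashlib

def lfs_oid(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, as recorded in the `oid sha256:` line of a Git LFS pointer."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# Hypothetical check against the new pointer above:
# lfs_oid("pytorch_model.bin") == "116462010c7257bebb94dcae1576e4d27d367feae196641b5b0d56bf006d7f67"
```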
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ab842ae6c73e55691a6fa58cc64c5e96c7ba8c8074a3a9bc4cb9a2251bf29012
  size 3576
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:e339ada0cb016077ae43fbacd80614bba57b4f9d5d2915bee4bddc1956e4108b
  size 3576