YagiASAFAS committed on
Commit
c04024e
1 Parent(s): 0a0e3a5

Add tokenizer files

Files changed (5)
  1. README.md +27 -73
  2. config.json +36 -36
  3. pytorch_model.bin +1 -1
  4. tokenizer.json +0 -0
  5. training_args.bin +1 -1
README.md CHANGED
@@ -5,58 +5,33 @@ tags:
  metrics:
  - accuracy
  model-index:
- - name: malaysia-news-classification-bert-english
  results: []
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
- # malaysia-news-classification-bert-english
 
- This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the tnwei/ms-newspapers dataset.
  It achieves the following results on the evaluation set:
- - Loss: 1.1168
- - Accuracy: 0.8948
 
  ## Model description
 
- This BERT-based classification model categorizes English-language news articles from Malaysia into 18 distinct labels, allowing for a nuanced view of local news themes. It builds on the pre-trained BERT (Bidirectional Encoder Representations from Transformers) architecture, known for its effectiveness in natural language processing tasks thanks to its deep modeling of language context.
 
  ## Intended uses & limitations
 
- The model is designed to categorize English-language news articles from Malaysia into 18 predefined categories, reflecting Malaysia's particular cultural and contextual nuances in the news.
- While the model is optimized for Malaysian news content, it has several limitations:
-
- - Cultural and Contextual Specificity: The model is trained to interpret and categorize news within Malaysia's cultural and contextual framework. Its accuracy and relevance drop significantly on news content from other countries or in other languages.
- - Generalizability: Training on Malaysian-specific news limits generalizability to broader, international contexts. The model may misread cultural nuances, idiomatic expressions, or context-specific references that do not appear in Malaysian news.
- - Dynamic News Landscape: The model may require periodic retraining to stay relevant as news topics and the cultural and societal context evolve; both the model and its training data may need updates.
- - Bias and Sensitivity: Like any data-driven model, it risks inheriting biases from the training dataset. Continuous monitoring is needed to ensure it does not perpetuate cultural bias or insensitivity.
 
  ## Training and evaluation data
 
- The model was trained and evaluated on a dataset of 18,003 English-language news articles from Malaysia, categorized into 18 distinct classes. The dataset was split into two subsets:
-
- - Training Data: 80% of the dataset (approximately 14,402 articles) was used for training, giving the model ample examples covering a wide range of topics and linguistic nuances.
- - Validation Data: The remaining 20% (approximately 3,601 articles) was used for validation, to gauge the model's performance and generalization on unseen data and help mitigate overfitting and bias.
-
- ### Evaluation Strategy
- Evaluations were run at regular step intervals to monitor the model's performance and adjust parameters if necessary. Frequent evaluation helps identify the best configuration during training and limits extended periods of overfitting.
-
- ### Performance Metrics
- The model's performance was assessed using loss and accuracy:
-
- - Training Loss: Decreased consistently from 0.6359 in the first epoch to effectively zero by the last epoch, indicating good learning progress.
- - Validation Loss: Increased over the epochs, suggesting limited generalization despite the decreasing training loss.
- - Accuracy: Rose to 89.4751% by the end of training, reflecting the model's ability to correctly classify a high proportion of the validation set.
 
  ## Training procedure
- Training ran for 16 epochs with the following settings:
-
- - Batch Size: An instantaneous batch size of 8 per device, with no parallel or distributed training, for a total train batch size of 8.
- - Optimization Steps: The model completed 28,816 optimization steps in total, consistent with the batch size and data volume.
- - Optimizer: AdamW, a variant of Adam that handles weight decay more effectively. A FutureWarning recommended switching to PyTorch's native AdamW implementation in future runs.
- - Gradient Accumulation: Gradients were accumulated over multiple steps to improve training stability with small batch sizes.
 
  ### Training hyperparameters
 
@@ -70,52 +45,31 @@ The following hyperparameters were used during training:
  - num_epochs: 16
  - mixed_precision_training: Native AMP
 
- ## Label Mappings
- This model can predict the following labels:
- - `0`: Election
- - `1`: Political Issue
- - `2`: Corruption
- - `3`: Democracy
- - `4`: Economic Growth
- - `5`: Economic Disparity
- - `6`: Economic Subsidy
- - `7`: Ethnic Discrimination
- - `8`: Ethnic Relation
- - `9`: Ethnic Culture
- - `10`: Religious Issue
- - `11`: Business and Finance
- - `12`: Sport
- - `13`: Food
- - `14`: Entertainment
- - `15`: Environmental Issue
- - `16`: Domestic News
- - `17`: World News
-
  ### Training results
 
- | Training Loss | Epoch | Step | Validation Loss | Accuracy |
- |:-------------:|:-----:|:-----:|:---------------:|:--------:|
- | 0.6359 | 1.0 | 1801 | 0.5633 | 0.8453 |
- | 0.4081 | 2.0 | 3602 | 0.6275 | 0.8686 |
- | 0.2978 | 3.0 | 5403 | 0.6597 | 0.8792 |
- | 0.1796 | 4.0 | 7204 | 0.7368 | 0.8750 |
- | 0.1567 | 5.0 | 9005 | 0.9289 | 0.8636 |
- | 0.0998 | 6.0 | 10806 | 0.9482 | 0.8792 |
- | 0.0798 | 7.0 | 12607 | 0.9669 | 0.8781 |
- | 0.0619 | 8.0 | 14408 | 0.9942 | 0.8859 |
- | 0.0668 | 9.0 | 16209 | 1.0687 | 0.8781 |
- | 0.0298 | 10.0 | 18010 | 1.1998 | 0.8711 |
- | 0.0206 | 11.0 | 19811 | 1.2359 | 0.8775 |
- | 0.0174 | 12.0 | 21612 | 1.1635 | 0.8759 |
- | 0.013 | 13.0 | 23413 | 1.1226 | 0.8825 |
- | 0.0051 | 14.0 | 25214 | 1.1463 | 0.8848 |
- | 0.0073 | 15.0 | 27015 | 1.1652 | 0.8889 |
- | 0.0 | 16.0 | 28816 | 1.1168 | 0.8948 |
 
  ### Framework versions
 
  - Transformers 4.18.0
  - Pytorch 2.2.1+cu121
- - Datasets 2.18.0
  - Tokenizers 0.12.1
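As a sanity check, the numbers in the removed card are mutually consistent: the 80/20 split, the per-epoch step count in the results table (1801, 3602, …), and the 28,816 total optimization steps all follow from one another. A quick sketch, assuming the effective batch size of 8 stated above:

```python
import math

total_articles = 18_003
train_articles = round(total_articles * 0.80)   # 14,402 ("approximately 14,402" above)
val_articles = total_articles - train_articles  # 3,601

effective_batch_size = 8                        # total train batch size from the card
steps_per_epoch = math.ceil(train_articles / effective_batch_size)  # 1,801
total_steps = steps_per_epoch * 16              # 28,816 steps over 16 epochs

print(steps_per_epoch, total_steps)
```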
 
  metrics:
  - accuracy
  model-index:
+ - name: malaysia-news-classification-bert-english-skewness-fixed
  results: []
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
+ # malaysia-news-classification-bert-english-skewness-fixed
 
+ This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on an unknown dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 1.2051
+ - Accuracy: 0.8436
 
  ## Model description
 
+ More information needed
 
  ## Intended uses & limitations
 
+ More information needed
 
  ## Training and evaluation data
 
+ More information needed
 
  ## Training procedure
 
  ### Training hyperparameters
 
  - num_epochs: 16
  - mixed_precision_training: Native AMP
 
  ### Training results
 
+ | Training Loss | Epoch | Step | Validation Loss | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:--------:|
+ | No log | 1.0 | 358 | 0.9357 | 0.7486 |
+ | 1.3554 | 2.0 | 716 | 0.9041 | 0.7807 |
+ | 0.4851 | 3.0 | 1074 | 0.7842 | 0.8282 |
+ | 0.4851 | 4.0 | 1432 | 0.9478 | 0.8226 |
+ | 0.2558 | 5.0 | 1790 | 1.0765 | 0.8282 |
+ | 0.1084 | 6.0 | 2148 | 1.1310 | 0.8380 |
+ | 0.0625 | 7.0 | 2506 | 1.0999 | 0.8464 |
+ | 0.0625 | 8.0 | 2864 | 1.1391 | 0.8408 |
+ | 0.0301 | 9.0 | 3222 | 1.1036 | 0.8506 |
+ | 0.0171 | 10.0 | 3580 | 1.0765 | 0.8534 |
+ | 0.0171 | 11.0 | 3938 | 1.1291 | 0.8506 |
+ | 0.0129 | 12.0 | 4296 | 1.1360 | 0.8520 |
+ | 0.0035 | 13.0 | 4654 | 1.1619 | 0.8450 |
+ | 0.0039 | 14.0 | 5012 | 1.1727 | 0.8534 |
+ | 0.0039 | 15.0 | 5370 | 1.2079 | 0.8408 |
+ | 0.0031 | 16.0 | 5728 | 1.2051 | 0.8436 |
 
  ### Framework versions
 
  - Transformers 4.18.0
  - Pytorch 2.2.1+cu121
+ - Datasets 2.19.0
  - Tokenizers 0.12.1
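The `No log` entry and the duplicated loss values in the new table are consistent with the Trainer logging training loss only every `logging_steps` optimizer steps (500 is the library default; that this run used it is an assumption). With 5728 total steps over 16 epochs, each epoch spans only 358 steps, so some epochs end before a fresh loss is logged:

```python
steps_per_epoch = 5728 // 16   # 358, matching the Step column above
logging_steps = 500            # transformers Trainer default (assumed for this run)

last_logged = 0  # optimizer steps completed at the most recent log
for epoch in range(1, 17):
    end_step = epoch * steps_per_epoch
    if end_step - last_logged >= logging_steps:
        # at least one new loss was logged during this epoch
        last_logged = (end_step // logging_steps) * logging_steps
        status = "fresh training loss"
    else:
        status = "No log / repeated loss"
    print(f"epoch {epoch:2d}: {status}")
```

Under this assumption the epochs without a fresh value come out as 1, 4, 8, 11, and 15, which are exactly the rows above showing `No log` or a loss repeated from the previous epoch.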
config.json CHANGED
@@ -10,46 +10,46 @@
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
- "0": "Election",
- "1": "Political Issue",
- "2": "Corruption",
- "3": "Democracy",
- "4": "Economic Growth",
- "5": "Economic Disparity",
- "6": "Economic Subsidy",
- "7": "Ethnic Discrimination",
- "8": "Ethnic Relation",
- "9": "Ethnic Culture",
- "10": "Religious Issue",
- "11": "Business and Finance",
- "12": "Sport",
- "13": "Food",
- "14": "Entertainment",
- "15": "Environmental Issues",
- "16": "Domestic News",
- "17": "World News"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
- "Election": 0,
- "Political Issue": 1,
- "Corruption": 2,
- "Democracy": 3,
- "Economic Growth": 4,
- "Economic Disparity": 5,
- "Economic Subsidy": 6,
- "Ethnic Discrimination": 7,
- "Ethnic Relation": 8,
- "Ethnic Culture": 9,
- "Religious Issue": 10,
- "Business and Finance": 11,
- "Sport": 12,
- "Food": 13,
- "Entertainment": 14,
- "Environmental Issues": 15,
- "Domestic News": 16,
- "World News": 17
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
 
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1",
+ "2": "LABEL_2",
+ "3": "LABEL_3",
+ "4": "LABEL_4",
+ "5": "LABEL_5",
+ "6": "LABEL_6",
+ "7": "LABEL_7",
+ "8": "LABEL_8",
+ "9": "LABEL_9",
+ "10": "LABEL_10",
+ "11": "LABEL_11",
+ "12": "LABEL_12",
+ "13": "LABEL_13",
+ "14": "LABEL_14",
+ "15": "LABEL_15",
+ "16": "LABEL_16",
+ "17": "LABEL_17"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1,
+ "LABEL_10": 10,
+ "LABEL_11": 11,
+ "LABEL_12": 12,
+ "LABEL_13": 13,
+ "LABEL_14": 14,
+ "LABEL_15": 15,
+ "LABEL_16": 16,
+ "LABEL_17": 17,
+ "LABEL_2": 2,
+ "LABEL_3": 3,
+ "LABEL_4": 4,
+ "LABEL_5": 5,
+ "LABEL_6": 6,
+ "LABEL_7": 7,
+ "LABEL_8": 8,
+ "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
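This commit replaces the human-readable label names with generic `LABEL_n` entries, so predictions from the updated checkpoint surface as e.g. `LABEL_12` rather than `Sport`. The mapping removed above can be rebuilt in a few lines (and, if desired, passed back in via the `id2label`/`label2id` keyword arguments that `from_pretrained` in `transformers` accepts). A minimal sketch:

```python
# Label names removed from config.json in this commit, in id order (0..17).
LABELS = [
    "Election", "Political Issue", "Corruption", "Democracy",
    "Economic Growth", "Economic Disparity", "Economic Subsidy",
    "Ethnic Discrimination", "Ethnic Relation", "Ethnic Culture",
    "Religious Issue", "Business and Finance", "Sport", "Food",
    "Entertainment", "Environmental Issues", "Domestic News", "World News",
]
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in id2label.items()}

# Translate a generic prediction back to its category:
pred = "LABEL_12"
print(id2label[int(pred.removeprefix("LABEL_"))])  # prints "Sport"
```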
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3ac18c0ad98d0038c47a7fdff3c1c233c67d38d0accfe9ff59cf4506568ea5db
  size 438057586
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:116462010c7257bebb94dcae1576e4d27d367feae196641b5b0d56bf006d7f67
  size 438057586
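Both versions of `pytorch_model.bin` shown here are Git LFS pointer files: the `oid sha256:` line is simply the SHA-256 digest of the real weight file, so a downloaded copy can be verified locally. A sketch (the file path in the comment is hypothetical):

```python
import hashlib

def lfs_oid(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, as recorded in the `oid sha256:` line of a Git LFS pointer."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# Hypothetical check against the new pointer above:
# lfs_oid("pytorch_model.bin") == "116462010c7257bebb94dcae1576e4d27d367feae196641b5b0d56bf006d7f67"
```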
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ab842ae6c73e55691a6fa58cc64c5e96c7ba8c8074a3a9bc4cb9a2251bf29012
  size 3576
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:e339ada0cb016077ae43fbacd80614bba57b4f9d5d2915bee4bddc1956e4108b
  size 3576