thonyyy committed
Commit 050293a
1 Parent(s): 52cd7e8

Update README.md

Files changed (1): README.md (+61 -12)
README.md CHANGED
@@ -6,33 +6,79 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information Keras had access to. You should
- probably proofread and complete it, then remove this comment. -->

  # pegasus-indonesian-base_finetune

- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
  It achieves the following results on the evaluation set:
  - Train Loss: 1.6196
  - Train Accuracy: 0.1079
  - Validation Loss: 1.4097
  - Validation Accuracy: 0.1153
  - Train Lr: 0.00013661868
- - Epoch: 1

- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed

  ## Training and evaluation data
-
- More information needed

  ## Training procedure

  ### Training hyperparameters
@@ -44,8 +90,8 @@ The following hyperparameters were used during training:

  | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Train Lr | Epoch |
  |:----------:|:--------------:|:---------------:|:-------------------:|:-------------:|:-----:|
- | 2.3484 | 0.0859 | 1.6304 | 0.1080 | 0.00013661868 | 0 |
- | 1.6196 | 0.1079 | 1.4097 | 0.1153 | 0.00013661868 | 1 |

  ### Framework versions
@@ -54,3 +100,6 @@ The following hyperparameters were used during training:
  - TensorFlow 2.12.0
  - Datasets 2.13.1
  - Tokenizers 0.13.3
  results: []
  ---

+

  # pegasus-indonesian-base_finetune

+ GitHub: [PegasusAnthony](https://github.com/nicholaswilven/PEGASUSAnthony/tree/master)
+
+ This model is a fine-tuned version of [pegasus-indonesian-base_pretrained](https://huggingface.co/thonyyy/pegasus-indonesian-base_pretrained) on [Indosum](https://paperswithcode.com/dataset/indosum), [Liputan6](https://paperswithcode.com/dataset/liputan6), and [XLSum](https://huggingface.co/datasets/csebuetnlp/xlsum).
  It achieves the following results on the evaluation set:
  - Train Loss: 1.6196
  - Train Accuracy: 0.1079
  - Validation Loss: 1.4097
  - Validation Accuracy: 0.1153
  - Train Lr: 0.00013661868
+ - Epoch: 2

+ ## Intended uses & limitations

+ This model is uncased, cannot handle special characters other than "," and ".", has a hard time understanding numbers, and its performance has only been tested on news article text.

+ ## Performance

+ | dataset | rouge-1 | rouge-2 | rouge-L |
+ | ---- | ---- | ---- | ---- |
+ | Indosum | (TBA) | - | - |
+ | Liputan6 | (TBA) | - | - |
+ | XLSum | (TBA) | - | - |

  ## Training and evaluation data
+ Finetune datasets:
+ 1. [Indosum](https://paperswithcode.com/dataset/indosum)
+ 2. [Liputan6](https://paperswithcode.com/dataset/liputan6)
+ 3. [XLSum](https://huggingface.co/datasets/csebuetnlp/xlsum)
+
+ ## Usage
+
+ ```python
+ # Load model and tokenizer
+ from transformers import TFPegasusForConditionalGeneration, PegasusTokenizerFast
+ model_name = "thonyyy/pegasus-indonesian-base_finetune"
+ model = TFPegasusForConditionalGeneration.from_pretrained(model_name)
+ tokenizer = PegasusTokenizerFast.from_pretrained(model_name)
+
+ # Main function to clean text: removes links, bullet points, non-ASCII characters,
+ # parenthesized content, punctuation except "," and ".", dots inside numbers and
+ # enumerations, extra whitespace, and sentences that are too short.
+ import re
+ import unicodedata
+
+ def text_cleaning(input_string):
+     lowercase = input_string.lower()
+     remove_link = re.sub(r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)', '', lowercase).replace("&amp;", "&")
+     remove_bullet = "\n".join([T for T in remove_link.split('\n') if '•' not in T and "baca juga:" not in T])
+     remove_accented = unicodedata.normalize('NFKD', remove_bullet).encode('ascii', 'ignore').decode('utf-8', 'ignore')
+     remove_parentheses = re.sub(r"([\(\|]).*?([\)\|])", r"\g<1>\g<2>", remove_accented)
+     remove_punc = re.sub(r"[^\w\d.,\s]+", ' ', remove_parentheses)
+     remove_num_dot = re.sub(r"(?<=\d)\.|\.(?=\d)|(?<=#)\.", "", remove_punc)
+     remove_extra_whitespace = re.sub(r'^\s*|\s\s*', ' ', remove_num_dot).strip()
+     return ".".join([s for s in remove_extra_whitespace.strip().split('.') if len(s.strip()) > 10]).replace("_", "")
+
+ # Article to summarize
+ sample_article = """
+ Dana Moneter Internasional (IMF) menilai Indonesia telah menunjukkan pemulihan ekonomi yang baik pasca pandemi melalui kinerja makroekonomi yang kuat, didukung penerapan kebijakan moneter dan fiskal secara berhati-hati. Kebijakan forward looking dan sinergi telah berhasil membawa Indonesia menghadapi tantangan global pada tahun 2022 dengan pertumbuhan yang sehat, tekanan inflasi yang menurun, dan sistem keuangan yang stabil. Bank Indonesia menyambut baik hasil asesmen IMF atas perekonomian Indonesia dalam laporan Article IV Consultation tahun 2023 yang dirilis hari ini (26/6).
+ Dewan Direktur IMF menyampaikan apresiasi dan catatan positif terhadap berbagai kebijakan yang ditempuh otoritas Indonesia selama tahun 2022. Pertama, keberhasilan otoritas untuk kembali kepada batas maksimal defisit fiskal 3%, lebih cepat dari yang diperkirakan dan komitmen otoritas untuk menerapkan disiplin fiskal. Kedua, penerapan kebijakan moneter yang memadai untuk menjaga stabilitas harga. Ketiga, ketahanan sektor keuangan yang tetap terjaga. Keempat, penerapan UU Cipta Kerja serta UU Pengembangan dan Penguatan Sektor Keuangan, dengan memastikan implementasi yang tepat dan keberlanjutan momentum reformasi untuk mendorong kemudahan berinvestasi, meningkatkan pendalaman pasar keuangan, dan memitigasi dampak scarring dari pandemi. Kelima, strategi diversifikasi Indonesia yang fokus pada upaya hilirisasi dalam rangka meningkatkan nilai tambah ekspor. Keenam, komitmen otoritas untuk mengurangi emisi gas rumah kaca dan deforestasi.
+ """
+
+ # Generate summary
+ x = tokenizer(text_cleaning(sample_article), return_tensors='tf')
+ y = model.generate(**x)
+ summary = tokenizer.batch_decode(y, skip_special_tokens=True)
+ print(summary)
+ ```
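As a quick sanity check of the cleaning rules above, the punctuation step can be exercised on its own. A minimal sketch (the helper name and sample string are illustrative, not part of the model card; the pattern here keeps "," as well as ".", matching the behaviour the card describes):

```python
import re

def strip_punct_keep_comma_period(text):
    # Keep word characters, digits, ".", ",", and whitespace; drop other punctuation
    no_punc = re.sub(r"[^\w\d.,\s]+", ' ', text.lower())
    # Collapse runs of whitespace, mirroring the final cleaning step
    return re.sub(r'\s+', ' ', no_punc).strip()

print(strip_punct_keep_comma_period("Halo! Ini contoh: teks, dengan angka 1.5 dan simbol #hashtag."))
# → "halo ini contoh teks, dengan angka 1.5 dan simbol hashtag."
```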

  ## Training procedure
+ For replication, see the GitHub page linked above.

  ### Training hyperparameters

  | Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Train Lr | Epoch |
  |:----------:|:--------------:|:---------------:|:-------------------:|:-------------:|:-----:|
+ | 2.3484 | 0.0859 | 1.6304 | 0.1080 | 0.00013661868 | 1 |
+ | 1.6196 | 0.1079 | 1.4097 | 0.1153 | 0.00013661868 | 2 |


  ### Framework versions

  - TensorFlow 2.12.0
  - Datasets 2.13.1
  - Tokenizers 0.13.3
+
+ ### Special Thanks
+ Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC)
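Once generated summaries are available, the (TBA) cells in the Performance table above can be filled in. A minimal ROUGE-1 F1 sketch with no extra dependencies (the reference and prediction strings below are made up for illustration; published benchmark scores are normally computed with the dedicated `rouge-score` package, so results may differ slightly):

```python
from collections import Counter

def rouge1_f1(reference: str, prediction: str) -> float:
    # Unigram overlap between reference and prediction token counts
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative strings, not actual model output
ref = "bank indonesia menyambut baik hasil asesmen imf"
pred = "bank indonesia menyambut hasil imf"
print(round(rouge1_f1(ref, pred), 4))  # → 0.8333
```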