Update README.md
README.md
CHANGED
@@ -130,20 +130,20 @@ The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours,
# Evaluation

-The model is evaluated on the SAS test set using SacreBLEU, METEOR, BERTScore,

## Metrics
<details>
<summary> Click to expand </summary>
- [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu): SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
- [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore): BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
-- [
- [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor): METEOR is an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
- [SARI](https://huggingface.co/spaces/evaluate-metric/sari): SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system. SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision score for the delete operation; n = 4, as in the original paper.
- [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, an ARI of 10 corresponds to a high school student aged 15-16; an ARI of 3 means students in 3rd grade (ages 8-9) should be able to comprehend the text.
</details>

-Implementations of SacreBLEU, BERT Score,

## Results
@@ -155,9 +155,9 @@ We tested our model on the SAS test set (200 samples). We generate 10 lay summar
|----------------|---------|
| SacreBLEU↑ | 25.60 |
| BERT Score F1↑ | 90.14 |
-|
-|
-|
| METEOR↑ | 43.75 |
| SARI↑ | 51.96 |
| ARI↓ | 17.04 |
# Evaluation

+The model is evaluated on the SAS test set using SacreBLEU, METEOR, BERTScore, ROUGE, SARI, and ARI.

## Metrics
<details>
<summary> Click to expand </summary>
- [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu): SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
- [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore): BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
+- [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)-1/2/L: ROUGE is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
- [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor): METEOR is an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
- [SARI](https://huggingface.co/spaces/evaluate-metric/sari): SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system. SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision score for the delete operation; n = 4, as in the original paper.
- [The Automated Readability Index (ARI)](https://www.readabilityformulas.com/automated-readability-index.php): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, an ARI of 10 corresponds to a high school student aged 15-16; an ARI of 3 means students in 3rd grade (ages 8-9) should be able to comprehend the text.
</details>
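
As a rough illustration of the overlap-based metrics listed above, the sketch below computes a simplified ROUGE-N F1 and combines SARI components using the formula quoted in the list. It assumes lowercased whitespace tokenization and a single reference, and the helper names (`ngram_counts`, `rouge_n_f1`, `sari_from_components`) are hypothetical. It is not the implementation used to produce the reported scores.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between one candidate and one reference.

    Assumption: lowercased whitespace tokens, no stemming.
    """
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sari_from_components(f1_add, f1_keep, p_del):
    """Average the three SARI components: (F1_add + F1_keep + P_del) / 3."""
    return (f1_add + f1_keep + p_del) / 3


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n_f1(generated, reference, n=1))
    print(rouge_n_f1(generated, reference, n=2))
```

ROUGE-L is not covered here: it relies on the longest common subsequence rather than fixed-length n-grams.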

+Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face [`evaluate`](https://pypi.org/project/evaluate/) v0.3.0. ARI is from [`py-readability-metrics`](https://pypi.org/project/py-readability-metrics/) v1.4.5.
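
For readers who want to see what ARI measures without installing a library, here is a minimal sketch of the standard ARI formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name and the naive tokenization (whitespace-separated words, sentences split on `.`, `!`, `?`) are assumptions for illustration. Scores from `py-readability-metrics` may differ because its tokenization is not reproduced here.

```python
import re


def automated_readability_index(text):
    """Approximate ARI using naive tokenization (illustrative only)."""
    # Sentences: non-empty chunks between ., !, or ? characters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: whitespace-separated tokens.
    words = text.split()
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    # Characters: letters and digits only, punctuation excluded.
    characters = sum(ch.isalnum() for ch in text)
    return (
        4.71 * (characters / len(words))
        + 0.5 * (len(words) / len(sentences))
        - 21.43
    )


if __name__ == "__main__":
    print(automated_readability_index("The cat sat. The dog ran."))
```

Very short, simple text can yield a negative value, since the formula is calibrated for longer passages.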

## Results
|----------------|---------|
| SacreBLEU↑ | 25.60 |
| BERT Score F1↑ | 90.14 |
+| ROUGE-1↑ | 52.28 |
+| ROUGE-2↑ | 29.61 |
+| ROUGE-L↑ | 38.02 |
| METEOR↑ | 43.75 |
| SARI↑ | 51.96 |
| ARI↓ | 17.04 |