Implement metric
Browse files
- README.md +48 -12
- segmentation_scores.py +258 -25
- tests.py +6 -11
README.md
CHANGED
tags:
- evaluate
- metric
language:
- en
description: "metric for word segmentation scores"
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for Segmentation Scores
## Metric Description

There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate it against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, token and type scores are also referred to as *word* and *lexicon* scores, respectively.

For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary F-score* of 0.89 (the harmonic mean of precision and recall). The full list of scores would be:

| Score              | Value |
|--------------------|-------|
| Boundary Precision | 1.0   |
| Boundary Recall    | 0.8   |
| Boundary F-Score   | 0.89  |
| Token Precision    | 0.8   |
| Token Recall       | 0.67  |
| Token F-Score      | 0.73  |
| Type Precision     | 0.8   |
| Type Recall        | 0.8   |
| Type F-Score       | 0.8   |

Generally, type scores < token scores < boundary scores. This module also computes boundary scores that include the edge boundaries, labeled *boundary_all*, with the boundary scores excluding the edges labeled *boundary_noedge*. If multiple sentences are provided, the measures are computed over all of them (in particular, the lexicon is computed over all sentences, rather than per sentence).
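To make the arithmetic concrete, the sketch below (not part of the module itself) reproduces the boundary and token rows of the table from raw counts; the `precision_recall_f` helper is hypothetical and simply spells out the three formulas.

```python
def precision_recall_f(n_correct, n_predicted, n_gold):
    """Standard precision, recall and F-score from raw counts (illustrative helper)."""
    precision = n_correct / n_predicted
    recall = n_correct / n_gold
    fscore = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, fscore

# Boundary scores for "thedog is on the boat" vs. "the dog is on the boat":
# 4 predicted boundaries, all correct, out of 5 gold boundaries.
print(precision_recall_f(4, 4, 5))  # (1.0, 0.8, 0.888...)

# Token scores: 4 of the 5 predicted tokens are correct, out of 6 gold tokens.
print(precision_recall_f(4, 5, 6))  # (0.8, 0.666..., 0.727...)
```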
## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```
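For an imperfect segmentation, the same call returns scores below 1.0. As a sketch, here is the worked example from the description above written in the expected character-plus-";eword" format; the rounded values are worked out by hand from the scoring code rather than taken from a run, so treat them as illustrative.

```python
>>> predictions = ["t h e d o g ;eword i s ;eword o n ;eword t h e ;eword b o a t ;eword"]
>>> references = ["t h e ;eword d o g ;eword i s ;eword o n ;eword t h e ;eword b o a t ;eword"]
>>> results = segmentation_scores.compute(predictions=predictions, references=references)
>>> round(results["boundary_noedge_fscore"], 2), round(results["token_fscore"], 2), round(results["type_fscore"], 2)
(0.89, 0.73, 0.8)
```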
### Inputs

- **predictions** (`list` of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with ";eword".
- **references** (`list` of `str`): Ground-truth segmentations, with characters separated by spaces and word boundaries marked with ";eword".

### Output Values

All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.

- **boundary_all_precision** (`float`): Boundary precision score, including edge boundaries.
- **boundary_all_recall** (`float`): Boundary recall score, including edge boundaries.
- **boundary_all_fscore** (`float`): Boundary F-score, including edge boundaries.
- **boundary_noedge_precision** (`float`): Boundary precision score, excluding edge boundaries.
- **boundary_noedge_recall** (`float`): Boundary recall score, excluding edge boundaries.
- **boundary_noedge_fscore** (`float`): Boundary F-score, excluding edge boundaries.
- **token_precision** (`float`): Token/word precision score.
- **token_recall** (`float`): Token/word recall score.
- **token_fscore** (`float`): Token/word F-score.
- **type_precision** (`float`): Type/lexicon precision score.
- **type_recall** (`float`): Type/lexicon recall score.
- **type_fscore** (`float`): Type/lexicon F-score.
<!--
#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

...

*Cite the source where this metric was introduced.*

## Further References
*Add any useful further references.* -->
segmentation_scores.py
CHANGED
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Segmentation scores evaluation metrics"""

import evaluate
import datasets

# ...

# TODO: Add description of the module here
_DESCRIPTION = """\
This module computes segmentation scores for a list of predicted segmentations and gold segmentations.
"""


# TODO: Add description of the arguments of the module here
_KWARGS_DESCRIPTION = """
Calculates how good the predicted segmentations are, using boundary, token and type scores.
Args:
    predictions: list of segmented utterances to score. Each prediction
        should be a string with phonemes separated by spaces and estimated word boundaries
        denoted by the token ';eword'.
    references: list of segmented utterances to score. Each reference
        should be a string with phonemes separated by spaces and gold word boundaries
        denoted by the token ';eword'.
Returns:
    type_fscore: lexicon f1 score
    type_precision: lexicon precision
    type_recall: lexicon recall
    token_fscore: token f1 score
    token_precision: token precision
    token_recall: token recall
    boundary_all_fscore: boundary f1 score, including utterance boundaries
    boundary_all_precision: boundary precision, including utterance boundaries
    boundary_all_recall: boundary recall, including utterance boundaries
    boundary_noedge_fscore: boundary f1 score, excluding utterance boundaries
    boundary_noedge_precision: boundary precision, excluding utterance boundaries
    boundary_noedge_recall: boundary recall, excluding utterance boundaries
Examples:
    >>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
    >>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
    >>> print(results)
    {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
"""

class TokenEvaluation(object):
    """Evaluation of token f-score, precision and recall"""

    def __init__(self):
        self.test = 0
        self.gold = 0
        self.correct = 0
        self.n = 0
        self.n_exactmatch = 0

    def precision(self):
        return float(self.correct) / self.test if self.test != 0 else None

    def recall(self):
        return float(self.correct) / self.gold if self.gold != 0 else None

    def fscore(self):
        total = self.test + self.gold
        return float(2 * self.correct) / total if total != 0 else None

    def exact_match(self):
        return float(self.n_exactmatch) / self.n if self.n else None

    def update(self, test_set, gold_set):
        self.n += 1

        if test_set == gold_set:
            self.n_exactmatch += 1

        # omit empty items for type scoring (should not affect token
        # scoring). Type lists are prepared with '_' where there is no
        # match, to keep list lengths the same
        self.test += len([x for x in test_set if x != "_"])
        self.gold += len([x for x in gold_set if x != "_"])
        self.correct += len(test_set & gold_set)

    def update_lists(self, test_sets, gold_sets):
        if len(test_sets) != len(gold_sets):
            raise ValueError(
                "#words different in test and gold: {} != {}".format(
                    len(test_sets), len(gold_sets)
                )
            )

        for t, g in zip(test_sets, gold_sets):
            self.update(t, g)


class TypeEvaluation(TokenEvaluation):
    """Evaluation of type f-score, precision and recall"""

    @staticmethod
    def lexicon_check(textlex, goldlex):
        """Compare hypothesis and gold lexicons"""
        textlist = []
        goldlist = []
        for w in textlex:
            if w in goldlex:
                # set up matching lists for the true positives
                textlist.append(w)
                goldlist.append(w)
            else:
                # false positives
                textlist.append(w)
                # ensure matching null element in text list
                goldlist.append("_")

        for w in goldlex:
            if w not in goldlist:
                # now for the false negatives
                goldlist.append(w)
                # ensure matching null element in text list
                textlist.append("_")

        textset = [{w} for w in textlist]
        goldset = [{w} for w in goldlist]
        return textset, goldset

    def update_lists(self, text, gold):
        lt, lg = self.lexicon_check(text, gold)
        super(TypeEvaluation, self).update_lists(lt, lg)

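# Illustration: lexicon_check(["boat", "thedog"], ["boat", "dog"]) pairs the shared word
# "boat" with itself, pads the false positive "thedog" against "_" and the false negative
# "dog" against "_", so update() counts two real entries on each side with one correct
# match (type precision = recall = 0.5 for this pair of lexicons).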

class BoundaryEvaluation(TokenEvaluation):
    @staticmethod
    def get_boundary_positions(stringpos):
        return [{idx for pair in line for idx in pair} for line in stringpos]

    def update_lists(self, text, gold):
        lt = self.get_boundary_positions(text)
        lg = self.get_boundary_positions(gold)
        super(BoundaryEvaluation, self).update_lists(lt, lg)


class BoundaryNoEdgeEvaluation(BoundaryEvaluation):
    @staticmethod
    def get_boundary_positions(stringpos):
        return [{left for left, _ in line if left > 0} for line in stringpos]


class _StringPos(object):
    """Compute start and stop index of words in an utterance"""

    def __init__(self):
        self.idx = 0

    def __call__(self, n):
        """Return the position of the current word given its length `n`"""
        start = self.idx
        self.idx += n
        return start, self.idx


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class segmentation_scores(evaluate.Metric):
    # ...
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features({
                'predictions': datasets.Value('string'),
                'references': datasets.Value('string'),
            }),
            # Homepage of the module for documentation
            homepage="https://huggingface.co/spaces/transformersegmentation/segmentation_scores",
            # Additional links to the codebase or references
            codebase_urls=["http://github.com/codebyzeb/transformersegmentation"],
            reference_urls=["http://path.to.reference.url/new_module"]
        )

    # ...
        # TODO: Download external resources if needed
        pass

    def _process_data(self, text):
        """ Load text data for evaluation
        Parameters
        ----------
        text : list of str
            The list of utterances to read for the evaluation.

        Returns
        -------
        (words, positions, lexicon) : three lists
            where `words` are the input utterances with word separators
            removed, `positions` stores the start/stop index of each word
            for each utterance, and `lexicon` is the list of words.
        """
        words = []
        positions = []
        lexicon = {}

        # ignore empty lines
        for utt in (utt for utt in text if utt.strip()):
            # list of phones in the utterance with the word separator removed
            phone_in_utterance = [
                phone for phone in utt.split(" ") if phone != ";eword"
            ]
            words_in_utterance = (
                "".join(
                    " " if phone == ";eword" else phone for phone in utt.split(" ")
                )
                .strip()
                .split(" ")
            )

            words.append(phone_in_utterance)
            for word in words_in_utterance:
                lexicon[word] = 1
            idx = _StringPos()
            positions.append({idx(len(word)) for word in words_in_utterance})

        # return the words lexicon as a sorted list
        lexicon = sorted([k for k in lexicon.keys()])
        return words, positions, lexicon

def _compute(self, predictions, references):
|
251 |
+
"""Scores a segmented text against its gold version
|
252 |
+
Parameters
|
253 |
+
----------
|
254 |
+
predictions : sequence of str
|
255 |
+
A suite of word utterances, each string using ';eword' as as word separator.
|
256 |
+
references : sequence of str
|
257 |
+
A suite of word utterances, each string using ';eword' as as word separator.
|
258 |
+
|
259 |
+
Returns
|
260 |
+
-------
|
261 |
+
scores : dict
|
262 |
+
A dictionary with the following entries:
|
263 |
+
* 'type_fscore'
|
264 |
+
* 'type_precision'
|
265 |
+
* 'type_recall'
|
266 |
+
* 'token_fscore'
|
267 |
+
* 'token_precision'
|
268 |
+
* 'token_recall'
|
269 |
+
* 'boundary_all_fscore'
|
270 |
+
* 'boundary_all_precision'
|
271 |
+
* 'boundary_all_recall'
|
272 |
+
* 'boundary_noedge_fscore'
|
273 |
+
* 'boundary_noedge_precision'
|
274 |
+
* 'boundary_noedge_recall'
|
275 |
+
|
276 |
+
Raises
|
277 |
+
------
|
278 |
+
ValueError
|
279 |
+
If `gold` and `text` have different size or differ in tokens
|
280 |
+
"""
|
281 |
+
text_words, text_stringpos, text_lex = self._process_data(predictions)
|
282 |
+
gold_words, gold_stringpos, gold_lex = self._process_data(references)
|
283 |
+
|
284 |
+
if len(gold_words) != len(text_words):
|
285 |
+
raise ValueError(
|
286 |
+
"gold and train have different size: len(gold)={}, len(train)={}".format(
|
287 |
+
len(gold_words), len(text_words)
|
288 |
+
)
|
289 |
+
)
|
290 |
+
|
291 |
+
for i, (g, t) in enumerate(zip(gold_words, text_words)):
|
292 |
+
if g != t:
|
293 |
+
raise ValueError(
|
294 |
+
'gold and train differ at line {}: gold="{}", train="{}"'.format(
|
295 |
+
i + 1, g, t
|
296 |
+
)
|
297 |
+
)
|
298 |
+
|
299 |
+
# token evaluation
|
300 |
+
token_eval = TokenEvaluation()
|
301 |
+
token_eval.update_lists(text_stringpos, gold_stringpos)
|
302 |
+
|
303 |
+
# type evaluation
|
304 |
+
type_eval = TypeEvaluation()
|
305 |
+
type_eval.update_lists(text_lex, gold_lex)
|
306 |
+
|
307 |
+
# boundary evaluation (with edges)
|
308 |
+
boundary_eval = BoundaryEvaluation()
|
309 |
+
boundary_eval.update_lists(text_stringpos, gold_stringpos)
|
310 |
+
|
311 |
+
# boundary evaluation (no edges)
|
312 |
+
boundary_noedge_eval = BoundaryNoEdgeEvaluation()
|
313 |
+
boundary_noedge_eval.update_lists(text_stringpos, gold_stringpos)
|
314 |
+
|
315 |
return {
|
316 |
+
"token_precision": token_eval.precision(),
|
317 |
+
"token_recall": token_eval.recall(),
|
318 |
+
"token_fscore": token_eval.fscore(),
|
319 |
+
"type_precision": type_eval.precision(),
|
320 |
+
"type_recall": type_eval.recall(),
|
321 |
+
"type_fscore": type_eval.fscore(),
|
322 |
+
"boundary_all_precision": boundary_eval.precision(),
|
323 |
+
"boundary_all_recall": boundary_eval.recall(),
|
324 |
+
"boundary_all_fscore": boundary_eval.fscore(),
|
325 |
+
"boundary_noedge_precision": boundary_noedge_eval.precision(),
|
326 |
+
"boundary_noedge_recall": boundary_noedge_eval.recall(),
|
327 |
+
"boundary_noedge_fscore": boundary_noedge_eval.fscore(),
|
328 |
}
|
tests.py
CHANGED
test_cases = [
    {
        "predictions": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
        "references": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
        "result": {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
    },
    {
        "predictions": ["thedog is in the boat"],
        "references": ["the dog is in the boat"],
        "result": {'type_fscore': 0.8, 'type_precision': 0.8, 'type_recall': 0.8, 'token_fscore': 0.73, 'token_precision': 0.8, 'token_recall': 0.67, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 0.94, 'boundary_noedge_fscore': 0.89, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 0.8}
    }
]