codebyzeb committed on
Commit
1d59f5a
1 Parent(s): 5febdea

Implement metric

Files changed (3)
  1. README.md +48 -12
  2. segmentation_scores.py +258 -25
  3. tests.py +6 -11
README.md CHANGED
@@ -5,35 +5,71 @@ datasets:
5
  tags:
6
  - evaluate
7
  - metric
8
- description: "TODO: add a description here"
9
  sdk: gradio
10
  sdk_version: 3.19.1
11
  app_file: app.py
12
  pinned: false
13
  ---
14
 
15
- # Metric Card for segmentation_scores
16
-
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
 
19
  ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
  ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
26
 
27
  ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
  ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
36
 
 
37
  #### Values from Popular Papers
38
  *Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
@@ -47,4 +83,4 @@ pinned: false
47
  *Cite the source where this metric was introduced.*
48
 
49
  ## Further References
50
- *Add any useful further references.*
 
5
  tags:
6
  - evaluate
7
  - metric
8
+ language:
9
+ - en
10
+ description: "Metric for word segmentation scores"
11
  sdk: gradio
12
  sdk_version: 3.19.1
13
  app_file: app.py
14
  pinned: false
15
  ---
16
 
17
+ # Metric Card for Segmentation Scores
18
 
19
  ## Metric Description
20
+
21
+ There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, token and type scores are also referred to as *word* and *lexicon* scores, respectively.
22
+
23
+ For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary f-score* of 0.89 (harmonic mean of precision and recall). The full list of scores would be:
24
+
25
+ | Score | Value |
26
+ |--------------|-----------|
27
+ | Boundary Precision | 1.0 |
28
+ | Boundary Recall | 0.8 |
29
+ | Boundary F-Score | 0.89 |
30
+ | Token Precision | 0.8 |
31
+ | Token Recall | 0.67 |
32
+ | Token F-Score | 0.73 |
33
+ | Type Precision | 0.8 |
34
+ | Type Recall | 0.8 |
35
+ | Type F-Score | 0.8 |
36
+
37
+ Generally, type scores < token scores < boundary scores. This module computes two sets of boundary scores: *boundary_all*, which includes the edge (utterance-initial and utterance-final) boundaries, and *boundary_noedge*, which excludes them. If multiple sentences are provided, the scores are computed over all of them (in particular, the lexicon is built over all sentences rather than per sentence).
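+ As an illustration, the short sketch below (independent of this module, with illustrative helper names) reproduces the scores in the table above:
+
+ ```python
+ gold = "the dog is on the boat".split()
+ pred = "thedog is on the boat".split()
+
+ def prf(predicted, reference):
+     # precision, recall and F-score for two sets of items
+     correct = len(predicted & reference)
+     return (correct / len(predicted), correct / len(reference),
+             2 * correct / (len(predicted) + len(reference)))
+
+ def spans(words):
+     # (start, end) character positions of each word token
+     out, idx = set(), 0
+     for w in words:
+         out.add((idx, idx + len(w)))
+         idx += len(w)
+     return out
+
+ def boundaries(words):
+     # internal word boundary positions (edges excluded)
+     return {end for _, end in spans(words)} - {sum(len(w) for w in words)}
+
+ print(prf(boundaries(pred), boundaries(gold)))  # boundary P, R, F ≈ 1.0, 0.8, 0.89
+ print(prf(spans(pred), spans(gold)))            # token P, R, F ≈ 0.8, 0.67, 0.73
+ print(prf(set(pred), set(gold)))                # type P, R, F ≈ 0.8, 0.8, 0.8
+ ```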
38
 
39
  ## How to Use
 
40
 
41
+ At minimum, this metric requires predictions and references as inputs.
42
+
43
+ ```python
44
+ >>> import evaluate
+ >>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
45
+ >>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
46
+ >>> print(results)
47
+ {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
48
+
49
+ ```
50
 
51
  ### Inputs
52
+ - **predictions** (`list` of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with ";eword".
53
+ - **references** (`list` of `str`): Ground truth segmentations, with characters separated by spaces and word boundaries marked with ";eword".
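+
+ Note that inputs are strings in this format, not lists of words. If your segmentations are stored as lists of words, one simple way to produce the expected format is sketched below (the helper name is illustrative, not part of this module):
+
+ ```python
+ def to_segmented_string(words):
+     # ["the", "dog"] -> "t h e ;eword d o g ;eword"
+     return " ".join(" ".join(word) + " ;eword" for word in words)
+ ```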
54
 
55
  ### Output Values
56
 
57
+ All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.
58
 
59
+ - **boundary_all_precision** (`float`): Boundary precision score, including edge boundaries.
60
+ - **boundary_all_recall** (`float`): Boundary recall score, including edge boundaries.
61
+ - **boundary_all_fscore** (`float`): Boundary F-score, including edge boundaries.
62
+ - **boundary_noedge_precision** (`float`): Boundary precision score, excluding edge boundaries.
63
+ - **boundary_noedge_recall** (`float`): Boundary recall score, excluding edge boundaries.
64
+ - **boundary_noedge_fscore** (`float`): Boundary F-score, excluding edge boundaries.
65
+ - **token_precision** (`float`): Token/Word precision score.
66
+ - **token_recall** (`float`): Token/Word recall score.
67
+ - **token_fscore** (`float`): Token/Word F-score.
68
+ - **type_precision** (`float`): Type/Lexicon precision score.
69
+ - **type_recall** (`float`): Type/Lexicon recall score.
70
+ - **type_fscore** (`float`): Type/Lexicon F-score.
71
 
72
+ <!--
73
  #### Values from Popular Papers
74
  *Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
75
 
 
83
  *Cite the source where this metric was introduced.*
84
 
85
  ## Further References
86
+ *Add any useful further references.* -->
segmentation_scores.py CHANGED
@@ -11,7 +11,7 @@
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
- """TODO: Add a description here."""
15
 
16
  import evaluate
17
  import datasets
@@ -28,33 +28,152 @@ year={2020}
28
 
29
  # TODO: Add description of the module here
30
  _DESCRIPTION = """\
31
- This new module is designed to solve this great ML task and is crafted with a lot of care.
32
  """
33
 
34
 
35
  # TODO: Add description of the arguments of the module here
36
  _KWARGS_DESCRIPTION = """
37
- Calculates how good are predictions given some references, using certain scores
38
  Args:
39
- predictions: list of predictions to score. Each predictions
40
- should be a string with tokens separated by spaces.
41
- references: list of reference for each prediction. Each
42
- reference should be a string with tokens separated by spaces.
43
  Returns:
44
- accuracy: description of the first score,
45
- another_score: description of the second score,
46
  Examples:
47
- Examples should be written in doctest format, and should illustrate how
48
- to use the function.
49
-
50
- >>> my_new_module = evaluate.load("my_new_module")
51
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
52
  >>> print(results)
53
- {'accuracy': 1.0}
54
  """
55
 
56
- # TODO: Define external resources urls if needed
57
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
58
 
59
 
60
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
@@ -71,13 +190,13 @@ class segmentation_scores(evaluate.Metric):
71
  inputs_description=_KWARGS_DESCRIPTION,
72
  # This defines the format of each prediction and reference
73
  features=datasets.Features({
74
- 'predictions': datasets.Value('int64'),
75
- 'references': datasets.Value('int64'),
76
  }),
77
  # Homepage of the module for documentation
78
- homepage="http://module.homepage",
79
  # Additional links to the codebase or references
80
- codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
81
  reference_urls=["http://path.to.reference.url/new_module"]
82
  )
83
 
@@ -86,10 +205,124 @@ class segmentation_scores(evaluate.Metric):
86
  # TODO: Download external resources if needed
87
  pass
88
 
89
  def _compute(self, predictions, references):
90
- """Returns the scores"""
91
- # TODO: Compute the different scores of the module
92
- accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
93
  return {
94
- "accuracy": accuracy,
95
  }
 
11
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
  # See the License for the specific language governing permissions and
13
  # limitations under the License.
14
+ """ Segmentation scores evaluation metrics"""
15
 
16
  import evaluate
17
  import datasets
 
28
 
29
  # TODO: Add description of the module here
30
  _DESCRIPTION = """\
31
+ This module computes segmentation scores for a list of predicted segmentations and gold segmentations.
32
  """
33
 
34
 
35
  # TODO: Add description of the arguments of the module here
36
  _KWARGS_DESCRIPTION = """
37
+ Calculates how good predicted segmentations are, using boundary, token and type scores.
38
  Args:
39
+ predictions: list of segmented utterances to score. Each prediction
40
+ should be a string with phonemes separated by spaces and estimated word boundaries
41
+ denoted by the token ';eword'.
42
+ references: list of gold segmented utterances. Each reference
43
+ should be a string with phonemes separated by spaces and gold word boundaries
44
+ denoted by the token ';eword'.
45
  Returns:
46
+ type_fscore: lexicon f1 score
47
+ type_precision: lexicon precision
48
+ type_recall: lexicon recall
49
+ token_fscore: token f1 score
50
+ token_precision: token precision
51
+ token_recall: token recall
52
+ boundary_all_fscore: boundary f1 score, including utterance boundaries
53
+ boundary_all_precision: boundary precision, including utterance boundaries
54
+ boundary_all_recall: boundary recall, including utterance boundaries
55
+ boundary_noedge_fscore: boundary f1 score, excluding utterance boundaries
56
+ boundary_noedge_precision: boundary precision, excluding utterance boundaries
57
+ boundary_noedge_recall: boundary recall, excluding utterance boundaries
58
  Examples:
59
+ >>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
60
+ >>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
61
  >>> print(results)
62
+ {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
63
  """
64
 
65
+ class TokenEvaluation(object):
66
+ """Evaluation of token f-score, precision and recall"""
67
+
68
+ def __init__(self):
69
+ self.test = 0
70
+ self.gold = 0
71
+ self.correct = 0
72
+ self.n = 0
73
+ self.n_exactmatch = 0
74
+
75
+ def precision(self):
76
+ return float(self.correct) / self.test if self.test != 0 else None
77
+
78
+ def recall(self):
79
+ return float(self.correct) / self.gold if self.gold != 0 else None
80
+
81
+ def fscore(self):
82
+ total = self.test + self.gold
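+ # 2 * correct / (test + gold) equals the harmonic mean of precision (correct/test) and recall (correct/gold)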
83
+ return float(2 * self.correct) / total if total != 0 else None
84
+
85
+ def exact_match(self):
86
+ return float(self.n_exactmatch) / self.n if self.n else None
87
+
88
+ def update(self, test_set, gold_set):
89
+ self.n += 1
90
+
91
+ if test_set == gold_set:
92
+ self.n_exactmatch += 1
93
+
94
+ # omit empty items for type scoring (should not affect token
95
+ # scoring). Type lists are prepared with '_' where there is no
96
+ # match, to keep list lengths the same
97
+ self.test += len([x for x in test_set if x != "_"])
98
+ self.gold += len([x for x in gold_set if x != "_"])
99
+ self.correct += len(test_set & gold_set)
100
+
101
+ def update_lists(self, test_sets, gold_sets):
102
+ if len(test_sets) != len(gold_sets):
103
+ raise ValueError(
104
+ "#words different in test and gold: {} != {}".format(
105
+ len(test_sets), len(gold_sets)
106
+ )
107
+ )
108
+
109
+ for t, g in zip(test_sets, gold_sets):
110
+ self.update(t, g)
111
+
112
+
113
+ class TypeEvaluation(TokenEvaluation):
114
+ """Evaluation of type f-score, precision and recall"""
115
+
116
+ @staticmethod
117
+ def lexicon_check(textlex, goldlex):
118
+ """Compare hypothesis and gold lexicons"""
119
+ textlist = []
120
+ goldlist = []
121
+ for w in textlex:
122
+ if w in goldlex:
123
+ # set up matching lists for the true positives
124
+ textlist.append(w)
125
+ goldlist.append(w)
126
+ else:
127
+ # false positives
128
+ textlist.append(w)
129
+ # ensure matching null element in text list
130
+ goldlist.append("_")
131
+
132
+ for w in goldlex:
133
+ if w not in goldlist:
134
+ # now for the false negatives
135
+ goldlist.append(w)
136
+ # ensure matching null element in text list
137
+ textlist.append("_")
138
+
139
+ textset = [{w} for w in textlist]
140
+ goldset = [{w} for w in goldlist]
141
+ return textset, goldset
142
+
143
+ def update_lists(self, text, gold):
144
+ lt, lg = self.lexicon_check(text, gold)
145
+ super(TypeEvaluation, self).update_lists(lt, lg)
146
+
147
+
148
+ class BoundaryEvaluation(TokenEvaluation):
149
+ @staticmethod
150
+ def get_boundary_positions(stringpos):
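+ # e.g. word spans [{(0, 2), (2, 5)}] become boundary position sets [{0, 2, 5}]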
151
+ return [{idx for pair in line for idx in pair} for line in stringpos]
152
+
153
+ def update_lists(self, text, gold):
154
+ lt = self.get_boundary_positions(text)
155
+ lg = self.get_boundary_positions(gold)
156
+ super(BoundaryEvaluation, self).update_lists(lt, lg)
157
+
158
+
159
+ class BoundaryNoEdgeEvaluation(BoundaryEvaluation):
160
+ @staticmethod
161
+ def get_boundary_positions(stringpos):
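+ # e.g. word spans [{(0, 2), (2, 5)}] become [{2}], dropping the utterance-initial and utterance-final edges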
162
+ return [{left for left, _ in line if left > 0} for line in stringpos]
163
+
164
+
165
+ class _StringPos(object):
166
+ """Compute start and stop index of words in an utterance"""
167
+
168
+ def __init__(self):
169
+ self.idx = 0
170
+
171
+ def __call__(self, n):
172
+ """Return the position of the current word given its length `n`"""
173
+ start = self.idx
174
+ self.idx += n
175
+ return start, self.idx
176
+
177
 
178
 
179
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 
190
  inputs_description=_KWARGS_DESCRIPTION,
191
  # This defines the format of each prediction and reference
192
  features=datasets.Features({
193
+ 'predictions': datasets.Value('string'),
194
+ 'references': datasets.Value('string'),
195
  }),
196
  # Homepage of the module for documentation
197
+ homepage="https://huggingface.co/spaces/transformersegmentation/segmentation_scores",
198
  # Additional links to the codebase or references
199
+ codebase_urls=["http://github.com/codebyzeb/transformersegmentation"],
200
  reference_urls=["http://path.to.reference.url/new_module"]
201
  )
202
 
 
205
  # TODO: Download external resources if needed
206
  pass
207
 
208
+ def _process_data(self, text):
209
+ """ Load text data for evaluation
210
+ Parameters
211
+ ----------
212
+ text : list of str
213
+ The list of utterances to read for the evaluation.
214
+
215
+ Returns
216
+ -------
217
+ (words, positions, lexicon) : three lists
218
+ where `words` are the input utterances with word separators
219
+ removed, `positions` stores the start/stop index of each word
220
+ for each utterance, and `lexicon` is the list of words.
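+ For example, ['a b ;eword c ;eword'] gives words=[['a', 'b', 'c']],
+ positions=[{(0, 2), (2, 3)}] and lexicon=['ab', 'c'].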
221
+ """
222
+ words = []
223
+ positions = []
224
+ lexicon = {}
225
+
226
+ # ignore empty lines
227
+ for utt in (utt for utt in text if utt.strip()):
228
+ # list of phones in the utterance with word separator removed
229
+ phone_in_utterance = [
230
+ phone for phone in utt.split(" ") if phone != ";eword"
231
+ ]
232
+ words_in_utterance = (
233
+ "".join(
234
+ " " if phone == ";eword" else phone for phone in utt.split(" ")
235
+ )
236
+ .strip()
237
+ .split(" ")
238
+ )
239
+
240
+ words.append(phone_in_utterance)
241
+ for word in words_in_utterance:
242
+ lexicon[word] = 1
243
+ idx = _StringPos()
244
+ positions.append({idx(len(word)) for word in words_in_utterance})
245
+
246
+ # return the words lexicon as a sorted list
247
+ lexicon = sorted([k for k in lexicon.keys()])
248
+ return words, positions, lexicon
249
+
250
  def _compute(self, predictions, references):
251
+ """Scores a segmented text against its gold version
252
+ Parameters
253
+ ----------
254
+ predictions : sequence of str
255
+ The predicted segmentations, one utterance per string, using ';eword' as the word separator.
256
+ references : sequence of str
257
+ The gold segmentations, one utterance per string, using ';eword' as the word separator.
258
+
259
+ Returns
260
+ -------
261
+ scores : dict
262
+ A dictionary with the following entries:
263
+ * 'type_fscore'
264
+ * 'type_precision'
265
+ * 'type_recall'
266
+ * 'token_fscore'
267
+ * 'token_precision'
268
+ * 'token_recall'
269
+ * 'boundary_all_fscore'
270
+ * 'boundary_all_precision'
271
+ * 'boundary_all_recall'
272
+ * 'boundary_noedge_fscore'
273
+ * 'boundary_noedge_precision'
274
+ * 'boundary_noedge_recall'
275
+
276
+ Raises
277
+ ------
278
+ ValueError
279
+ If `predictions` and `references` have different sizes or differ in phonemes
280
+ """
281
+ text_words, text_stringpos, text_lex = self._process_data(predictions)
282
+ gold_words, gold_stringpos, gold_lex = self._process_data(references)
283
+
284
+ if len(gold_words) != len(text_words):
285
+ raise ValueError(
286
+ "gold and train have different size: len(gold)={}, len(train)={}".format(
287
+ len(gold_words), len(text_words)
288
+ )
289
+ )
290
+
291
+ for i, (g, t) in enumerate(zip(gold_words, text_words)):
292
+ if g != t:
293
+ raise ValueError(
294
+ 'gold and train differ at line {}: gold="{}", train="{}"'.format(
295
+ i + 1, g, t
296
+ )
297
+ )
298
+
299
+ # token evaluation
300
+ token_eval = TokenEvaluation()
301
+ token_eval.update_lists(text_stringpos, gold_stringpos)
302
+
303
+ # type evaluation
304
+ type_eval = TypeEvaluation()
305
+ type_eval.update_lists(text_lex, gold_lex)
306
+
307
+ # boundary evaluation (with edges)
308
+ boundary_eval = BoundaryEvaluation()
309
+ boundary_eval.update_lists(text_stringpos, gold_stringpos)
310
+
311
+ # boundary evaluation (no edges)
312
+ boundary_noedge_eval = BoundaryNoEdgeEvaluation()
313
+ boundary_noedge_eval.update_lists(text_stringpos, gold_stringpos)
314
+
315
  return {
316
+ "token_precision": token_eval.precision(),
317
+ "token_recall": token_eval.recall(),
318
+ "token_fscore": token_eval.fscore(),
319
+ "type_precision": type_eval.precision(),
320
+ "type_recall": type_eval.recall(),
321
+ "type_fscore": type_eval.fscore(),
322
+ "boundary_all_precision": boundary_eval.precision(),
323
+ "boundary_all_recall": boundary_eval.recall(),
324
+ "boundary_all_fscore": boundary_eval.fscore(),
325
+ "boundary_noedge_precision": boundary_noedge_eval.precision(),
326
+ "boundary_noedge_recall": boundary_noedge_eval.recall(),
327
+ "boundary_noedge_fscore": boundary_noedge_eval.fscore(),
328
  }
tests.py CHANGED
@@ -1,17 +1,12 @@
1
  test_cases = [
2
  {
3
- "predictions": [0, 0],
4
- "references": [1, 1],
5
- "result": {"metric_score": 0}
6
  },
7
  {
8
- "predictions": [1, 1],
9
- "references": [1, 1],
10
- "result": {"metric_score": 1}
11
- },
12
- {
13
- "predictions": [1, 0],
14
- "references": [1, 1],
15
- "result": {"metric_score": 0.5}
16
  }
17
  ]
 
1
  test_cases = [
2
  {
3
+ "predictions": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
4
+ "references": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
5
+ "result": {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
6
  },
7
  {
8
+ "predictions": ["t h e d o g ;eword i s ;eword i n ;eword t h e ;eword b o a t ;eword"],
9
+ "references": ["t h e ;eword d o g ;eword i s ;eword i n ;eword t h e ;eword b o a t ;eword"],
10
+ "result": {'type_fscore': 0.8, 'type_precision': 0.8, 'type_recall': 0.8, 'token_fscore': 0.73, 'token_precision': 0.8, 'token_recall': 0.67, 'boundary_all_fscore': 0.92, 'boundary_all_precision': 1.0, 'boundary_all_recall': 0.86, 'boundary_noedge_fscore': 0.89, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 0.8}
11
  }
12
  ]