---
title: segmentation_scores
datasets:
- transformersegmentation/CHILDES_EnglishNA
tags:
- evaluate
- metric
language:
- en
description: 'Metric for word segmentation scores'
sdk: gradio
sdk_version: 4.38.1
app_file: app.py
pinned: false
---

# Metric Card for Segmentation Scores

## Metric Description

There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, token and type scores are also referred to as *word* and *lexicon* scores, respectively.

For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation by 1 boundary (the boundary between "the" and "dog" is missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing, "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary F-score* of 0.89 (the harmonic mean of precision and recall). The full list of scores would be:

| Score              | Value |
|--------------------|-------|
| Boundary Precision | 1.0   |
| Boundary Recall    | 0.8   |
| Boundary F-Score   | 0.89  |
| Token Precision    | 0.8   |
| Token Recall       | 0.67  |
| Token F-Score      | 0.73  |
| Type Precision     | 0.8   |
| Type Recall        | 0.8   |
| Type F-Score       | 0.8   |

Generally, type scores < token scores < boundary scores. This module computes two sets of boundary scores: *boundary_all*, which includes the edge boundaries, and *boundary_noedge*, which excludes them. If multiple sentences are provided, the measures are computed over all of them; in particular, the lexicon used for type scores is built over all sentences rather than per sentence.
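To make the definitions concrete, here is a minimal sketch (not the module's actual implementation; helper names are illustrative) that reproduces the table above for the running example:

```python
# Illustrative sketch of the score definitions, not the module's code.
def precision_recall_f(n_correct, n_predicted, n_gold):
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = "the dog is on the boat".split()
pred = "thedog is on the boat".split()

# Boundaries (excluding edges) as character offsets into the unsegmented text.
def boundaries(words):
    offsets, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        offsets.add(pos)
    return offsets

gb, pb = boundaries(gold), boundaries(pred)
print(precision_recall_f(len(gb & pb), len(pb), len(gb)))  # -> (1.0, 0.8, 0.888...)

# A token is correct if its span exactly matches a gold word's span.
def spans(words):
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result

gs, ps = spans(gold), spans(pred)
print(precision_recall_f(len(gs & ps), len(ps), len(gs)))  # -> (0.8, 0.667, 0.727) rounded

# Types compare the sets of distinct words in each segmentation.
gt, pt = set(gold), set(pred)
print(precision_recall_f(len(gt & pt), len(pt), len(gt)))  # -> (0.8, 0.8, 0.8)
```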

## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> import evaluate
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(references=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"], predictions=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"])
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```

### Inputs
- **predictions** (`list` of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with "WORD_BOUNDARY".
- **references** (`list` of `str`): Ground truth segmentations, in the same format (see the formatting sketch below).
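
If your segmentations are stored as plain space-separated words, a small helper (hypothetical, not part of the module) can produce the expected format:

```python
# Hypothetical helper, assuming one symbol per character; for phonemic input
# like the example above, supply a list of phoneme symbols per word instead.
def to_input_format(sentence: str) -> str:
    parts = []
    for word in sentence.split():
        parts.extend(word)  # one entry per character
        parts.append("WORD_BOUNDARY")
    return " ".join(parts)

print(to_input_format("the dog"))
# t h e WORD_BOUNDARY d o g WORD_BOUNDARY
```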

### Output Values

All scores have a minimum possible value of 0 and a maximum possible value of 1.0. Higher scores are better. F-scores are the harmonic mean of precision and recall.

- **boundary_all_precision** (`float`): Boundary precision score, including edge boundaries.
- **boundary_all_recall** (`float`): Boundary recall score, including edge boundaries.
- **boundary_all_fscore** (`float`): Boundary F-score, including edge boundaries.
- **boundary_noedge_precision** (`float`): Boundary precision score, excluding edge boundaries.
- **boundary_noedge_recall** (`float`): Boundary recall score, excluding edge boundaries.
- **boundary_noedge_fscore** (`float`): Boundary F-score, excluding edge boundaries.
- **token_precision** (`float`): Token/word precision score.
- **token_recall** (`float`): Token/word recall score.
- **token_fscore** (`float`): Token/word F-score.
- **type_precision** (`float`): Type/lexicon precision score.
- **type_recall** (`float`): Type/lexicon recall score.
- **type_fscore** (`float`): Type/lexicon F-score.
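
Each `*_fscore` value can be recovered from its precision/recall pair. A one-line sketch of the relationship:

```python
def fscore(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(fscore(1.0, 0.8), 2))  # 0.89, matching the boundary example above
```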
