Spaces:
Runtime error
Runtime error
title: segmentation_scores | |
datasets: | |
- transformersegmentation/CHILDES_EnglishNA | |
tags: | |
- evaluate | |
- metric | |
language: | |
- en | |
description: ' metric for word segmentation scores ' | |
sdk: gradio | |
sdk_version: 4.38.1 | |
app_file: app.py | |
pinned: false | |
# Metric Card for Segmentation Scores | |
## Metric Description | |
There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, type and token scores are also referred to as *word* and *lexicon* scores, respectively. | |
For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary f-score* of 0.89 (harmonic mean of precision and recall). The full list of scores would be: | |
| Score | Value | | |
|--------------|-----------| | |
| Boundary Precision | 1.0 | | |
| Boundary Recall | 0.8 | | |
| Boundary F-Score | 0.89 | | |
| Token Precision | 0.8 | | |
| Token Recall | 0.67 | | |
| Token F-Score | 0.73 | | |
| Type Precision | 0.8 | | |
| Type Recall | 0.8 | | |
| Type F-Score | 0.8 | | |
Generally, type scores < token scores < boundary scores. This module also computes boundary scores that include the edge boundary, labeled *boundary_all* with the boundary scores excluding the edge labeled as *boundary_noedge*. If multiple sentences are provided, the measures are computed over all of them (the lexicon is computed over all sentences, rather than per-sentence). | |
## How to Use | |
At minimum, this metric requires predictions and references as inputs. | |
```python | |
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores") | |
>>> results = segmentation_scores.compute(references=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"], predictions=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"]) | |
>>> print(results) | |
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0} | |
``` | |
### Inputs | |
- **predictions** (`list` of `str`): Predicted segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY". | |
- **references** (`list` of `str`): Ground truth segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY". | |
### Output Values | |
All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and accuracy. | |
- **boundary_all_precision**(`float`): Boundary precision score, including edge boundaries. | |
- **boundary_all_recall**(`float`): Boundary recall score, including edge boundaries. | |
- **boundary_all_fscore**(`float`): Boundary F-score score, including edge boundaries. | |
- **boundary_noedge_precision**(`float`): Boundary precision score, excluding edge boundaries. | |
- **boundary_noedge_recall**(`float`): Boundary recall score, excluding edge boundaries. | |
- **boundary_noedge_fscore**(`float`): Boundary F-score score, excluding edge boundaries. | |
- **token_precision**(`float`): Token/Word precision score. | |
- **token_recall**(`float`): Token/Word recall score. | |
- **token_fscore**(`float`): Token/Word F-score. | |
- **type_precision**(`float`): Type/Lexicon precision score. | |
- **type_recall**(`float`): Type/Lexicon recall score. | |
- **type_fscore**(`float`): Type/Lexicon F-score score. | |
<!-- | |
#### Values from Popular Papers | |
*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.* | |
### Examples | |
*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.* | |
## Limitations and Bias | |
*Note any known limitations or biases that the metric has, with links and references if possible.* | |
## Citation | |
*Cite the source where this metric was introduced.* | |
## Further References | |
*Add any useful further references.* --> |