segmentation_scores / README.md
codebyzeb's picture
Update README.md
41253a5 verified
|
raw
history blame
5.11 kB
---
title: segmentation_scores
datasets:
- transformersegmentation/CHILDES_EnglishNA
tags:
- evaluate
- metric
language:
- en
description: ' metric for word segmentation scores '
sdk: gradio
sdk_version: 4.38.1
app_file: app.py
pinned: false
---
# Metric Card for Segmentation Scores
## Metric Description
There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, type and token scores are also referred to as *word* and *lexicon* scores, respectively.
For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary f-score* of 0.89 (harmonic mean of precision and recall). The full list of scores would be:
| Score | Value |
|--------------|-----------|
| Boundary Precision | 1.0 |
| Boundary Recall | 0.8 |
| Boundary F-Score | 0.89 |
| Token Precision | 0.8 |
| Token Recall | 0.67 |
| Token F-Score | 0.73 |
| Type Precision | 0.8 |
| Type Recall | 0.8 |
| Type F-Score | 0.8 |
Generally, type scores < token scores < boundary scores. This module also computes boundary scores that include the edge boundary, labeled *boundary_all* with the boundary scores excluding the edge labeled as *boundary_noedge*. If multiple sentences are provided, the measures are computed over all of them (the lexicon is computed over all sentences, rather than per-sentence).
## How to Use
At minimum, this metric requires predictions and references as inputs.
```python
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(references=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"], predictions=["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"])
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```
### Inputs
- **predictions** (`list` of `str`): Predicted segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY".
- **references** (`list` of `str`): Ground truth segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY".
### Output Values
All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and accuracy.
- **boundary_all_precision**(`float`): Boundary precision score, including edge boundaries.
- **boundary_all_recall**(`float`): Boundary recall score, including edge boundaries.
- **boundary_all_fscore**(`float`): Boundary F-score score, including edge boundaries.
- **boundary_noedge_precision**(`float`): Boundary precision score, excluding edge boundaries.
- **boundary_noedge_recall**(`float`): Boundary recall score, excluding edge boundaries.
- **boundary_noedge_fscore**(`float`): Boundary F-score score, excluding edge boundaries.
- **token_precision**(`float`): Token/Word precision score.
- **token_recall**(`float`): Token/Word recall score.
- **token_fscore**(`float`): Token/Word F-score.
- **type_precision**(`float`): Type/Lexicon precision score.
- **type_recall**(`float`): Type/Lexicon recall score.
- **type_fscore**(`float`): Type/Lexicon F-score score.
<!--
#### Values from Popular Papers
*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
### Examples
*Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
## Limitations and Bias
*Note any known limitations or biases that the metric has, with links and references if possible.*
## Citation
*Cite the source where this metric was introduced.*
## Further References
*Add any useful further references.* -->