segmentation_scores / README.md

codebyzeb

Update README.md

41253a5 verified 4 months ago

	---
	title: segmentation_scores
	datasets:
	- transformersegmentation/CHILDES_EnglishNA
	tags:
	- evaluate
	- metric
	language:
	- en
	description: ' metric for word segmentation scores '
	sdk: gradio
	sdk_version: 4.38.1
	app_file: app.py
	pinned: false
	---

	# Metric Card for Segmentation Scores

	## Metric Description

	There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate it against a gold standard according to the placement of the boundaries, the set of word tokens produced, and the set of word types produced. For each of these, we can compute precision, recall and F-score. In the literature, token and type scores are also referred to as word and lexicon scores, respectively.

	For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation by 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we would have a boundary precision of 1.0 (no incorrect boundaries), a boundary recall of 0.8 (4 boundaries hit out of 5) and a boundary F-score of 0.89 (the harmonic mean of precision and recall). The full list of scores would be:

	| Score | Value |
	|--------------------|-------|
	| Boundary Precision | 1.0 |
	| Boundary Recall | 0.8 |
	| Boundary F-Score | 0.89 |
	| Token Precision | 0.8 |
	| Token Recall | 0.67 |
	| Token F-Score | 0.73 |
	| Type Precision | 0.8 |
	| Type Recall | 0.8 |
	| Type F-Score | 0.8 |
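The values in the table can be reproduced with a short, self-contained sketch. This is not the module's implementation: the helper names `spans`, `prf` and `score` are hypothetical, the inputs are plain space-separated strings rather than the `WORD_BOUNDARY` format, and the sketch assumes the gold and predicted strings contain the same characters in the same order.

```python
def spans(sentence):
    """Return (start, end) character offsets of each word, ignoring spaces."""
    out, pos = [], 0
    for word in sentence.split():
        out.append((pos, pos + len(word)))
        pos += len(word)
    return out


def prf(n_correct, n_predicted, n_gold):
    """Precision, recall and F-score (harmonic mean) from raw counts."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


def score(gold, predicted):
    gold_spans, pred_spans = spans(gold), spans(predicted)
    # Internal boundaries: the end offset of every word except the last.
    gold_bounds = {end for _, end in gold_spans[:-1]}
    pred_bounds = {end for _, end in pred_spans[:-1]}
    # Types: the set of distinct word forms.
    gold_types, pred_types = set(gold.split()), set(predicted.split())
    return {
        "boundary": prf(len(gold_bounds & pred_bounds), len(pred_bounds), len(gold_bounds)),
        # A token counts as correct only if both of its boundaries match.
        "token": prf(len(set(gold_spans) & set(pred_spans)), len(pred_spans), len(gold_spans)),
        "type": prf(len(gold_types & pred_types), len(pred_types), len(gold_types)),
    }


scores = score("the dog is on the boat", "thedog is on the boat")
for level, values in scores.items():
    print(level, [round(v, 2) for v in values])
# boundary [1.0, 0.8, 0.89]
# token [0.8, 0.67, 0.73]
# type [0.8, 0.8, 0.8]
```

Each triple is (precision, recall, F-score), in the same order as the rows of the table.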

	Generally, type scores are lower than token scores, which in turn are lower than boundary scores, although small examples such as the one above can deviate from this ordering. This module computes two sets of boundary scores: those labeled `boundary_all` include the edge boundaries, while those labeled `boundary_noedge` exclude them. If multiple sentences are provided, the measures are computed over all of them (the lexicon is computed over all sentences, rather than per sentence).
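To illustrate how including the edge boundaries changes the boundary scores, here is a minimal sketch. It is not the module's code: `boundary_counts` and `pooled_boundary_scores` are hypothetical helpers, each utterance is assumed to contribute exactly one initial and one final edge boundary, and counts are assumed to be summed over all utterances before precision and recall are computed.

```python
def boundary_counts(gold, predicted, include_edges):
    """Count (correct, predicted, gold) boundaries for one utterance."""
    def boundaries(sentence):
        offsets, pos = set(), 0
        for word in sentence.split():
            pos += len(word)
            offsets.add(pos)      # boundary after each word
        if include_edges:
            offsets.add(0)        # utterance-initial edge
        else:
            offsets.discard(pos)  # drop the utterance-final edge
        return offsets

    g, p = boundaries(gold), boundaries(predicted)
    return len(g & p), len(p), len(g)


def pooled_boundary_scores(pairs, include_edges):
    """Sum counts over all utterances, then compute precision, recall, F-score."""
    correct = predicted = gold = 0
    for g, p in pairs:
        c, n_p, n_g = boundary_counts(g, p, include_edges)
        correct, predicted, gold = correct + c, predicted + n_p, gold + n_g
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


pairs = [("the dog is on the boat", "thedog is on the boat")]
noedge = pooled_boundary_scores(pairs, include_edges=False)
with_edges = pooled_boundary_scores(pairs, include_edges=True)
print([round(v, 2) for v in noedge])      # [1.0, 0.8, 0.89]
print([round(v, 2) for v in with_edges])  # [1.0, 0.86, 0.92]
```

Because the edge boundaries are always matched, including them can only raise the scores: recall moves from 4/5 to 6/7 in this example.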

	## How to Use

	At minimum, this metric requires predictions and references as inputs.

	```python
	>>> import evaluate
	>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
	>>> references = ["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"]
	>>> predictions = ["w ɛ ɹ WORD_BOUNDARY ɪ z WORD_BOUNDARY ð ɪ s WORD_BOUNDARY", "l ɪ ɾ əl WORD_BOUNDARY aɪ z WORD_BOUNDARY"]
	>>> results = segmentation_scores.compute(references=references, predictions=predictions)
	>>> print(results)
	{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
	```

	### Inputs
	- predictions (`list` of `str`): Predicted segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY".
	- references (`list` of `str`): Ground truth segmentations, with characters separated with spaces and word boundaries marked with "WORD_BOUNDARY".
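As an illustration of the input format, the following sketch converts an ordinary space-separated sentence into a string with one symbol per character and a boundary marker after every word, mirroring the layout of the example in How to Use. The helper `to_metric_format` is hypothetical and not part of the module. It treats every character as one symbol, so it would not suit transcriptions in which a single phoneme spans several characters (such as "aɪ"); those would need to be tokenized beforehand.

```python
def to_metric_format(sentence, boundary="WORD_BOUNDARY"):
    """Turn 'the dog' into 't h e WORD_BOUNDARY d o g WORD_BOUNDARY'."""
    pieces = []
    for word in sentence.split():
        pieces.extend(word)      # one symbol per character
        pieces.append(boundary)  # boundary marker after every word
    return " ".join(pieces)


gold = to_metric_format("the dog is on the boat")
predicted = to_metric_format("thedog is on the boat")
print(gold)
# t h e WORD_BOUNDARY d o g WORD_BOUNDARY i s WORD_BOUNDARY o n WORD_BOUNDARY t h e WORD_BOUNDARY b o a t WORD_BOUNDARY
print(predicted)
# t h e d o g WORD_BOUNDARY i s WORD_BOUNDARY o n WORD_BOUNDARY t h e WORD_BOUNDARY b o a t WORD_BOUNDARY
```

The resulting strings could then be passed as single-element lists to `references` and `predictions`.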

	### Output Values

	All scores range from a minimum of 0 to a maximum of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.

	- boundary_all_precision (`float`): Boundary precision score, including edge boundaries.
	- boundary_all_recall (`float`): Boundary recall score, including edge boundaries.
	- boundary_all_fscore (`float`): Boundary F-score, including edge boundaries.
	- boundary_noedge_precision (`float`): Boundary precision score, excluding edge boundaries.
	- boundary_noedge_recall (`float`): Boundary recall score, excluding edge boundaries.
	- boundary_noedge_fscore (`float`): Boundary F-score, excluding edge boundaries.
	- token_precision (`float`): Token/Word precision score.
	- token_recall (`float`): Token/Word recall score.
	- token_fscore (`float`): Token/Word F-score.
	- type_precision (`float`): Type/Lexicon precision score.
	- type_recall (`float`): Type/Lexicon recall score.
	- type_fscore (`float`): Type/Lexicon F-score.
