Implement metric
Browse files
- README.md +48 -12
- segmentation_scores.py +258 -25
- tests.py +6 -11
README.md
CHANGED
tags:
- evaluate
- metric
language:
- en
description: "metric for word segmentation scores"
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Metric Card for Segmentation Scores
## Metric Description

There are several standard metrics for evaluating word segmentation performance. Given a segmented text, we can evaluate it against a gold standard according to the placement of the *boundaries*, the set of word *tokens* produced, and the set of word *types* produced. For each of these, we can compute *precision*, *recall* and *F-score*. In the literature, token and type scores are also referred to as *word* and *lexicon* scores, respectively.

For example, if our gold segmentation is "the dog is on the boat", we have 5 word boundaries (7 if you include the edge boundaries), 6 word tokens and 5 word types. If a model predicted the segmentation "thedog is on the boat", this would differ from the gold segmentation in terms of 1 boundary (1 boundary missing), 3 word tokens ("the" and "dog" missing, "thedog" added) and 2 word types ("dog" missing and "thedog" added). For this example, we'd have a *boundary precision* of 1.0 (no incorrect boundaries), a *boundary recall* of 0.8 (4 boundaries hit out of 5) and a *boundary F-score* of 0.89 (the harmonic mean of precision and recall). The full list of scores would be:

| Score              | Value |
|--------------------|-------|
| Boundary Precision | 1.0   |
| Boundary Recall    | 0.8   |
| Boundary F-Score   | 0.89  |
| Token Precision    | 0.8   |
| Token Recall       | 0.67  |
| Token F-Score      | 0.73  |
| Type Precision     | 0.8   |
| Type Recall        | 0.8   |
| Type F-Score       | 0.8   |

Generally, type scores < token scores < boundary scores. This module also computes boundary scores that include the edge boundaries, labeled *boundary_all*, with the boundary scores excluding the edges labeled *boundary_noedge*. If multiple sentences are provided, the measures are computed over all of them (in particular, the lexicon is computed over all sentences, rather than per sentence).
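To make the arithmetic concrete, the sketch below (not part of the module itself) reproduces the boundary and token rows of the table from raw counts; the `precision_recall_f` helper is hypothetical and simply spells out the three formulas.

```python
def precision_recall_f(n_correct, n_predicted, n_gold):
    """Standard precision, recall and F-score from raw counts (illustrative helper)."""
    precision = n_correct / n_predicted
    recall = n_correct / n_gold
    fscore = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, fscore

# Boundary scores for "thedog is on the boat" vs. "the dog is on the boat":
# 4 predicted boundaries, all correct, out of 5 gold boundaries.
print(precision_recall_f(4, 4, 5))  # (1.0, 0.8, 0.888...)

# Token scores: 4 of the 5 predicted tokens are correct, out of 6 gold tokens.
print(precision_recall_f(4, 5, 6))  # (0.8, 0.666..., 0.727...)
```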
## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
>>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
>>> print(results)
{'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
```
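For an imperfect segmentation, the same call returns scores below 1.0. As a sketch, here is the worked example from the description above written in the expected character-plus-";eword" format; the rounded values are worked out by hand from the scoring code rather than taken from a run, so treat them as illustrative.

```python
>>> predictions = ["t h e d o g ;eword i s ;eword o n ;eword t h e ;eword b o a t ;eword"]
>>> references = ["t h e ;eword d o g ;eword i s ;eword o n ;eword t h e ;eword b o a t ;eword"]
>>> results = segmentation_scores.compute(predictions=predictions, references=references)
>>> round(results["boundary_noedge_fscore"], 2), round(results["token_fscore"], 2), round(results["type_fscore"], 2)
(0.89, 0.73, 0.8)
```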
### Inputs

- **predictions** (`list` of `str`): Predicted segmentations, with characters separated by spaces and word boundaries marked with ";eword".
- **references** (`list` of `str`): Ground-truth segmentations, with characters separated by spaces and word boundaries marked with ";eword".

### Output Values

All scores have a minimum possible value of 0 and a maximum possible value of 1.0. A higher score is better. F-scores are the harmonic mean of precision and recall.

- **boundary_all_precision** (`float`): Boundary precision score, including edge boundaries.
- **boundary_all_recall** (`float`): Boundary recall score, including edge boundaries.
- **boundary_all_fscore** (`float`): Boundary F-score, including edge boundaries.
- **boundary_noedge_precision** (`float`): Boundary precision score, excluding edge boundaries.
- **boundary_noedge_recall** (`float`): Boundary recall score, excluding edge boundaries.
- **boundary_noedge_fscore** (`float`): Boundary F-score, excluding edge boundaries.
- **token_precision** (`float`): Token/word precision score.
- **token_recall** (`float`): Token/word recall score.
- **token_fscore** (`float`): Token/word F-score.
- **type_precision** (`float`): Type/lexicon precision score.
- **type_recall** (`float`): Type/lexicon recall score.
- **type_fscore** (`float`): Type/lexicon F-score.
<!--
#### Values from Popular Papers
*Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

...

*Cite the source where this metric was introduced.*

## Further References
*Add any useful further references.* -->
segmentation_scores.py
CHANGED
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Segmentation scores evaluation metrics"""

import evaluate
import datasets

# ...

# TODO: Add description of the module here
_DESCRIPTION = """\
This module computes segmentation scores for a list of predicted segmentations and gold segmentations.
"""


# TODO: Add description of the arguments of the module here
_KWARGS_DESCRIPTION = """
Calculates how good the predicted segmentations are, using boundary, token and type scores.
Args:
    predictions: list of segmented utterances to score. Each prediction
        should be a string with phonemes separated by spaces and estimated word boundaries
        denoted by the token ';eword'.
    references: list of segmented utterances to score. Each reference
        should be a string with phonemes separated by spaces and gold word boundaries
        denoted by the token ';eword'.
Returns:
    type_fscore: lexicon f1 score
    type_precision: lexicon precision
    type_recall: lexicon recall
    token_fscore: token f1 score
    token_precision: token precision
    token_recall: token recall
    boundary_all_fscore: boundary f1 score, including utterance boundaries
    boundary_all_precision: boundary precision, including utterance boundaries
    boundary_all_recall: boundary recall, including utterance boundaries
    boundary_noedge_fscore: boundary f1 score, excluding utterance boundaries
    boundary_noedge_precision: boundary precision, excluding utterance boundaries
    boundary_noedge_recall: boundary recall, excluding utterance boundaries
Examples:
    >>> segmentation_scores = evaluate.load("transformersegmentation/segmentation_scores")
    >>> results = segmentation_scores.compute(references=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"], predictions=["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"])
    >>> print(results)
    {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
"""

class TokenEvaluation(object):
    """Evaluation of token f-score, precision and recall"""

    def __init__(self):
        self.test = 0
        self.gold = 0
        self.correct = 0
        self.n = 0
        self.n_exactmatch = 0

    def precision(self):
        return float(self.correct) / self.test if self.test != 0 else None

    def recall(self):
        return float(self.correct) / self.gold if self.gold != 0 else None

    def fscore(self):
        total = self.test + self.gold
        return float(2 * self.correct) / total if total != 0 else None

    def exact_match(self):
        return float(self.n_exactmatch) / self.n if self.n else None

    def update(self, test_set, gold_set):
        self.n += 1

        if test_set == gold_set:
            self.n_exactmatch += 1

        # omit empty items for type scoring (should not affect token
        # scoring). Type lists are prepared with '_' where there is no
        # match, to keep list lengths the same
        self.test += len([x for x in test_set if x != "_"])
        self.gold += len([x for x in gold_set if x != "_"])
        self.correct += len(test_set & gold_set)

    def update_lists(self, test_sets, gold_sets):
        if len(test_sets) != len(gold_sets):
            raise ValueError(
                "#words different in test and gold: {} != {}".format(
                    len(test_sets), len(gold_sets)
                )
            )

        for t, g in zip(test_sets, gold_sets):
            self.update(t, g)


class TypeEvaluation(TokenEvaluation):
    """Evaluation of type f-score, precision and recall"""

    @staticmethod
    def lexicon_check(textlex, goldlex):
        """Compare hypothesis and gold lexicons"""
        textlist = []
        goldlist = []
        for w in textlex:
            if w in goldlex:
                # set up matching lists for the true positives
                textlist.append(w)
                goldlist.append(w)
            else:
                # false positives
                textlist.append(w)
                # ensure matching null element in text list
                goldlist.append("_")

        for w in goldlex:
            if w not in goldlist:
                # now for the false negatives
                goldlist.append(w)
                # ensure matching null element in text list
                textlist.append("_")

        textset = [{w} for w in textlist]
        goldset = [{w} for w in goldlist]
        return textset, goldset

    def update_lists(self, text, gold):
        lt, lg = self.lexicon_check(text, gold)
        super(TypeEvaluation, self).update_lists(lt, lg)

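# Illustration: lexicon_check(["boat", "thedog"], ["boat", "dog"]) pairs the shared word
# "boat" with itself, pads the false positive "thedog" against "_" and the false negative
# "dog" against "_", so update() counts two real entries on each side with one correct
# match (type precision = recall = 0.5 for this pair of lexicons).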

class BoundaryEvaluation(TokenEvaluation):
    @staticmethod
    def get_boundary_positions(stringpos):
        return [{idx for pair in line for idx in pair} for line in stringpos]

    def update_lists(self, text, gold):
        lt = self.get_boundary_positions(text)
        lg = self.get_boundary_positions(gold)
        super(BoundaryEvaluation, self).update_lists(lt, lg)


class BoundaryNoEdgeEvaluation(BoundaryEvaluation):
    @staticmethod
    def get_boundary_positions(stringpos):
        return [{left for left, _ in line if left > 0} for line in stringpos]


class _StringPos(object):
    """Compute start and stop index of words in an utterance"""

    def __init__(self):
        self.idx = 0

    def __call__(self, n):
        """Return the position of the current word given its length `n`"""
        start = self.idx
        self.idx += n
        return start, self.idx


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class segmentation_scores(evaluate.Metric):
    # ...
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features({
                'predictions': datasets.Value('string'),
                'references': datasets.Value('string'),
            }),
            # Homepage of the module for documentation
            homepage="https://huggingface.co/spaces/transformersegmentation/segmentation_scores",
            # Additional links to the codebase or references
            codebase_urls=["http://github.com/codebyzeb/transformersegmentation"],
            reference_urls=["http://path.to.reference.url/new_module"]
        )

    # ...
        # TODO: Download external resources if needed
        pass

    def _process_data(self, text):
        """ Load text data for evaluation
        Parameters
        ----------
        text : list of str
            The list of utterances to read for the evaluation.

        Returns
        -------
        (words, positions, lexicon) : three lists
            where `words` are the input utterances with word separators
            removed, `positions` stores the start/stop index of each word
            for each utterance, and `lexicon` is the list of words.
        """
        words = []
        positions = []
        lexicon = {}

        # ignore empty lines
        for utt in (utt for utt in text if utt.strip()):
            # list of phones in the utterance with the word separator removed
            phone_in_utterance = [
                phone for phone in utt.split(" ") if phone != ";eword"
            ]
            words_in_utterance = (
                "".join(
                    " " if phone == ";eword" else phone for phone in utt.split(" ")
                )
                .strip()
                .split(" ")
            )

            words.append(phone_in_utterance)
            for word in words_in_utterance:
                lexicon[word] = 1
            idx = _StringPos()
            positions.append({idx(len(word)) for word in words_in_utterance})

        # return the words lexicon as a sorted list
        lexicon = sorted([k for k in lexicon.keys()])
        return words, positions, lexicon

def _compute(self, predictions, references):
|
251 |
+
"""Scores a segmented text against its gold version
|
252 |
+
Parameters
|
253 |
+
----------
|
254 |
+
predictions : sequence of str
|
255 |
+
A suite of word utterances, each string using ';eword' as as word separator.
|
256 |
+
references : sequence of str
|
257 |
+
A suite of word utterances, each string using ';eword' as as word separator.
|
258 |
+
|
259 |
+
Returns
|
260 |
+
-------
|
261 |
+
scores : dict
|
262 |
+
A dictionary with the following entries:
|
263 |
+
* 'type_fscore'
|
264 |
+
* 'type_precision'
|
265 |
+
* 'type_recall'
|
266 |
+
* 'token_fscore'
|
267 |
+
* 'token_precision'
|
268 |
+
* 'token_recall'
|
269 |
+
* 'boundary_all_fscore'
|
270 |
+
* 'boundary_all_precision'
|
271 |
+
* 'boundary_all_recall'
|
272 |
+
* 'boundary_noedge_fscore'
|
273 |
+
* 'boundary_noedge_precision'
|
274 |
+
* 'boundary_noedge_recall'
|
275 |
+
|
276 |
+
Raises
|
277 |
+
------
|
278 |
+
ValueError
|
279 |
+
If `gold` and `text` have different size or differ in tokens
|
280 |
+
"""
|
281 |
+
text_words, text_stringpos, text_lex = self._process_data(predictions)
|
282 |
+
gold_words, gold_stringpos, gold_lex = self._process_data(references)
|
283 |
+
|
284 |
+
if len(gold_words) != len(text_words):
|
285 |
+
raise ValueError(
|
286 |
+
"gold and train have different size: len(gold)={}, len(train)={}".format(
|
287 |
+
len(gold_words), len(text_words)
|
288 |
+
)
|
289 |
+
)
|
290 |
+
|
291 |
+
for i, (g, t) in enumerate(zip(gold_words, text_words)):
|
292 |
+
if g != t:
|
293 |
+
raise ValueError(
|
294 |
+
'gold and train differ at line {}: gold="{}", train="{}"'.format(
|
295 |
+
i + 1, g, t
|
296 |
+
)
|
297 |
+
)
|
298 |
+
|
299 |
+
# token evaluation
|
300 |
+
token_eval = TokenEvaluation()
|
301 |
+
token_eval.update_lists(text_stringpos, gold_stringpos)
|
302 |
+
|
303 |
+
# type evaluation
|
304 |
+
type_eval = TypeEvaluation()
|
305 |
+
type_eval.update_lists(text_lex, gold_lex)
|
306 |
+
|
307 |
+
# boundary evaluation (with edges)
|
308 |
+
boundary_eval = BoundaryEvaluation()
|
309 |
+
boundary_eval.update_lists(text_stringpos, gold_stringpos)
|
310 |
+
|
311 |
+
# boundary evaluation (no edges)
|
312 |
+
boundary_noedge_eval = BoundaryNoEdgeEvaluation()
|
313 |
+
boundary_noedge_eval.update_lists(text_stringpos, gold_stringpos)
|
314 |
+
|
315 |
return {
|
316 |
+
"token_precision": token_eval.precision(),
|
317 |
+
"token_recall": token_eval.recall(),
|
318 |
+
"token_fscore": token_eval.fscore(),
|
319 |
+
"type_precision": type_eval.precision(),
|
320 |
+
"type_recall": type_eval.recall(),
|
321 |
+
"type_fscore": type_eval.fscore(),
|
322 |
+
"boundary_all_precision": boundary_eval.precision(),
|
323 |
+
"boundary_all_recall": boundary_eval.recall(),
|
324 |
+
"boundary_all_fscore": boundary_eval.fscore(),
|
325 |
+
"boundary_noedge_precision": boundary_noedge_eval.precision(),
|
326 |
+
"boundary_noedge_recall": boundary_noedge_eval.recall(),
|
327 |
+
"boundary_noedge_fscore": boundary_noedge_eval.fscore(),
|
328 |
}
|
tests.py
CHANGED
test_cases = [
    {
        "predictions": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
        "references": ["w ɛ ɹ ;eword ɪ z ;eword ð ɪ s ;eword", "l ɪ ɾ əl ;eword aɪ z ;eword"],
        "result": {'type_fscore': 1.0, 'type_precision': 1.0, 'type_recall': 1.0, 'token_fscore': 1.0, 'token_precision': 1.0, 'token_recall': 1.0, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 1.0, 'boundary_noedge_fscore': 1.0, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 1.0}
    },
    {
        "predictions": ["thedog is in the boat"],
        "references": ["the dog is in the boat"],
        "result": {'type_fscore': 0.8, 'type_precision': 0.8, 'type_recall': 0.8, 'token_fscore': 0.73, 'token_precision': 0.8, 'token_recall': 0.67, 'boundary_all_fscore': 1.0, 'boundary_all_precision': 1.0, 'boundary_all_recall': 0.94, 'boundary_noedge_fscore': 0.89, 'boundary_noedge_precision': 1.0, 'boundary_noedge_recall': 0.8}
    }
]