metadata

title: Text Duplicates
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
  - evaluate
  - measurement
description: Returns the fraction of duplicate strings in the input.

Measurement Card for Text Duplicates

Measurement Description

The text_duplicates measurement returns the fraction of duplicated strings in the input data.
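As a rough illustration of what "fraction of duplicated strings" can mean, the sketch below computes one minus the ratio of unique strings to total strings, which is consistent with the example outputs shown in this card. The helper name `duplicate_fraction_sketch` is hypothetical, and the use of MD5 digests from `hashlib` is an assumption based on the reference listed at the end of this card, not a statement of the measurement's actual implementation.

```python
import hashlib


def duplicate_fraction_sketch(data):
    """Hypothetical helper: share of entries that repeat an earlier entry."""
    if not data:
        raise ValueError("data must contain at least one string")
    # Assumption: compare fixed-size digests instead of the raw strings.
    unique_hashes = {hashlib.md5(s.encode("utf-8")).hexdigest() for s in data}
    # 1 - (unique / total) is 0.0 when every string is distinct.
    return 1 - len(unique_hashes) / len(data)


print(duplicate_fraction_sketch(["hello sun", "hello moon", "hello sun"]))
```

With three strings of which two are unique, the result is approximately one third; a list with no repeats gives 0.0.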

How to Use

This measurement requires a list of strings as input:

>>> import evaluate
>>> data = ["hello sun", "hello moon", "hello sun"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)

Inputs

data (list of str): The input list of strings for which the duplicates are calculated.

Output Values

duplicate_fraction (float): the fraction of duplicates in the input string(s).
duplicates_dict (dict): (optional) a dictionary mapping each duplicated string to the number of times it occurs in the input.

By default, this measurement outputs a dictionary containing the fraction of duplicates in the input string(s) (duplicate_fraction):

{'duplicate_fraction': 0.33333333333333337}

With the list_duplicates=True option, this measurement also outputs a dictionary (duplicates_dict) mapping each duplicated string to its count.

{'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2}}

Warning: the list_duplicates=True option can be memory-intensive for large datasets.
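To make the memory warning concrete, here is a minimal sketch of how a duplicates dictionary could be built with `collections.Counter`. The helper name `list_duplicates_sketch` is hypothetical and this is not necessarily how the measurement is implemented. The point it shows is that a counter keeps every distinct string as a key, so memory grows with the number of distinct strings in the input.

```python
from collections import Counter


def list_duplicates_sketch(data):
    """Hypothetical helper: map each repeated string to its occurrence count."""
    # The Counter holds one entry per distinct string, which is what makes
    # this approach costly on large datasets with many distinct values.
    counts = Counter(data)
    # Keep only the strings that occur more than once.
    return {text: n for text, n in counts.items() if n > 1}


data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
print(list_duplicates_sketch(data))
```

Strings that appear only once, such as "goodbye moon" here, are counted internally but left out of the returned dictionary.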

Examples

Example with no duplicates:

>>> data = ["foo", "bar", "foobar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
>>> print(results)
{'duplicate_fraction': 0.0}

Example with multiple duplicates and list_duplicates=True:

>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data, list_duplicates=True)
>>> print(results)
{'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}}

Citation(s)

Further References

hashlib library