text_duplicates / README.md

lvwerra HF staff

Update Space (evaluate main: eb281894)

df6ae6f about 2 years ago


	---
	title: Text Duplicates
	emoji: 🤗
	colorFrom: green
	colorTo: purple
	sdk: gradio
	sdk_version: 3.0.2
	app_file: app.py
	pinned: false
	tags:
	- evaluate
	- measurement
	description: >-
	  Returns the fraction of duplicate strings in the input.
	---

	# Measurement Card for Text Duplicates

	## Measurement Description

	The `text_duplicates` measurement returns the fraction of duplicated strings in the input data.
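The fraction can be read as one minus the share of distinct strings. The following is a minimal sketch of that idea, not the measurement's actual source: the helper name `duplicate_fraction`, the choice of SHA-256 from `hashlib`, and the handling of empty input are all assumptions made for illustration.

```python
import hashlib


def duplicate_fraction(data):
    """Return 1 - (number of distinct strings / total number of strings)."""
    if not data:
        # Assumption: an empty input is treated as containing no duplicates.
        return 0.0
    # Compare fixed-size digests instead of the full strings.
    # SHA-256 is an illustrative choice, not necessarily what the measurement uses.
    digests = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in data]
    return 1 - len(set(digests)) / len(digests)


print(duplicate_fraction(["hello sun", "hello moon", "hello sun"]))
```

With three strings of which two are distinct, this evaluates `1 - 2/3`, which is why the reported value is slightly off from an exact one third in floating point.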

	## How to Use

	This measurement requires a list of strings as input:

	```python
	>>> import evaluate
	>>> data = ["hello sun", "hello moon", "hello sun"]
	>>> duplicates = evaluate.load("text_duplicates")
	>>> results = duplicates.compute(data=data)
	```

	### Inputs
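	- data (list of `str`): The input list of strings for which the duplicates are calculated.
	- list_duplicates (`bool`): (optional) whether to also return the duplicated strings and their counts. Defaults to `False`.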

	### Output Values
	- duplicate_fraction (`float`): the fraction of duplicates in the input strings.
	- duplicates_dict (`dict`): (optional) a dictionary mapping each duplicated string to the number of times it occurs in the input.

	By default, this measurement outputs a dictionary containing the fraction of duplicates in the input strings (`duplicate_fraction`):

	```python
	{'duplicate_fraction': 0.33333333333333337}
	```

	With the `list_duplicates=True` option, this measurement also outputs a dictionary (`duplicates_dict`) mapping each duplicated string to its count.

	```python
	{'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2}}
	```

	Warning: the `list_duplicates=True` option can be memory-intensive for large datasets.
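To see where the memory cost comes from, here is a minimal sketch of how such a dictionary could be built. It is an illustration under assumptions, not the measurement's implementation: the helper name `list_duplicates` and the use of `collections.Counter` are hypothetical.

```python
from collections import Counter


def list_duplicates(data):
    """Map each string that occurs more than once to its number of occurrences."""
    # The counter keeps one entry per distinct string, so memory grows
    # with the number (and length) of unique strings in the input.
    counts = Counter(data)
    return {text: n for text, n in counts.items() if n > 1}


data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
print(list_duplicates(data))
# {'hello sun': 2, 'foo bar': 2}
```

Unlike the fraction alone, which only needs the number of distinct entries, listing duplicates requires keeping the strings themselves available until counting is finished.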

	### Examples

	Example with no duplicates:

	```python
	>>> import evaluate
	>>> data = ["foo", "bar", "foobar"]
	>>> duplicates = evaluate.load("text_duplicates")
	>>> results = duplicates.compute(data=data)
	>>> print(results)
	{'duplicate_fraction': 0.0}
	```

	Example with multiple duplicates and `list_duplicates=True`:

	```python
	>>> import evaluate
	>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
	>>> duplicates = evaluate.load("text_duplicates")
	>>> results = duplicates.compute(data=data, list_duplicates=True)
	>>> print(results)
	{'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}}
	```

	## Citation(s)


	## Further References
	- [`hashlib` library](https://docs.python.org/3/library/hashlib.html)