---
title: Text Duplicates
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  Returns the fraction of duplicate strings in the input.
---
# Measurement Card for Text Duplicates
## Measurement Description
The `text_duplicates` measurement returns the fraction of duplicated strings in the input data.
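
To make "fraction of duplicated strings" concrete, here is a minimal sketch, not the measurement's actual code. It assumes the fraction is one minus the ratio of distinct strings to total strings, which agrees with the example outputs in this card. The helper name `duplicate_fraction` and the choice of SHA-256 digests for comparison are assumptions made for illustration.

```python
import hashlib


def duplicate_fraction(data):
    """Share of entries that repeat an earlier entry (hypothetical helper)."""
    if not data:
        raise ValueError("data must contain at least one string")
    # Compare fixed-size digests instead of the full strings (an assumption
    # of this sketch; a plain set of the strings would give the same count).
    digests = {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in data}
    return 1 - len(digests) / len(data)


fraction = duplicate_fraction(["hello sun", "hello moon", "hello sun"])
print(fraction)  # roughly one third: one of the three entries is a repeat
```

Under this reading, a string that appears twice contributes one duplicate, not two: only the extra copies count toward the fraction.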
## How to Use
This measurement requires a list of strings as input:
```python
>>> import evaluate
>>> data = ["hello sun", "hello moon", "hello sun"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
```
### Inputs
- **data** (list of `str`): The input list of strings for which the duplicates are calculated.
- **list_duplicates** (`bool`): (optional) whether to also return the duplicate strings and their counts. Off by default.
### Output Values
- **duplicate_fraction** (`float`): the fraction of duplicates in the input string(s).
- **duplicates_dict** (`dict`): (optional) a dictionary mapping each duplicated string to the number of times it appears in the input.

By default, this measurement outputs a dictionary containing only the fraction of duplicates in the input string(s) (`duplicate_fraction`):
```python
{'duplicate_fraction': 0.33333333333333337}
```
With the `list_duplicates=True` option, this measurement also outputs a dictionary that maps each duplicate string to its count:
```python
{'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2}}
```
Warning: the `list_duplicates=True` option can be memory-intensive for large datasets.
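
The memory cost is easier to see in a small sketch. The following is an illustration only, assuming the duplicate listing is built by counting every string; the helper name `list_duplicates` is hypothetical and this is not necessarily how the measurement does it.

```python
from collections import Counter


def list_duplicates(data):
    """Map each repeated string to its count (hypothetical helper)."""
    # The counter holds one key per distinct string, so its size grows with
    # the number of distinct entries in the input, not just the duplicates.
    counts = Counter(data)
    return {text: n for text, n in counts.items() if n > 1}


data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
duplicates = list_duplicates(data)
print(duplicates)  # {'hello sun': 2, 'foo bar': 2}
```

With this approach, a dataset of mostly unique long strings keeps nearly all of them in memory at once, which is why the option is best reserved for inputs that fit comfortably in RAM.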
### Examples
Example with no duplicates:
```python
>>> import evaluate
>>> data = ["foo", "bar", "foobar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
>>> print(results)
{'duplicate_fraction': 0.0}
```
Example with multiple duplicates and `list_duplicates=True`:
```python
>>> import evaluate
>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data, list_duplicates=True)
>>> print(results)
{'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}}
```
## Citation(s)
## Further References
- [`hashlib` library](https://docs.python.org/3/library/hashlib.html)