Spaces: evaluate-measurement/word_count

lvwerra HF staff committed on May 27, 2022

Commit 0804d15 • 1 Parent(s): 0756f70

Update Space (evaluate main: 1ead4793)

Files changed (4)

README.md +74 -6
app.py +6 -0
requirements.txt +3 -0
word_count.py +64 -0

README.md CHANGED

@@ -1,12 +1,80 @@
 ---
-title: Word_count
-emoji: 📈
-colorFrom: gray
-colorTo: pink
 sdk: gradio
-sdk_version: 3.0.6
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

 ---
+title: Word Count
+emoji: 🤗
+colorFrom: green
+colorTo: purple
 sdk: gradio
+sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- measurement
 ---
+# Measurement Card for Word Count
+## Measurement Description
+The `word_count` measurement returns the total number of words and the number of unique words in the input string(s), using scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
+## How to Use
+This measurement requires a list of strings as input:
+```python
+>>> import evaluate
+>>> data = ["hello world and hello moon"]
+>>> wordcount = evaluate.load("word_count")
+>>> results = wordcount.compute(data=data)
+```
+### Inputs
+- **data** (list of `str`): The input list of strings in which words are counted.
+- **max_vocab** (`int`): (optional) the maximum vocabulary size; only the most frequent words are counted (useful if the dataset is too large).
+### Output Values
+- **total_word_count** (`int`): the total number of words in the input string(s).
+- **unique_words** (`int`): the number of unique words in the input string(s).
+Output Example(s):
+```python
+{'total_word_count': 5, 'unique_words': 4}
+```
+### Examples
+Example for a single string
+```python
+>>> data = ["hello sun and goodbye moon"]
+>>> wordcount = evaluate.load("word_count")
+>>> results = wordcount.compute(data=data)
+>>> print(results)
+{'total_word_count': 5, 'unique_words': 5}
+```
+Example for multiple strings
+```python
+>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
+>>> wordcount = evaluate.load("word_count")
+>>> results = wordcount.compute(data=data)
+>>> print(results)
+{'total_word_count': 9, 'unique_words': 7}
+```
+Example for a dataset from 🤗 Datasets:
+```python
+>>> imdb = datasets.load_dataset('imdb', split='train')
+>>> wordcount = evaluate.load("word_count")
+>>> results = wordcount.compute(data=imdb['text'])
+>>> print(results)
+{'total_word_count': 5678573, 'unique_words': 74849}
+```
+## Citation(s)
+## Further References
+- [Sklearn `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
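The README examples above can be checked without installing anything. The following is a minimal sketch, not the module's implementation: it assumes `CountVectorizer` is used with its defaults (input lowercased, tokens are runs of two or more word characters), and the helper name `approximate_word_count` is hypothetical.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token_pattern, which keeps
# runs of two or more word characters and skips single-character tokens.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def approximate_word_count(data):
    """Approximate the measurement's output using only the standard library."""
    counts = Counter()
    for text in data:
        # CountVectorizer lowercases its input by default before tokenizing.
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return {
        "total_word_count": sum(counts.values()),
        "unique_words": len(counts),
    }


print(approximate_word_count(["hello world and hello moon"]))
print(approximate_word_count(["Hello hello a b"]))
```

Under these assumptions, one-letter words such as "a" are not counted and "Hello" and "hello" are treated as the same word, so the reported totals can be lower than a whitespace-split count.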

app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("word_count", type="measurement")
+launch_gradio_widget(module)

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+git+https://github.com/huggingface/evaluate.git@main
+datasets~=2.0
+scikit-learn~=1.1.1

word_count.py ADDED

	@@ -0,0 +1,64 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import evaluate
+import datasets
+from sklearn.feature_extraction.text import CountVectorizer
+_DESCRIPTION = """
+Returns the total number of words, and the number of unique words in the input data.
+"""
+_KWARGS_DESCRIPTION = """
+Args:
+    `data`: a list of `str` for which the words are counted.
+    `max_vocab` (optional): the maximum vocabulary size; only the most frequent words are counted (useful if the dataset is too large)
+Returns:
+    `total_word_count` (`int`) : the total number of words in the input string(s)
+    `unique_words` (`int`) : the number of unique words in the input list of strings.
+Examples:
+    >>> data = ["hello world and hello moon"]
+    >>> wordcount = evaluate.load("word_count")
+    >>> results = wordcount.compute(data=data)
+    >>> print(results)
+    {'total_word_count': 5, 'unique_words': 4}
+"""
+_CITATION = ""
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class WordCount(evaluate.EvaluationModule):
+    """This measurement returns the total number of words and the number of unique words
+     in the input string(s)."""
+    def _info(self):
+        return evaluate.EvaluationModuleInfo(
+            # This is the description that will appear on the modules page.
+            module_type="measurement",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features({
+                'data': datasets.Value('string'),
+            })
+        )
+    def _compute(self, data, max_vocab=None):
+        """Returns the total number of words and the number of unique words in the input data"""
+        count_vectorizer = CountVectorizer(max_features=max_vocab)
+        document_matrix = count_vectorizer.fit_transform(data)
+        # Cast the NumPy integer to a plain `int`, matching the documented return type.
+        word_count = int(document_matrix.sum())
+        unique_words = document_matrix.shape[1]
+        return {"total_word_count": word_count, "unique_words": unique_words}
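Because `_compute` passes `max_vocab` to `CountVectorizer(max_features=...)`, both returned values are computed over the truncated vocabulary, so `total_word_count` shrinks along with `unique_words`. The sketch below illustrates that effect with the standard library only. It assumes the vectorizer's default tokenization, the helper name `approximate_word_count_with_cap` is hypothetical, and tie-breaking among equally frequent words at the cutoff may differ from scikit-learn's.

```python
import re
from collections import Counter


def approximate_word_count_with_cap(data, max_vocab=None):
    """Approximate how a vocabulary cap changes both reported counts."""
    counts = Counter()
    for text in data:
        # Assumed defaults: lowercase the text, keep tokens of 2+ word characters.
        counts.update(re.findall(r"\b\w\w+\b", text.lower()))
    # most_common(None) returns every term; otherwise only the top `max_vocab`.
    kept = counts.most_common(max_vocab)
    return {
        "total_word_count": sum(n for _, n in kept),
        "unique_words": len(kept),
    }


data = ["foo foo foo bar bar baz"]
print(approximate_word_count_with_cap(data))               # full vocabulary
print(approximate_word_count_with_cap(data, max_vocab=2))  # "baz" is dropped
```

With the cap set to 2, the single occurrence of "baz" falls outside the vocabulary, so the total drops from 6 to 5 and the unique count from 3 to 2.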