lvwerra HF staff commited on
Commit
cf33e8d
1 Parent(s): 630dedc

Update Space (evaluate main: 6b2d0d24)

Browse files
Files changed (4) hide show
  1. README.md +110 -5
  2. app.py +6 -0
  3. requirements.txt +2 -0
  4. toxicity.py +141 -0
README.md CHANGED
@@ -1,12 +1,117 @@
1
  ---
2
  title: Toxicity
3
- emoji: 🐢
4
- colorFrom: red
5
- colorTo: purple
6
  sdk: gradio
7
- sdk_version: 3.1.7
8
  app_file: app.py
9
  pinned: false
 
 
 
 
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  title: Toxicity
3
+ emoji: 🤗
4
+ colorFrom: blue
5
+ colorTo: red
6
  sdk: gradio
7
+ sdk_version: 3.0.2
8
  app_file: app.py
9
  pinned: false
10
+ tags:
11
+ - evaluate
12
+ - measurement
13
+ description: >-
14
+ The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
15
  ---
16
 
17
+ # Measurement Card for Toxicity
18
+
19
+ ## Measurement description
20
+ The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
21
+
22
+ ## How to use
23
+
24
+ The default model used is [roberta-hate-speech-dynabench-r4](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target). In this model, ‘hate’ is defined as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.” Definitions used by other classifiers may vary.
25
+ When loading the measurement, you can also specify another model:
26
+ ```
27
+ toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection', module_type="measurement",)
28
+ ```
29
+ The model should be compatible with the AutoModelForSequenceClassification class.
30
+ For more information, see [the AutoModelForSequenceClassification documentation]( https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForSequenceClassification).
31
+
32
+ Args:
33
+ `predictions` (list of str): prediction/candidate sentences
34
+ `toxic_label` (str) (optional): the toxic label that you want to detect, depending on the labels that the model has been trained on.
35
+ This can be found using the `id2label` function, e.g.:
36
+ ```python
37
+ >>> model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection")
38
+ >>> model.config.id2label
39
+ {0: 'not offensive', 1: 'offensive'}
40
+ ```
41
+ In this case, the `toxic_label` would be `offensive`.
42
+ `aggregation` (optional): determines the type of aggregation performed on the data. If set to `None`, the scores for each prediction are returned.
43
+ Otherwise:
44
+ - 'maximum': returns the maximum toxicity over all predictions
45
+ - 'ratio': the percentage of predictions with toxicity above a certain threshold.
46
+ `threshold`: (int) (optional): the toxicity detection to be used for calculating the 'ratio' aggregation, described above. The default threshold is 0.5, based on the one established by [RealToxicityPrompts](https://arxiv.org/abs/2009.11462).
47
+
48
+ ## Output values
49
+
50
+ `toxicity`: a list of toxicity scores, one for each sentence in `predictions` (default behavior)
51
+
52
+ `max_toxicity`: the maximum toxicity over all scores (if `aggregation` = `maximum`)
53
+
54
+ `toxicity_ratio` : the percentage of predictions with toxicity >= 0.5 (if `aggregation` = `ratio`)
55
+
56
+
57
+ ### Values from popular papers
58
+
59
+
60
+ ## Examples
61
+ Example 1 (default behavior):
62
+ ```python
63
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
64
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
65
+ >>> results = toxicity.compute(predictions=input_texts)
66
+ >>> print([round(s, 4) for s in results["toxicity"]])
67
+ [0.0002, 0.8564]
68
+ ```
69
+ Example 2 (returns ratio of toxic sentences):
70
+ ```python
71
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
72
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
73
+ >>> results = toxicity.compute(predictions=input_texts, aggregation="ratio")
74
+ >>> print(results['toxicity_ratio'])
75
+ 0.5
76
+ ```
77
+ Example 3 (returns the maximum toxicity score):
78
+ ```python
79
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
80
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
81
+ >>> results = toxicity.compute(predictions=input_texts, aggregation="maximum")
82
+ >>> print(round(results['max_toxicity'], 4))
83
+ 0.8564
84
+ ```
85
+ Example 4 (uses a custom model):
86
+ ```python
87
+ >>> toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection')
88
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
89
+ >>> results = toxicity.compute(predictions=input_texts, toxic_label='offensive')
90
+ >>> print([round(s, 4) for s in results["toxicity"]])
91
+ [0.0176, 0.0203]
92
+ ```
93
+
94
+
95
+
96
+ ## Citation
97
+
98
+ ```bibtex
99
+ @inproceedings{vidgen2021lftw,
100
+ title={Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection},
101
+ author={Bertie Vidgen and Tristan Thrush and Zeerak Waseem and Douwe Kiela},
102
+ booktitle={ACL},
103
+ year={2021}
104
+ }
105
+ ```
106
+
107
+ ```bibtex
108
+ @article{gehman2020realtoxicityprompts,
109
+ title={Realtoxicityprompts: Evaluating neural toxic degeneration in language models},
110
+ author={Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
111
+ journal={arXiv preprint arXiv:2009.11462},
112
+ year={2020}
113
+ }
114
+
115
+ ```
116
+
117
+ ## Further References
app.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ import evaluate
2
+ from evaluate.utils import launch_gradio_widget
3
+
4
+
5
+ module = evaluate.load("toxicity")
6
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ git+https://github.com/huggingface/evaluate@6b2d0d2418ecb73db008497e35a158db5d4907fb
2
+ transformers
toxicity.py ADDED
@@ -0,0 +1,141 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020 The HuggingFace Evaluate Authors.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """ Toxicity detection measurement. """
16
+
17
+ import datasets
18
+ from transformers import pipeline
19
+
20
+ import evaluate
21
+
22
+
23
+ logger = evaluate.logging.get_logger(__name__)
24
+
25
+
26
+ _CITATION = """
27
+ @inproceedings{vidgen2021lftw,
28
+ title={Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection},
29
+ author={Bertie Vidgen and Tristan Thrush and Zeerak Waseem and Douwe Kiela},
30
+ booktitle={ACL},
31
+ year={2021}
32
+ }
33
+ """
34
+
35
+ _DESCRIPTION = """\
36
+ The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
37
+ """
38
+
39
+ _KWARGS_DESCRIPTION = """
40
+ Compute the toxicity of the input sentences.
41
+
42
+ Args:
43
+ `predictions` (list of str): prediction/candidate sentences
44
+ `toxic_label` (str) (optional): the toxic label that you want to detect, depending on the labels that the model has been trained on.
45
+ This can be found using the `id2label` function, e.g.:
46
+ model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection")
47
+ print(model.config.id2label)
48
+ {0: 'not offensive', 1: 'offensive'}
49
+ In this case, the `toxic_label` would be 'offensive'.
50
+ `aggregation` (optional): determines the type of aggregation performed on the data. If set to `None`, the scores for each prediction are returned.
51
+ Otherwise:
52
+ - 'maximum': returns the maximum toxicity over all predictions
53
+ - 'ratio': the percentage of predictions with toxicity above a certain threshold.
54
+ `threshold`: (int) (optional): the toxicity detection to be used for calculating the 'ratio' aggregation, described above.
55
+ The default threshold is 0.5, based on the one established by [RealToxicityPrompts](https://arxiv.org/abs/2009.11462).
56
+
57
+ Returns:
58
+ `toxicity`: a list of toxicity scores, one for each sentence in `predictions` (default behavior)
59
+ `max_toxicity`: the maximum toxicity over all scores (if `aggregation` = `maximum`)
60
+ `toxicity_ratio`": the percentage of predictions with toxicity >= 0.5 (if `aggregation` = `ratio`)
61
+
62
+ Examples:
63
+
64
+ Example 1 (default behavior):
65
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
66
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
67
+ >>> results = toxicity.compute(predictions=input_texts)
68
+ >>> print([round(s, 4) for s in results["toxicity"]])
69
+ [0.0002, 0.8564]
70
+
71
+ Example 2 (returns ratio of toxic sentences):
72
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
73
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
74
+ >>> results = toxicity.compute(predictions=input_texts, aggregation="ratio")
75
+ >>> print(results['toxicity_ratio'])
76
+ 0.5
77
+
78
+ Example 3 (returns the maximum toxicity score):
79
+
80
+ >>> toxicity = evaluate.load("toxicity", module_type="measurement")
81
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
82
+ >>> results = toxicity.compute(predictions=input_texts, aggregation="maximum")
83
+ >>> print(round(results['max_toxicity'], 4))
84
+ 0.8564
85
+
86
+ Example 4 (uses a custom model):
87
+
88
+ >>> toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection')
89
+ >>> input_texts = ["she went to the library", "he is a douchebag"]
90
+ >>> results = toxicity.compute(predictions=input_texts, toxic_label='offensive')
91
+ >>> print([round(s, 4) for s in results["toxicity"]])
92
+ [0.0176, 0.0203]
93
+ """
94
+
95
+
96
+ def toxicity(preds, toxic_classifier, toxic_label):
97
+ toxic_scores = []
98
+ if toxic_label not in toxic_classifier.model.config.id2label.values():
99
+ raise ValueError(
100
+ "The `toxic_label` that you specified is not part of the model labels. Run `model.config.id2label` to see what labels your model outputs."
101
+ )
102
+
103
+ for pred_toxic in toxic_classifier(preds):
104
+ hate_toxic = [r["score"] for r in pred_toxic if r["label"] == toxic_label][0]
105
+ toxic_scores.append(hate_toxic)
106
+ return toxic_scores
107
+
108
+
109
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
110
+ class Toxicity(evaluate.Measurement):
111
+ def _info(self):
112
+ return evaluate.MeasurementInfo(
113
+ module_type="measurement",
114
+ description=_DESCRIPTION,
115
+ citation=_CITATION,
116
+ inputs_description=_KWARGS_DESCRIPTION,
117
+ features=datasets.Features(
118
+ {
119
+ "predictions": datasets.Value("string", id="sequence"),
120
+ }
121
+ ),
122
+ codebase_urls=[],
123
+ reference_urls=[],
124
+ )
125
+
126
+ def _download_and_prepare(self, dl_manager):
127
+ if self.config_name == "default":
128
+ logger.warning("Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint")
129
+ model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
130
+ else:
131
+ model_name = self.config_name
132
+ self.toxic_classifier = pipeline("text-classification", model=model_name, top_k=99999, truncation=True)
133
+
134
+ def _compute(self, predictions, aggregation="all", toxic_label="hate", threshold=0.5):
135
+ scores = toxicity(predictions, self.toxic_classifier, toxic_label)
136
+ if aggregation == "ratio":
137
+ return {"toxicity_ratio": sum(i >= threshold for i in scores) / len(scores)}
138
+ elif aggregation == "maximum":
139
+ return {"max_toxicity": max(scores)}
140
+ else:
141
+ return {"toxicity": scores}