---
language:
- pt
license: apache-2.0
tags:
- toxicity
- portuguese
- hate speech
- offensive language
- generated_from_trainer
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: dougtrajano/toxicity-target-classification
  results: []
datasets:
- dougtrajano/olid-br
library_name: transformers
---

# dougtrajano/toxicity-target-classification

Toxicity Target Classification is a model that classifies whether a given text is targeted (`TARGETED INSULT`) or untargeted (`UNTARGETED`).

This BERT model is a fine-tuned version of [neuralmind/bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased) on the [OLID-BR dataset](https://huggingface.co/datasets/dougtrajano/olid-br).

## Overview

**Input:** Text in Brazilian Portuguese

**Output:** Binary classification (targeted or untargeted)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dougtrajano/toxicity-target-classification")
model = AutoModelForSequenceClassification.from_pretrained("dougtrajano/toxicity-target-classification")
```
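
Loading the checkpoint does not by itself produce a prediction. The following is a minimal inference sketch, assuming PyTorch is installed and that the checkpoint's `config.id2label` carries the class names. The helper names `softmax`, `top_label`, and `classify` are hypothetical and not part of this repository.

```python
import math

MODEL_ID = "dougtrajano/toxicity-target-classification"


def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(text, model_id=MODEL_ID):
    """Classify one text; downloads the checkpoint on first use."""
    # Heavy dependencies are imported here so the helpers above stay usable
    # without torch or transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: id2label in the model config maps class indices to names.
    return top_label(logits, model.config.id2label)
```

Calling `classify("...")` with a Brazilian Portuguese text should return a `(label, probability)` pair. For repeated calls, load the tokenizer and model once and reuse them rather than reloading inside the function.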

## Limitations and bias

The following factors may degrade the model’s performance.

**Text Language**: The model was trained on Brazilian Portuguese texts, so it may not work well with other varieties of Portuguese, such as European Portuguese.

**Text Origin**: The model was trained on texts from social media and a few texts from other sources, so it may not work well on other types of texts.

## Trade-offs

Models can underperform under particular circumstances. This section describes situations in which this model may perform less than optimally, so that you can plan accordingly.

**Text Length**: The model was fine-tuned on texts with a word count between 1 and 178 words (average of 18 words). It may give poor results on texts with a word count outside this range.
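
One way to act on this is to flag inputs whose length falls outside the fine-tuning range before trusting a prediction. The sketch below assumes words are counted by splitting on whitespace; the card does not say how the word counts were computed, and the function names are hypothetical.

```python
# Word-count range reported for the fine-tuning data.
MIN_WORDS = 1
MAX_WORDS = 178


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def in_training_range(text):
    """True if the text length lies within the range seen during fine-tuning."""
    return MIN_WORDS <= word_count(text) <= MAX_WORDS
```

Texts for which `in_training_range` returns `False` can still be classified, but their predictions deserve extra scrutiny.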

## Performance

The model was evaluated on the test set of the [OLID-BR](https://dougtrajano.github.io/olid-br/) dataset.

**Accuracy:** 0.6864

**Precision:** 0.6882

**Recall:** 0.6864

**F1-Score:** 0.6872

| Class | Precision | Recall | F1-Score | Support |
| :---: | :-------: | :----: | :------: | :-----: |
| `UNTARGETED` | 0.4912 | 0.5011 | 0.4961 | 443 |
| `TARGETED INSULT` | 0.7759 | 0.7688 | 0.7723 | 995 |
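
The aggregate figures above appear consistent with support-weighted averages of the per-class rows. This is an inference from the arithmetic, not something the card states. A short check under that assumption:

```python
# Per-class rows copied from the table: (precision, recall, f1, support).
PER_CLASS = {
    "UNTARGETED": (0.4912, 0.5011, 0.4961, 443),
    "TARGETED INSULT": (0.7759, 0.7688, 0.7723, 995),
}


def weighted_average(rows):
    """Support-weighted mean of precision, recall, and F1 across classes."""
    total = sum(row[3] for row in rows.values())
    names = ("precision", "recall", "f1")
    return {
        name: sum(row[i] * row[3] for row in rows.values()) / total
        for i, name in enumerate(names)
    }


weighted = weighted_average(PER_CLASS)
for name, value in weighted.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the reported precision, recall, and F1 to within rounding of the per-class inputs. In single-label classification, support-weighted recall equals accuracy, which would explain why the two reported values coincide. Note that the `UNTARGETED` class, with F1 below 0.5, is considerably weaker than the aggregate suggests.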

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 4.174021560583183e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 1993
- optimizer: Adam with betas=(0.9360294728287728,0.9974781444436187) and epsilon=8.016624612627008e-07
- lr_scheduler_type: linear
- num_epochs: 30
- label_smoothing_factor: 0.09936835309930625
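
For anyone attempting to reproduce the run, these values map onto `transformers.TrainingArguments` roughly as sketched below. This is not the author's training script. It assumes a single device (so the reported batch sizes equal the per-device sizes) and the `Trainer`'s default optimizer, and the `output_dir` value is a hypothetical placeholder.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 4.174021560583183e-05,
    "per_device_train_batch_size": 8,  # assumes single-device training
    "per_device_eval_batch_size": 8,
    "seed": 1993,
    "adam_beta1": 0.9360294728287728,
    "adam_beta2": 0.9974781444436187,
    "adam_epsilon": 8.016624612627008e-07,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 30,
    "label_smoothing_factor": 0.09936835309930625,
}


def build_training_arguments(output_dir="toxicity-target-out"):
    """Build TrainingArguments from the values above (needs transformers)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The unusually precise values suggest they came from an automated hyperparameter search, although the card does not say so.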

### Framework versions

- Transformers 4.26.0
- PyTorch 1.10.2+cu113
- Datasets 2.9.0
- Tokenizers 0.13.2

## Provide Feedback

If you have any feedback on this model, please [open an issue](https://github.com/DougTrajano/ToChiquinho/issues/new) on GitHub.