metadata

language:
  - pt
license: apache-2.0
tags:
  - toxicity
  - portuguese
  - hate speech
  - offensive language
  - generated_from_trainer
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: dougtrajano/toxic-comment-classification
    results: []
datasets:
  - dougtrajano/olid-br
library_name: transformers

dougtrajano/toxic-comment-classification

Toxic Comment Classification is a model that detects whether a text is toxic.

This BERT model is a fine-tuned version of neuralmind/bert-base-portuguese-cased on the OLID-BR dataset.

Overview

Input: Text in Brazilian Portuguese

Output: Binary classification (toxic or not toxic)

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dougtrajano/toxic-comment-classification")

model = AutoModelForSequenceClassification.from_pretrained("dougtrajano/toxic-comment-classification")
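
The snippet above loads the tokenizer and the model but stops short of a prediction. The sketch below shows one way to get from raw text to a label and a score. The helper names (`softmax`, `pick_label`, `classify`) are illustrative and not part of the model repository. The label strings are read from `model.config.id2label` rather than hard-coded, on the assumption that the published config carries them.

```python
import math

MODEL_ID = "dougtrajano/toxic-comment-classification"


def softmax(logits):
    """Turn a list of raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(texts):
    """Classify a list of Brazilian Portuguese texts (downloads the model)."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Assumption: config.id2label maps class indices to the label names.
    return [pick_label(row, model.config.id2label) for row in logits.tolist()]
```

Calling `classify(["Bom dia a todos!"])` returns a list with one `(label, probability)` pair per input text.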

Limitations and bias

The following factors may degrade the model’s performance.

Text Language: The model was trained on Brazilian Portuguese texts, so it may not work well with other varieties of Portuguese.

Text Origin: The model was trained on texts from social media and a few texts from other sources, so it may not work well on other types of texts.

Trade-offs

The model may perform below its reported level under particular circumstances. This section describes one such situation so that you can plan for it.

Text Length: The model was fine-tuned on texts of 1 to 178 words (18 words on average). It may give poor results on texts whose word count falls outside this range.
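
A simple guard can flag inputs whose length falls outside the range seen during fine-tuning. The sketch below counts words by splitting on whitespace, which is an assumption: the card does not say how words were counted. The function and constant names are illustrative.

```python
TRAIN_MIN_WORDS = 1    # shortest text seen during fine-tuning, per the card
TRAIN_MAX_WORDS = 178  # longest text seen during fine-tuning, per the card


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def within_training_length(text, lo=TRAIN_MIN_WORDS, hi=TRAIN_MAX_WORDS):
    """Return True if the text length lies inside the fine-tuning range."""
    return lo <= word_count(text) <= hi
```

Texts that fail the check can still be classified, but their predictions deserve closer review.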

Performance

The model was evaluated on the test set of the OLID-BR dataset.

Accuracy: 0.8578

Precision: 0.8594

Recall: 0.8578

F1-Score: 0.8580

Class	Precision	Recall	F1-Score	Support
`NOT-OFFENSIVE`	0.8886	0.8490	0.8683	1,775
`OFFENSIVE`	0.8233	0.8686	0.8453	1,438
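
The card does not state how the overall precision, recall, and F1-score were averaged. The figures are numerically consistent with support-weighted averages of the per-class rows, which the sketch below checks using only the numbers in the table. The variable and function names are illustrative.

```python
# Per-class metrics copied from the table above.
PER_CLASS = {
    "NOT-OFFENSIVE": {"precision": 0.8886, "recall": 0.8490, "f1": 0.8683, "support": 1775},
    "OFFENSIVE": {"precision": 0.8233, "recall": 0.8686, "f1": 0.8453, "support": 1438},
}


def weighted_average(rows, metric):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(row["support"] for row in rows.values())
    return sum(row[metric] * row["support"] for row in rows.values()) / total


for metric in ("precision", "recall", "f1"):
    print(metric, round(weighted_average(PER_CLASS, metric), 4))
```

Rounded to four decimals, the three values match the overall precision, recall, and F1-score reported above. Support-weighted recall is by definition equal to accuracy, which agrees with the two identical figures of 0.8578.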

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3.255788747459486e-05
train_batch_size: 8
eval_batch_size: 8
seed: 1993
optimizer: Adam with betas=(0.8445637934160373,0.8338816842140165) and epsilon=2.527092625455385e-08
lr_scheduler_type: linear
num_epochs: 30
label_smoothing_factor: 0.07158711257743958
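
For anyone attempting to reproduce the run, the values above map onto `transformers.TrainingArguments` roughly as sketched below. This is a sketch under assumptions, not the author's training script: the batch sizes are treated as per-device values on a single device, the `output_dir` default is a hypothetical placeholder, and the optimizer is left at the Trainer default because the card does not name the exact Adam variant.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
TRAINING_KWARGS = {
    "learning_rate": 3.255788747459486e-05,
    "per_device_train_batch_size": 8,  # assumes a single device
    "per_device_eval_batch_size": 8,   # assumes a single device
    "seed": 1993,
    "adam_beta1": 0.8445637934160373,
    "adam_beta2": 0.8338816842140165,
    "adam_epsilon": 2.527092625455385e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 30,
    "label_smoothing_factor": 0.07158711257743958,
}


def build_training_arguments(output_dir="toxic-comment-classification"):
    """Build TrainingArguments from the card's values (output_dir is hypothetical)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object can be passed to a `Trainer` together with the base model and the tokenized dataset.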

Framework versions

Transformers 4.26.0
PyTorch 1.10.2+cu113
Datasets 2.9.0
Tokenizers 0.13.2

Provide Feedback

If you have any feedback on this model, please open an issue on GitHub.