---
license: mit
language:
  - ru
pipeline_tag: text-classification
tags:
  - safetensors
  - text-classification
  - tensorflow
  - russian
library_name: tf-keras
widget:
  - text: Мне нравится этот фильм!
    output:
      - label: POSITIVE
        score: 0.98
      - label: NEGATIVE
        score: 0.02
  - text: Какой же ты идиот..
    output:
      - label: POSITIVE
        score: 0.01
      - label: NEGATIVE
        score: 0.99
  - text: Паша, купи уже TURMS
    output:
      - label: POSITIVE
        score: 0.82
      - label: NEGATIVE
        score: 0.18
  - text: Дp пошtл ты, идиот
    output:
      - label: POSITIVE
        score: 0.01
      - label: NEGATIVE
        score: 0.99
---

1D-CNN-MC-toxicity-classifier-ru

(One-Dimensional Convolutional Neural Network with Multi-Channel input)

Architectural visualization:

Total parameters: 503,249

Test Accuracy: 94.44%
Training Accuracy: 97.46%
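
The exact layer configuration is not listed in this card, so the sketch below is only a rough illustration of the multi-channel 1D CNN pattern in tf-keras: parallel Conv1D branches with different kernel sizes over a shared embedding, merged before the final sigmoid output. The vocabulary size, embedding dimension, filter counts, and kernel sizes are assumptions and do not reproduce the 503,249-parameter model described here.

```python
from tensorflow.keras import layers, models

vocab_size = 10000   # assumed vocabulary size (illustrative)
max_len = 400        # matches the recommended maximum input length

inputs = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, 32)(inputs)

# Multi-channel input: parallel Conv1D branches with different kernel sizes
branches = []
for kernel_size in (3, 4, 5):
    x = layers.Conv1D(64, kernel_size, activation='relu')(emb)
    x = layers.GlobalMaxPooling1D()(x)
    branches.append(x)

merged = layers.concatenate(branches)
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(1, activation='sigmoid')(merged)  # toxic vs. normal

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```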

This model performs binary toxicity classification (toxic / normal) of Russian-language Cyrillic text.

It was trained on a balanced dataset of 75,093 negative and 75,093 positive rows.
Recommended input sequence length: 25-400 Cyrillic characters.
The dataset strings were simplified as follows (a code sketch of these steps is given after the list):

- Removing extra spaces.
- Lowercasing all letters (Я -> я).
- Removing any non-Cyrillic characters (z, !, ., #, 4, &, etc.), including those embedded inside words.
- Replacing ё with е.
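
A minimal sketch of these cleanup steps, assuming non-Cyrillic characters are deleted outright; the helper name normalize_text is illustrative, not part of the model files:

```python
import re

def normalize_text(text: str) -> str:
    # Lowercase all letters (Я -> я)
    text = text.lower()
    # Replace ё with е before filtering, since ё falls outside the а-я range
    text = text.replace('ё', 'е')
    # Remove any non-Cyrillic characters (Latin letters, digits, punctuation, symbols)
    text = re.sub(r'[^а-я ]', '', text)
    # Collapse extra spaces; the recommended cleaned length is 25-400 characters
    return re.sub(r' +', ' ', text).strip()

print(normalize_text("Дp пошtл ты, идиот"))
```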

Example of use:

```python
from tensorflow import keras
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
from safetensors.numpy import load_file
import os

# Directory containing the model files
model_dir = 'model'
# Maximum sequence length (matches the training configuration)
max_len = 400

# Load the model architecture
with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
    model_json = json_file.read()
model = keras.models.model_from_json(model_json)

# Load the weights from the safetensors file and restore them in order
state_dict = load_file(os.path.join(model_dir, 'tf_model.safetensors'))
weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
model.set_weights(weights)

# Load the tokenizer
with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
tokenizer = tokenizer_from_json(tokenizer_json)

def predict_toxicity(text):
    # Convert the text to a padded integer sequence and run the model
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
    probability = model.predict(padded)[0][0]
    class_label = "toxic" if probability >= 0.5 else "normal"
    return class_label, probability

# Example usage
text = "Да какой идиот сделал эту НС?"
class_label, probability = predict_toxicity(text)
print(f"Text: {text}")
print(f"Class: {class_label} ({probability:.2%})")
```
Output:

```
Text: Да какой идиот сделал эту НС?
Class: toxic (99.35%)
```
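
If the input text has not already been cleaned, it may help to apply the same simplifications used on the training data before calling predict_toxicity. A short sketch using the illustrative normalize_text helper from the preprocessing section:

```python
raw_text = "Дp пошtл ты, идиот"
cleaned = normalize_text(raw_text)  # illustrative cleanup helper sketched above
class_label, probability = predict_toxicity(cleaned)
print(f"Text: {raw_text}")
print(f"Class: {class_label} ({probability:.2%})")
```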