---
license: mit
language:
- ru
pipeline_tag: text-classification
tags:
- safetensors
- text-classification
- tensorflow
- russian
library_name: tf-keras
widget:
- text: Мне нравится этот фильм!
  output:
  - label: POSITIVE
    score: 0.98
  - label: NEGATIVE
    score: 0.02
- text: Какой же ты идиот..
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
- text: Паша, купи уже TURMS
  output:
  - label: POSITIVE
    score: 0.82
  - label: NEGATIVE
    score: 0.18
- text: Дp пошtл ты, идиот
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
---
1D-CNN-MC-toxicity-classifier-ru
(One-Dimensional Convolutional Neural Network with Multi-Channel input)
Architectural visualization:
Total parameters: 503249
Test Accuracy: 94.44%
Training Accuracy: 97.46%
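The exact layer configuration ships with the model in model_architecture.json. Purely as an illustration of the architecture named in the title, the sketch below builds a multi-channel 1D-CNN text classifier in tf-keras; the vocabulary size, embedding dimension, kernel sizes, and filter counts are assumptions and do not necessarily reproduce the 503249 parameters reported above.
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters (assumptions) -- the real values are
# defined in model_architecture.json shipped with this repository.
vocab_size = 20000
max_len = 400
embed_dim = 64

inputs = keras.Input(shape=(max_len,), dtype='int32')
embedded = layers.Embedding(vocab_size, embed_dim)(inputs)

# Multi-channel input: parallel Conv1D branches with different kernel sizes
branches = []
for kernel_size in (3, 4, 5):
    branch = layers.Conv1D(filters=64, kernel_size=kernel_size, activation='relu')(embedded)
    branch = layers.GlobalMaxPooling1D()(branch)
    branches.append(branch)

merged = layers.concatenate(branches)
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(1, activation='sigmoid')(merged)  # toxicity probability

model = keras.Model(inputs, outputs)
model.summary()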
This model is designed for binary toxicity classification of Cyrillic text.
The training dataset contained 75093 negative and 75093 positive examples.
Recommended input length: 25 to 400 Cyrillic characters.
Simplifications applied to the dataset strings (a sketch of this preprocessing is shown below):
Removing extra spaces.
Converting uppercase letters to lowercase (Я -> я).
Removing any non-Cyrillic characters, including Latin letters, digits, and punctuation (z, !, ., #, 4, &, etc.).
Replacing ё with е.
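The cleaning code used to build the dataset is not published here; the following is a minimal sketch of the simplifications listed above (the function name simplify_text and the exact regular expressions are assumptions):
import re

def simplify_text(text: str) -> str:
    # Convert uppercase letters to lowercase (Я -> я)
    text = text.lower()
    # Replace ё with е
    text = text.replace('ё', 'е')
    # Remove any non-Cyrillic characters (only а-я and spaces are kept)
    text = re.sub(r'[^а-я ]', '', text)
    # Remove extra spaces
    text = re.sub(r' +', ' ', text).strip()
    return text

print(simplify_text('Дp пошtл ты, идиот'))  # -> 'д пошл ты идиот'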
Example of use:
import os
import re
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
from safetensors.numpy import load_file

# Directory where the model files are stored
model_dir = 'model'
max_len = 400

# Load the model architecture
with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
    model_json = json_file.read()
model = keras.models.model_from_json(model_json)

# Load the weights from the safetensors file
state_dict = load_file(os.path.join(model_dir, 'tf_model.safetensors'))
weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
model.set_weights(weights)

# Load the tokenizer
with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
tokenizer = tokenizer_from_json(tokenizer_json)

def predict_toxicity(text):
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
    probability = model.predict(padded)[0][0]
    class_label = "toxic" if probability >= 0.5 else "normal"
    return class_label, probability

# Usage example
text = "Да какой идиот сделал эту НС?"
class_label, probability = predict_toxicity(text)
print(f"Text: {text}")
print(f"Class: {class_label} ({probability:.2%})")
Output:
Text: Да какой идиот сделал эту НС?
Class: toxic (99.35%)
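The card does not state whether the dataset simplifications should also be applied at inference time; if they should, the hypothetical simplify_text sketch from the preprocessing section can be chained with predict_toxicity:
raw_text = "Мне нравится этот фильм!"
class_label, probability = predict_toxicity(simplify_text(raw_text))
print(f"Class: {class_label} ({probability:.2%})")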