---
license: mit
language:
- ru
pipeline_tag: text-classification
tags:
- safetensors
- text-classification
- tensorflow
- russian
library_name: tf-keras
widget:
- text: Мне нравится этот фильм!
  output:
  - label: POSITIVE
    score: 0.98
  - label: NEGATIVE
    score: 0.02
- text: Какой же ты идиот..
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
- text: Паша, купи уже TURMS
  output:
  - label: POSITIVE
    score: 0.82
  - label: NEGATIVE
    score: 0.18
- text: Дp пошtл ты, идиот
  output:
  - label: POSITIVE
    score: 0.01
  - label: NEGATIVE
    score: 0.99
---

#### 1D-CNN-MC-toxicity-classifier-ru (One-Dimensional Convolutional Neural Network with Multi-Channel input)

Architectural visualization:

![](https://i.imgur.com/skbLM6w.png)

Total parameters: 503,249

##### Test accuracy: 94.44%
##### Training accuracy: 97.46%

This model performs binary toxicity classification (toxic / normal) of Cyrillic text.

##### The training dataset contains 75,093 negative rows and 75,093 positive rows.

##### Recommended input sequence length: 25-400 Cyrillic characters.

##### Simplifications applied to the dataset strings (see the normalization sketch at the bottom of this card):
- Extra spaces are removed.
- Capital letters are replaced with lowercase letters (Я -> я).
- All non-Cyrillic characters are removed, including Latin letters, digits, punctuation and other symbols (z, !, ., #, 4, &, etc.).
- ё is replaced with е.

### Example of use:

```python
import os

from safetensors.numpy import load_file
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Folder that holds the model files
model_dir = 'model'
max_len = 400

# Load the model architecture
with open(os.path.join(model_dir, 'model_architecture.json'), 'r', encoding='utf-8') as json_file:
    model_json = json_file.read()
model = keras.models.model_from_json(model_json)

# Load the weights from safetensors
state_dict = load_file(os.path.join(model_dir, 'tf_model.safetensors'))
weights = [state_dict[f'weight_{i}'] for i in range(len(state_dict))]
model.set_weights(weights)

# Load the tokenizer
with open(os.path.join(model_dir, 'tokenizer.json'), 'r', encoding='utf-8') as f:
    tokenizer_json = f.read()
tokenizer = tokenizer_from_json(tokenizer_json)

def predict_toxicity(text):
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
    probability = model.predict(padded)[0][0]
    class_label = "toxic" if probability >= 0.5 else "normal"
    return class_label, probability

# Usage example
text = "Да какой идиот сделал эту НС?"
class_label, probability = predict_toxicity(text)
print(f"Text: {text}")
print(f"Class: {class_label} ({probability:.2%})")
```

###### Output:

```
Text: Да какой идиот сделал эту НС?
Class: toxic (99.35%)
```
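
##### Normalizing raw input (sketch):

The usage example above passes raw text straight to the tokenizer, while the model was trained on simplified strings. Below is a minimal normalization sketch, assuming the simplifications are exactly the four steps listed earlier in this card; the `simplify_text` helper is illustrative and is not shipped with the model.

```python
import re

def simplify_text(text: str) -> str:
    """Apply the dataset-style simplifications (illustrative helper, not part of the model):
    lowercase, ё -> е, drop non-Cyrillic characters, collapse extra spaces."""
    text = text.lower().replace("ё", "е")
    # Keep only lowercase Cyrillic letters and spaces; everything else is removed.
    text = re.sub(r"[^а-я ]", "", text)
    # Collapse the repeated spaces left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(simplify_text("Дp пошtл ты, идиот"))  # -> "д пошл ты идиот"
```

Applying `simplify_text` to the input before calling `predict_toxicity` keeps inference-time text in the same form as the training data.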
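
##### Architecture sketch:

For readers who prefer code to the diagram above, here is a minimal tf-keras sketch of a multi-channel 1D CNN text classifier. The vocabulary size, embedding dimension, filter counts and kernel sizes are hypothetical placeholders and will not reproduce the 503,249 parameters of the released model; the real configuration is the one stored in `model_architecture.json`.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical hyperparameters for illustration only.
vocab_size = 20000
embed_dim = 64
max_len = 400

inputs = keras.Input(shape=(max_len,), dtype="int32")
embedding = layers.Embedding(vocab_size, embed_dim)(inputs)

# Multi-channel input: parallel Conv1D branches with different kernel sizes,
# each scanning the same embedded sequence with a different window width.
branches = []
for kernel_size in (3, 4, 5):
    x = layers.Conv1D(64, kernel_size, activation="relu", padding="same")(embedding)
    x = layers.GlobalMaxPooling1D()(x)
    branches.append(x)

merged = layers.concatenate(branches)
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(1, activation="sigmoid")(merged)  # toxicity probability

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```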