|
--- |
|
license: apache-2.0 |
|
base_model: sentence-transformers/LaBSE |
|
tags: |
|
- generated_from_trainer |
|
- news |
|
- russian |
|
- media |
|
- text-classification |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- precision |
|
- recall |
|
model-index: |
|
- name: frozen_news_classifier_ft |
|
results: [] |
|
datasets: |
|
- data-silence/rus_news_classifier |
|
pipeline_tag: text-classification |
|
language: |
|
- ru |
|
library_name: transformers |
|
--- |
|
|
|
|
|
|
# Model description |
|
|
|
This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).

The goal was to build a universal classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.

The model can also classify news articles in the other languages supported by LaBSE, although the classification quality will be lower than for Russian-language texts.

The training dataset is a well-balanced sample of news from the last five years.
|
|
|
It achieves the following results on the evaluation set (a sketch of a matching metrics function follows the list):
|
- Loss: 0.7314 |
|
- Accuracy: 0.7793 |
|
- F1: 0.7753 |
|
- Precision: 0.7785 |
|
- Recall: 0.7793 |
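
The F1, precision, and recall values are consistent with weighted averaging over the 11 classes (weighted recall equals accuracy, which matches the numbers above). A minimal sketch of a `compute_metrics` function that would report these metrics, assuming a standard `Trainer` setup rather than the author's exact code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Computes accuracy and weighted-average F1/precision/recall (assumed setup)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```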
|
|
|
## How to use |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Switch the model to evaluation mode and move it to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Returns embeddings for a list of texts."""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        # Run only the base LaBSE encoder to get pooled sentence embeddings
        outputs = universal_model.base_model(**inputs)
        embeddings = outputs.pooler_output
        embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predicts the category for one or more news texts."""
    # Tokenize with padding and truncation enabled; move tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Get the model logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits
    # Get the indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map label indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
```
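
Continuing from the snippet above, a short usage example (the sample texts and the expected outputs are illustrative assumptions):

```python
news = [
    "Центробанк повысил ключевую ставку",                # Russian, economy-style headline
    "The new exhibition opened at the city art museum",  # English input also works, with lower quality
]
print(predict_category(news))  # e.g. ['economy', 'culture']

# The embeddings stay in LaBSE's shared multilingual vector space, so translations
# land close together; for normalized vectors, cosine similarity is a dot product.
emb = torch.tensor(create_sentence_or_batch_embeddings([
    "Сборная выиграла чемпионат мира по хоккею",
    "The national team won the ice hockey world championship",
]))
print((emb[0] @ emb[1]).item())  # expected to be high for translation pairs
```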
|
|
|
|
|
|
|
## Intended uses & limitations |
|
|
|
Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is built solely for news classification, this model shows noticeably worse metrics; the trade-off is that it keeps LaBSE's multilingual embedding space intact.
|
|
|
|
|
## Training procedure

### Training hyperparameters
|
|
|
The following hyperparameters were used during training; a hedged sketch of how they might map to `TrainingArguments` follows the list:
|
- learning_rate: 1e-05 |
|
- train_batch_size: 16 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 500 |
|
- num_epochs: 10 |
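
A sketch of how these settings could be expressed with `transformers.TrainingArguments` (the output directory and evaluation strategy are assumptions; the original training script is not shown here):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration listed above; the Trainer's
# default AdamW optimizer already uses betas=(0.9, 0.999) and epsilon=1e-08.
training_args = TrainingArguments(
    output_dir="frozen_news_classifier_ft",  # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10,
    eval_strategy="epoch",  # assumed; the results table reports per-epoch metrics
)
```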
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.42.4 |
|
- Pytorch 2.4.0+cu121 |
|
- Datasets 2.21.0 |
|
- Tokenizers 0.19.1 |
|
|