---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: frozen_news_classifier_ft
  results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
library_name: transformers
---


# Model description

This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE), trained on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
The goal was a universal classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.
The model can also classify news articles in the other languages supported by LaBSE, but classification quality will be lower than for Russian-language texts.
The training dataset is a well-balanced sample of news from the last five years.

It achieves the following results on the evaluation set:
- Loss: 0.7314
- Accuracy: 0.7793
- F1: 0.7753
- Precision: 0.7785
- Recall: 0.7793
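
The card does not say how precision, recall, and F1 are averaged over the 11 classes. Recall equals accuracy in every reported row, which is consistent with support-weighted averaging, so the sketch below assumes that mode. The function name `weighted_metrics` and the toy labels are hypothetical and not taken from the training code.

```python
from collections import Counter


def weighted_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging mode)."""
    total = len(y_true)
    support = Counter(y_true)    # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / total  # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(hits.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (hypothetical, not from the evaluation set)
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
scores = weighted_metrics(y_true, y_pred)
```

With support weighting, the weighted recall is the sum of per-class correct predictions divided by the total number of examples, which is exactly the accuracy.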

## How to use

```python

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Switch the model to evaluation mode and move it to the target device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Return L2-normalized embeddings for a list of texts."""
    # Tokenize the input texts and move the tensors to the model's device
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
    embeddings = outputs.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predict the category of one or more news texts."""

    # Tokenize with padding and truncation, then move the tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Run the model to get the logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits

    # Indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map label indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories

```
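
Since `create_sentence_or_batch_embeddings` returns L2-normalized vectors, the cosine similarity between two texts reduces to a plain dot product. The following is a minimal sketch of that comparison step; `cosine_similarity_matrix` is a hypothetical helper, and the two-dimensional vectors are stand-ins so the example runs without downloading the model.

```python
def cosine_similarity_matrix(embeddings: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine similarity for vectors that are already L2-normalized."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in embeddings]
        for u in embeddings
    ]


# Hypothetical stand-ins for the output of create_sentence_or_batch_embeddings;
# real embeddings have many more dimensions.
toy_embeddings = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
]
similarities = cosine_similarity_matrix(toy_embeddings)
```

With the real model, the same helper could be applied to `create_sentence_or_batch_embeddings(texts)` to compare news items, including items in different languages.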



## Intended uses & limitations

Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is designed specifically for news classification, this model shows noticeably worse classification metrics.


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
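
To make the scheduler settings concrete, here is a small sketch of the learning rate implied by a linear schedule with 500 warmup steps. It assumes the usual shape of such a schedule (linear ramp from zero to the base rate, then linear decay to zero) and a total of 35,960 optimizer steps, derived from the 3,596 steps per epoch in the training results table in this card times `num_epochs: 10`. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 35_960,  # assumption: 3596 steps per epoch * 10 epochs
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


lr_mid_warmup = linear_schedule_lr(250)    # halfway through warmup
lr_peak = linear_schedule_lr(500)          # end of warmup
lr_epoch_4 = linear_schedule_lr(14_384)    # step count reported for epoch 4
```

Under these assumptions the rate at the end of epoch 4 is still roughly 60% of its peak. The results table lists only four epochs; the card does not state whether training stopped before the schedule finished.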

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |


### Framework versions

- Transformers 4.42.4
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1