---
license: cc-by-nc-4.0
language:
- hu
metrics:
- accuracy
- f1
model-index:
- name: Hun_RoBERTa_large_Plain
  results:
  - task:
      type: text-classification
    metrics:
      - type: accuracy
        value: 0.79
      - type: f1
        value: 0.79
widget:
- text: "A tanúsítvány meghatározott adatainak a 2008/118/EK irányelv IV. fejezete szerinti szállításához szükséges adminisztratív okmányban..."
  example_title: "Incomprehensible"
- text: "Az AEO-engedély birtokosainak listáján – keresésre – megjelenő információk: az engedélyes neve, az engedélyt kibocsátó ország..."
  example_title: "Comprehensible"

---

## Model description

Cased, fine-tuned XLM-RoBERTa-large model for Hungarian, trained on a dataset of ~13k sentences provided by the National Tax and Customs Administration of Hungary (NAV) as part of its Public Accessibility Programme.

## Intended uses & limitations

The model classifies sentences as either "comprehensible" or "not comprehensible" according to Plain Language guidelines, as illustrated in the sketch after this list:
* **Label_0** - "comprehensible" - The sentence is in Plain Language.
* **Label_1** - "not comprehensible" - The sentence is **not** in Plain Language.
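
A minimal sketch of this label mapping in practice, using the `pipeline` API and the "Comprehensible" widget example from the metadata above. The returned label strings are assumed here to be the Transformers defaults `LABEL_0`/`LABEL_1`; the exact names depend on the `id2label` mapping in the model config.

```py
from transformers import pipeline

# Wrap the model in a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="uvegesistvan/Hun_RoBERTa_large_Plain",
)

# "Comprehensible" widget example from the model card metadata
sentence = (
    "Az AEO-engedély birtokosainak listáján – keresésre – megjelenő "
    "információk: az engedélyes neve, az engedélyt kibocsátó ország..."
)

print(classifier(sentence))
# e.g. [{'label': 'LABEL_0', 'score': ...}] -> LABEL_0 = comprehensible
```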

## Training

Fine-tuned version of the original `xlm-roberta-large` model, trained on a dataset of Hungarian legal and administrative texts.
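
The exact training setup (splits, hyperparameters) is not published in this card. Purely as an illustration, a fine-tuning run of this kind could be set up with the `Trainer` API roughly as follows; the file names and all hyperparameter values below are assumptions, not the ones actually used:

```py
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Hypothetical CSV files with "text" and "label" (0/1) columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Binary classification head on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2
)

# Illustrative hyperparameters only
args = TrainingArguments(
    output_dir="hun_roberta_large_plain",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```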

## Eval results

| Class | Precision | Recall | F-Score |
| ----- | --------- | ------ | ------- |
| **Comprehensible / Label_0** | **0.74** | **0.65** | **0.70** |
| **Not comprehensible / Label_1** | **0.71** | **0.79** | **0.74** |
| **accuracy** | | | **0.72** |
| **macro avg** | **0.73** | **0.72** | **0.72** |
| **weighted avg** | **0.72** | **0.72** | **0.72** |
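
The table follows scikit-learn's `classification_report` layout. Given gold labels and model predictions on a held-out set, an equivalent report could be produced like this (the label arrays below are placeholders, not the actual evaluation data):

```py
from sklearn.metrics import classification_report

# Placeholder values; in practice these come from the held-out set and the model
# 0 = comprehensible (Label_0), 1 = not comprehensible (Label_1)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(classification_report(
    y_true,
    y_pred,
    target_names=["Comprehensible / Label_0", "Not comprehensible / Label_1"],
))
```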

## Usage

```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("uvegesistvan/Hun_RoBERTa_large_Plain")
model = AutoModelForSequenceClassification.from_pretrained("uvegesistvan/Hun_RoBERTa_large_Plain")
```
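
A sketch of direct inference with the loaded objects, converting the logits into probabilities over the two labels; the input sentence is the "Incomprehensible" widget example from the metadata above:

```py
import torch

# "Incomprehensible" widget example from the model card metadata
sentence = (
    "A tanúsítvány meghatározott adatainak a 2008/118/EK irányelv IV. "
    "fejezete szerinti szállításához szükséges adminisztratív okmányban..."
)

inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
pred = int(probs.argmax())
# 0 = Label_0 = comprehensible, 1 = Label_1 = not comprehensible
print(pred, probs.tolist())
```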

## Citation

BibTeX:
```bibtex
@PhDThesis{Uveges:2024,
  author = {{\"U}veges, Istv{\'a}n},
  title  = {K{\"o}z{\'e}rthet{\H{o}} {\'e}s automatiz{\'a}ci{\'o} - k{\'i}s{\'e}rletek a jog, term{\'e}szetesnyelv-feldolgoz{\'a}s {\'e}s informatika hat{\'a}r{\'a}n.},
  year   = {2024},
  school = {Szegedi Tudom{\'a}nyegyetem}
}
```