---
language: el
tags:
- legal
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: Ο Δικηγόρος κατέθεσε ένα <mask>.
---

# GreekLegalRoBERTa_v3

A Greek legal version of the RoBERTa pre-trained language model.

## Pre-training corpora

The pre-training corpora of `GreekLegalRoBERTa_v3` include:

* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr).
* The Greek Parliament Proceedings [Greekparl](https://proceedings.neurips.cc/paper_files/paper/2022/file/b96ce67b2f2d45e4ab315e13a6b5b9c5-Paper-Datasets_and_Benchmarks.pdf).
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων).
* The Greek part of the [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/).
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
* The [Raptarchis](https://raptarchis.gov.gr/) corpus.

## Pre-training details

* We developed the code using [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) library. Our code is published in the [AI-team-UoA GitHub repository](https://github.com/AI-team-UoA/GreekLegalRoBERTa).
* We released a model similar to the English `FacebookAI/roberta-base` model (12-layer, 768-hidden, 12-heads, 125M parameters) for Greek legislative applications.
* We trained for 100k steps with a batch size of 4,096 sequences of length 512 and an initial learning rate of 6e-4.
* We pre-trained our models on 4 V100 GPUs provided by the [Cyprus Research Institute](https://www.cyi.ac.cy/index.php/research/research-centers.html). We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone; without their support, this work would not have been possible.

## Requirements

```
pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets
```

## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
```
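Note that `AutoModel` loads the encoder without a language-modelling head, so its output is a sequence of contextual token embeddings rather than vocabulary predictions. Below is a minimal sketch of extracting these embeddings; the example sentence is our own illustration and not part of the original card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

# Illustrative sentence (our own example): "The lawyer submitted a memorandum."
text = "Ο Δικηγόρος κατέθεσε ένα υπόμνημα."

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```

For masked-token prediction, use the fill-mask pipeline shown in the next section.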
## Use Pretrained Model as a Language Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model and tokenizer and build a fill-mask pipeline.
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline('fill-mask', model=lm_model_greek, tokenizer=tokenizer_greek)

# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = 'O Δικηγορος κατεθεσε ένα <mask>.'  # EN: 'The lawyer submitted a <mask>.'
for i, prediction in enumerate(unmasker(text_1, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 1 ================
# Model's answer 1 : letter
# Model's answer 2 : copy
# Model's answer 3 : record
# Model's answer 4 : memorandum
# Model's answer 5 : diagram

# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")
text_2 = 'Είναι ένας <mask> άνθρωπος.'  # EN: 'He is a <mask> person.'
for i, prediction in enumerate(unmasker(text_2, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 2 ================
# Model's answer 1 : new
# Model's answer 2 : capable
# Model's answer 3 : simple
# Model's answer 4 : serious
# Model's answer 5 : small

# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")
# With two mask tokens, the pipeline returns one list of predictions per mask.
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'  # EN: 'He is a <mask> person and he frequently does <mask>.'
predictions = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[0][i]['token_str'] + " , " + predictions[1][i]['token_str'])
# ================ EXAMPLE 3 ================
# Model's answer 1 : simple, trips
# Model's answer 2 : new, vacations
# Model's answer 3 : small, visits
# Model's answer 4 : good, mistakes
# Model's answer 5 : serious, actions
# The most plausible prediction for the second mask is "trips".

# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")
text_4 = 'Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask>.'
# EN: 'Determining how to evaluate the diligence of employees attending programs of further training and <mask>.'
for i, prediction in enumerate(unmasker(text_4, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 4 ================
# Model's answer 1 : retraining
# Model's answer 2 : specialization
# Model's answer 3 : training
# Model's answer 4 : education
# Model's answer 5 : Retraining
```

## Evaluation on downstream tasks

For detailed results, read the article: TODO

## Author