---
language: el
tags:
- legal
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: Ο Δικηγόρος κατέθεσε ένα <mask>.
---

# GreekLegalRoBERTa_v3

A Greek legal version of the RoBERTa pre-trained language model.

## Pre-training corpora

The pre-training corpora of `GreekLegalRoBERTa_v3` include:

* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr).
* The Greek Parliament Proceedings [Greekparl](https://proceedings.neurips.cc/paper_files/paper/2022/file/b96ce67b2f2d45e4ab315e13a6b5b9c5-Paper-Datasets_and_Benchmarks.pdf).
* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων).
* The Greek part of the [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/).
* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
* The [Raptarchis](https://raptarchis.gov.gr/) corpus.

## Pre-training details

* We developed the code using [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) library. Our code is published in the [AI-team-UoA GitHub repository](https://github.com/AI-team-UoA/GreekLegalRoBERTa).
* We released a model similar to the English `FacebookAI/roberta-base` model (12-layer, 768-hidden, 12-heads, 125M parameters) for Greek legislative applications.
* We trained for 100k steps with a batch size of 4,096 sequences of length 512 and an initial learning rate of 6e-4.
* We pre-trained our models on 4 V100 GPUs provided by the [Cyprus Research Institute](https://www.cyi.ac.cy/index.php/research/research-centers.html). We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone; without their support, this work would not have been possible.

## Requirements

```
pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets
```

## Load Pretrained Model

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
```
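Note that `AutoModel` loads the encoder without a language-modelling head, so its output is a sequence of contextual token embeddings rather than vocabulary predictions. Below is a minimal sketch of extracting these embeddings; the example sentence is our own illustration and not part of the original card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

# Illustrative sentence (our own example): "The lawyer submitted a memorandum."
text = "Ο Δικηγόρος κατέθεσε ένα υπόμνημα."

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```

For masked-token prediction, use the fill-mask pipeline shown in the next section.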
## Use Pretrained Model as a Language Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model and tokenizer and build a fill-mask pipeline.
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline('fill-mask', model=lm_model_greek, tokenizer=tokenizer_greek)

# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = 'O Δικηγορος κατεθεσε ένα <mask>.'  # EN: 'The lawyer submitted a <mask>.'
for i, prediction in enumerate(unmasker(text_1, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 1 ================
# Model's answer 1 : letter
# Model's answer 2 : copy
# Model's answer 3 : record
# Model's answer 4 : memorandum
# Model's answer 5 : diagram

# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")
text_2 = 'Είναι ένας <mask> άνθρωπος.'  # EN: 'He is a <mask> person.'
for i, prediction in enumerate(unmasker(text_2, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 2 ================
# Model's answer 1 : new
# Model's answer 2 : capable
# Model's answer 3 : simple
# Model's answer 4 : serious
# Model's answer 5 : small

# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")
# With two mask tokens, the pipeline returns one list of predictions per mask.
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'  # EN: 'He is a <mask> person and he frequently does <mask>.'
predictions = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + predictions[0][i]['token_str'] + " , " + predictions[1][i]['token_str'])
# ================ EXAMPLE 3 ================
# Model's answer 1 : simple, trips
# Model's answer 2 : new, vacations
# Model's answer 3 : small, visits
# Model's answer 4 : good, mistakes
# Model's answer 5 : serious, actions
# The most plausible prediction for the second mask is "trips".

# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")
text_4 = 'Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask>.'
# EN: 'Determining how to evaluate the diligence of employees attending programs of further training and <mask>.'
for i, prediction in enumerate(unmasker(text_4, top_k=5)):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
# ================ EXAMPLE 4 ================
# Model's answer 1 : retraining
# Model's answer 2 : specialization
# Model's answer 3 : training
# Model's answer 4 : education
# Model's answer 5 : Retraining
```

## Evaluation on downstream tasks

For detailed results, read the article: TODO

## Author