# Legal-HeBERT
Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first is a fine-tuned version of [HeBERT](https://github.com/avichaychriqui/HeBERT), trained on legal and legislative documents. The second uses [HeBERT](https://github.com/avichaychriqui/HeBERT)'s architecture guidelines to train a BERT model from scratch. <br>
We continue to collect legal data, examine different architectural designs, and build tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

## Training Data
Our training datasets are:

| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes |
|------|--------------------|-----------|-----------|-----------|-------|-------|
| The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11,283 | 162,692 | aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83,724 | 1,743,419 | aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32,724 | 813,877 | |
| Legal-Opinion of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7,132 | 188,053 | |
| | | | | | | |
| Total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |

We thank <b>Yair Gardin</b> for referring us to the governance data, <b>Elhanan Schwarts</b> for collecting and parsing the Israeli Law Book, and <b>Jonathan Schler</b> for collecting the judgments of the Supreme Court.


## Training process
* Vocabulary size: 50,000 tokens
* 4 epochs (approximately 1M steps)
* lr=5e-5
* mlm_probability=0.15
* batch size = 32 (per GPU)
* NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
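
To make the `mlm_probability=0.15` setting concrete, here is a minimal pure-Python sketch of masked-language-model input corruption. It assumes the standard BERT recipe (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged), which this page does not explicitly confirm. The `mask_id` and `special_ids` values are hypothetical placeholders, not the real Legal-HeBERT token ids, and `-100` is the label value conventionally ignored by the loss.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, special_ids=frozenset(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) following the standard BERT MLM recipe (assumed)."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; unselected positions get label -100.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

# Hypothetical ids: 2 and 3 stand in for [CLS]/[SEP], 4 for [MASK].
ids = [2] + list(range(100, 120)) + [3]
inputs, labels = mask_tokens(ids, vocab_size=50_000, mask_id=4,
                             special_ids={2, 3}, rng=random.Random(0))
```

Positions whose label is `-100` are left untouched in `inputs`, so only about 15% of the non-special tokens contribute to the training loss.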

### Additional training settings: 
<b>Fine-tuned [HeBERT](https://github.com/avichaychriqui/HeBERT) model:</b> The first eight layers were frozen (as suggested by [Lee et al. (2019)](https://arxiv.org/abs/1911.03090))<br>
<b>Legal-HeBERT trained from scratch:</b> The training process is similar to that of [HeBERT](https://github.com/avichaychriqui/HeBERT) and was inspired by [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559) <br>
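
One way to freeze the first eight layers is to select parameters by name. The sketch below is not the authors' training code. It assumes that "layers" means the transformer encoder blocks, which Hugging Face BERT models name `encoder.layer.<index>.` with zero-based indices. The helper `should_freeze` and its `freeze_embeddings` option are hypothetical, since this page does not say whether the embeddings were frozen as well.

```python
import re

# Matches names such as "bert.encoder.layer.3.attention.self.query.weight".
_LAYER_RE = re.compile(r"(?:^|\.)encoder\.layer\.(\d+)\.")

def should_freeze(param_name, n_frozen_layers=8, freeze_embeddings=False):
    """Decide from a parameter name whether it belongs to a frozen block."""
    match = _LAYER_RE.search(param_name)
    if match:
        # Layers 0..n_frozen_layers-1 are frozen; later layers stay trainable.
        return int(match.group(1)) < n_frozen_layers
    # Assumption: embeddings stay trainable unless explicitly requested.
    if freeze_embeddings and re.search(r"(?:^|\.)embeddings\.", param_name):
        return True
    return False
```

With a loaded PyTorch model, this could be applied as `for name, param in model.named_parameters(): param.requires_grad = not should_freeze(name)`.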

## How to use
The models are available on the Hugging Face Hub and can be fine-tuned for any downstream task:
```
# !pip install transformers==4.14.1
from transformers import AutoTokenizer, AutoModel, pipeline

# Choose one of the two released checkpoints:
# model_name = 'avichr/Legal-heBERT_ft'  # the fine-tuned HeBERT model
model_name = 'avichr/Legal-heBERT'       # Legal-HeBERT trained from scratch

# Load the tokenizer and encoder, e.g. as a starting point for fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Masked-token prediction with the same checkpoint
fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
# Input: "The coronavirus took [MASK] and we were left with nothing."
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
```
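
For a single-mask input, the `fill-mask` pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. The following sketch shows one way to rank such results. The helper `top_predictions` is hypothetical, and `example_results` is a hand-written stand-in with placeholder tokens and scores, not real model output.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]

# Placeholder data shaped like pipeline output; values are illustrative only.
example_results = [
    {"score": 0.3, "token": 11, "token_str": "tok_b", "sequence": "..."},
    {"score": 0.6, "token": 10, "token_str": "tok_a", "sequence": "..."},
    {"score": 0.1, "token": 12, "token_str": "tok_c", "sequence": "..."},
]

best = top_predictions(example_results, k=2)
```

In practice, `results` would be the return value of the `fill_mask(...)` call.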
## Stay tuned!
We are still working on our models and datasets and will update this page as we progress. We are open to collaborations.

## If you used this model, please cite us as:
Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127

```
@article{chriqui2021hebert,
  title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
  author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
  journal={SSRN preprint:4147127},
  year={2022}
}
```

## Contact us
[Avichay Chriqui](mailto:[email protected]), The Coller AI Lab <br>
[Inbal Yahav](mailto:[email protected]), The Coller AI Lab <br>
[Ittai Bar-Siman-Tov](mailto:[email protected]), the BIU Innovation Lab for Law, Data-Science and Digital Ethics <br>

Thank you, תודה, شكرا <br>