---
license: cc-by-4.0
language:
- he
inference: false
---
# DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

State-of-the-art language model for Hebrew, released [here](https://arxiv.org/abs/2403.06970).

This is the fine-tuned model for the lemmatization task.  

For the bert-base models for other tasks, see [here](https://huggingface.co/collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).

## General guidelines for how the lemmatizer works:

Given an input text in Hebrew, the model attempts to match each word with the correct lexeme from within the BERT vocabulary.

- If a word is split into multiple wordpieces, that doesn't cause a problem: the lexeme is still predicted with high accuracy.

- If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict the special token `[BLANK]`. In that case, the word is usually the name of a person or a city, and the lexeme is most likely the word with its prefixes removed, which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.

- For verbs, the lexeme is the 3rd person singular past form.
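
The `[BLANK]` case above can be handled in a small post-processing step. The following is a minimal sketch, assuming the prediction is a list of `(word, lexeme)` pairs in which an out-of-vocabulary lexeme shows up as the literal string `[BLANK]`. The helper `resolve_blanks` and the `fallback` callable are hypothetical names introduced here for illustration; in practice the fallback would wrap a prefix-removal step such as the dictabert-seg tool.

```python
from typing import Callable, Iterable, List, Tuple

BLANK = "[BLANK]"  # special token named in the guidelines above


def resolve_blanks(
    pairs: Iterable[Tuple[str, str]],
    fallback: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Replace every [BLANK] lexeme with fallback(word).

    `fallback` is a hypothetical hook: any function that maps a surface
    word to a best-guess lexeme (for example, by stripping prefixes).
    """
    resolved = []
    for word, lexeme in pairs:
        if lexeme == BLANK:
            lexeme = fallback(word)
        resolved.append((word, lexeme))
    return resolved


# Toy fallback for illustration only: drop a single leading prefix letter.
# A real pipeline would use a proper segmenter instead.
def drop_first_letter(word: str) -> str:
    return word[1:] if len(word) > 1 else word


# Assumed shape of one sentence's predictions.
pairs = [("בשנת", "שנה"), ("לקישון", BLANK)]
resolved = resolve_blanks(pairs, drop_first_letter)
print(resolved)
```

Keeping the fallback as a parameter means the segmentation strategy can be swapped without touching the post-processing loop.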

This method is purely neural, so in rare instances the predicted lexeme may not be lexically related to the input word, but rather a synonym from the same semantic space. To handle those edge cases, one can implement a filter on top of the prediction that looks at the top K matches and uses a specific set of measures, such as edit distance, to choose the candidate that can most reasonably serve as a lexeme for the input word.
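
One possible shape for such a filter is sketched below. It assumes the top K candidate lexemes for a word have already been obtained, ordered from best to worst model score; how to extract them is not covered here. The function names and the `max_ratio` threshold are illustrative assumptions, not part of the model's API.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_lexeme(word: str, candidates: Sequence[str], max_ratio: float = 0.5) -> str:
    """Return the highest-ranked candidate that is close enough to `word`.

    `candidates` is assumed to be sorted by model score, best first.
    `max_ratio` is an arbitrary threshold (an assumption): a candidate is
    accepted when its edit distance is at most max_ratio times the longer
    of the two strings. If nothing qualifies, the top candidate is kept.
    """
    if not candidates:
        return word
    for cand in candidates:
        if edit_distance(word, cand) <= max_ratio * max(len(word), len(cand)):
            return cand
    return candidates[0]


# Hypothetical top-3 list in which a synonym outranks the true lexeme.
choice = pick_lexeme("לימודיו", ["שיעור", "לימוד", "למידה"])
print(choice)
```

Walking the list in score order keeps the model's ranking as the primary signal and uses edit distance only to veto candidates that are too far from the surface form.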

Sample usage:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)

model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer))
```

Output:
```json
[
  [
    [
      "בשנת",
      "שנה"
    ],
    [
      "1948",
      "1948"
    ],
    [
      "השלים",
      "השלים"
    ],
    [
      "אפרים",
      "אפרים"
    ],
    [
      "קישון",
      "קישון"
    ],
    [
      "את",
      "את"
    ],
    [
      "לימודיו",
      "לימוד"
    ],
    [
      "בפיסול",
      "פיסול"
    ],
    [
      "מתכת",
      "מתכת"
    ],
    [
      "ובתולדות",
      "תולדה"
    ],
    [
      "האמנות",
      "אומנות"
    ],
    [
      "והחל",
      "החל"
    ],
    [
      "לפרסם",
      "פרסם"
    ],
    [
      "מאמרים",
      "מאמר"
    ],
    [
      "הומוריסטיים",
      "הומוריסטי"
    ]
  ]
]
```


## Citation

If you use DictaBERT-lex in your research, please cite "MRL Parsing Without Tears: The Case of Hebrew".

**BibTeX:**

```bibtex
@misc{shmidman2024mrl,
      title={MRL Parsing Without Tears: The Case of Hebrew}, 
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
      year={2024},
      eprint={2403.06970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg