license: cc-by-4.0
language:
  - he

OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature

OtoBERT is a new Hebrew language model designed specifically for identifying suffixed verbal forms in Modern Hebrew, released here.

This is the base model pretrained with the masked-language-modeling objective.

This model was trained with a special tokenizer that merges a verb and its following object pronoun into a single unit (e.g., ראיתי אותו, "I saw him", becomes the single unit ראיתי_אותו). The model was also trained to predict these merged units during masked-token prediction. For more details, please check out the paper listed on this page.
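
To make the merging idea concrete, here is a toy sketch in plain Python. It is not the OtoBERT tokenizer: the `OBJECT_PRONOUNS` set and the `merge_object_pronouns` helper are hypothetical names introduced for illustration, and the underscore join simply mirrors the predicted token shown in the sample usage on this page. The sketch also merges any word that is followed by an object pronoun without checking that the word is a verb, and it ignores punctuation attached to the pronoun.

```python
# Toy illustration only: NOT the actual OtoBERT tokenizer.
# Inflected forms of the Hebrew direct-object marker
# (me, you, him, her, us, you pl. masc., you pl. fem., them masc., them fem.).
OBJECT_PRONOUNS = {
    "אותי", "אותך", "אותו", "אותה", "אותנו", "אתכם", "אתכן", "אותם", "אותן",
}


def merge_object_pronouns(sentence: str) -> list[str]:
    """Split on whitespace and join each object pronoun to the word before it."""
    units: list[str] = []
    for word in sentence.split():
        if word in OBJECT_PRONOUNS and units:
            # Assumption: merged units are written with an underscore.
            units[-1] = units[-1] + "_" + word
        else:
            units.append(word)
    return units


print(merge_object_pronouns("ראיתי אותו אתמול"))
```

A real implementation would need part-of-speech information to restrict merging to verbs; the paper describes the approach actually used.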

Sample usage:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/otobert')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/otobert')

model.eval()

sentence = 'אני לא יכול להגיד לך מתי [MASK] לאחרונה.' # Supposed to be ראיתי אותו

input_ids = tokenizer.encode(sentence, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    output = model(input_ids)
# locate the [MASK] token rather than hard-coding its position
# (in this sentence it is at index 7, counting [CLS] as index 0)
mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
top_2 = torch.topk(output.logits[0, mask_index, :], 2)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2.tolist()))) # should print נפגשנו / ראיתי_אותו
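
Because predictions can be merged units such as `ראיתי_אותו`, it may help to turn them back into ordinary text before displaying them. The sketch below assumes the underscore is the only joining marker, as the predicted token in the sample suggests. `fill_mask` is a hypothetical helper written for this example, not part of `transformers` or of this repository.

```python
def fill_mask(sentence: str, predicted_token: str, mask: str = "[MASK]") -> str:
    """Replace the first mask with the prediction, expanding merged units to words."""
    # Assumption: "_" joins a verb and its object pronoun inside one vocabulary item.
    words = predicted_token.replace("_", " ")
    return sentence.replace(mask, words, 1)


sentence = "אני לא יכול להגיד לך מתי [MASK] לאחרונה."
for candidate in ["נפגשנו", "ראיתי_אותו"]:
    print(fill_mask(sentence, candidate))
```

The candidate strings here are the two predictions named in the sample above; in practice they would come from `tokenizer.convert_ids_to_tokens`.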

Citation

If you use OtoBERT in your research, please cite the paper "OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature".

BibTeX:

tbd

License

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.
