bert-base-sudachitra-v11

This model is a variant of SudachiTra. The differences between the original chiTra v1.1 and bert-base-sudachitra-v11 are:

word_form_type was changed from normalized_nouns to surface
Two consecutive empty lines in vocab.txt were replaced with a dummy entry followed by an empty line
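The vocab.txt change can be illustrated with a minimal sketch. It is not the script actually used to produce this model: the function name `patch_vocab_lines` and the placeholder token `[DUMMY]` are hypothetical, since the actual dummy entry is not named here. The sketch assumes the vocabulary is handled as a list of lines, one token per line.

```python
def patch_vocab_lines(lines, dummy="[DUMMY]"):
    """Replace each pair of consecutive empty lines with `dummy` plus one empty line.

    `dummy` is a hypothetical placeholder; the real entry in vocab.txt may differ.
    """
    patched = []
    i = 0
    while i < len(lines):
        # Two empty lines in a row: keep the second, replace the first.
        if lines[i] == "" and i + 1 < len(lines) and lines[i + 1] == "":
            patched.extend([dummy, ""])
            i += 2
        else:
            patched.append(lines[i])
            i += 1
    return patched


vocab = ["[PAD]", "[UNK]", "", "", "の"]
print(patch_vocab_lines(vocab))
```

Because the number of lines stays the same, every other token keeps its line position in this sketch, so its index in the vocabulary is unchanged.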

The description from the original README.md is reproduced below.

(See GitHub - WorksApplications/SudachiTra for the latest README)

Sudachi Transformers (chiTra)

chiTra provides pre-trained language models and a Japanese tokenizer for Transformers.

chiTra pretrained language model

We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.

NWJC was cleaned to remove unnecessary sentences before it was used.

The BERT model was pretrained using a pretraining script implemented by NVIDIA.

License

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.

Citation

@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = "言語処理学会第28回年次大会(NLP2022)",
    year      = "2022",
    pages     = "",
    publisher = "言語処理学会",
}