lighttransport/japanese-tokenizer-cc100


A tokenizer trained on a Japanese dataset.

It is not intended for standalone use; it is meant to be merged into another tokenizer, such as the LLaMa Tokenizer.
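As a rough illustration of what merging can mean at the vocabulary level, the sketch below appends tokens that are missing from a base vocabulary and assigns them fresh IDs. The function name `merge_vocab` and the toy vocabularies are hypothetical, and this is not the author's merge procedure: a real merge into the LLaMa Tokenizer would also have to carry over model-specific data such as merge rules or token scores.

```python
def merge_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Return a copy of base_vocab extended with tokens it does not contain yet.

    Hypothetical helper for illustration only; existing IDs are left unchanged.
    """
    merged = dict(base_vocab)
    # New tokens get IDs that continue after the largest existing ID.
    next_id = max(merged.values(), default=-1) + 1
    for token in new_tokens:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy data, not taken from any real tokenizer.
base = {"<unk>": 0, "a": 1, "日": 2}
merged = merge_vocab(base, ["日", "本", "語"])
print(merged)
```

Keeping the base IDs fixed matters because the embedding rows of an already trained model are indexed by those IDs; only the appended tokens would need new embeddings.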

Training script

See train_jp_tokenizer.py.

Trained tokenizer

tokenizer-cc100-ja.json: trained on the cc100 ja dataset as-is (without normalization or other preprocessing). Vocab size: 30000.
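One way to sanity-check the file is to count its vocabulary entries with the standard library. This is a minimal sketch that assumes the file follows the Hugging Face `tokenizers` JSON layout, with the vocabulary stored under `model.vocab`; the model card does not confirm this, and the helper names `count_vocab` and `load_vocab_size` are made up for the example.

```python
import json


def count_vocab(tokenizer_config: dict) -> int:
    """Count vocabulary entries in a parsed tokenizer JSON document.

    Assumes the vocabulary lives under config["model"]["vocab"], either as a
    token -> id mapping or as a list of [token, score] pairs.
    """
    return len(tokenizer_config["model"]["vocab"])


def load_vocab_size(path: str) -> int:
    """Read a tokenizer JSON file from disk and count its vocabulary entries."""
    with open(path, encoding="utf-8") as f:
        return count_vocab(json.load(f))
```

With the file downloaded locally, `load_vocab_size("tokenizer-cc100-ja.json")` can be compared against the stated vocab size of 30000.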

TODO

Train on normalized Japanese text
Upload the merged tokenizer
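The model card does not say which normalization is planned. As one common possibility for Japanese text, the sketch below applies Unicode NFKC normalization, which folds full-width Latin letters and digits to ASCII and half-width katakana to full-width. The choice of NFKC and the helper name `normalize_text` are assumptions for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization (an assumed choice, not the author's)."""
    return unicodedata.normalize("NFKC", text)


# Full-width "ＡＢＣ１２３" is folded to ASCII "ABC123".
print(normalize_text("ＡＢＣ１２３"))
# Half-width katakana plus voiced mark "ｶﾞ" is composed into "ガ".
print(normalize_text("ｶﾞ"))
```

Whatever normalization is used for training would also have to be applied to the input at tokenization time, otherwise the tokenizer sees character forms it was not trained on.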
