Punctuator for Simplified Chinese

This model adds punctuation to plain, unpunctuated Simplified Chinese text. It is a DistilBertForTokenClassification model fine-tuned from a distilled version of bert-base-chinese.

Usage

from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
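Loading the model only yields per-token label predictions; turning them back into punctuated text is a separate step. The sketch below illustrates that step under stated assumptions: the label names are taken from the metrics report in this card, while the "O" label for "no punctuation", the full-width marks chosen for each label, and the convention that a label describes the mark following its token are all assumptions, not confirmed behavior of this model. The helper `restore_punctuation` is hypothetical. In practice the label names would likely be read from `model.config.id2label` after taking the argmax over the logits.

```python
# Assumed mapping from label names to full-width punctuation marks.
# Label names follow the metrics report; the marks themselves and the
# "O" (no punctuation) label are assumptions, not confirmed by the card.
LABEL_TO_MARK = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def restore_punctuation(chars, labels):
    """Insert the mark for each label after its character.

    Assumes one label per character and that a label describes the
    punctuation that follows that character.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    pieces = []
    for ch, label in zip(chars, labels):
        pieces.append(ch)
        # Unknown labels (including the assumed "O") add nothing.
        pieces.append(LABEL_TO_MARK.get(label, ""))
    return "".join(pieces)


example_chars = list("你好吗我很好")
example_labels = ["O", "O", "C_QUESTIONMARK", "O", "O", "C_PERIOD"]
print(restore_punctuation(example_chars, example_labels))
# prints: 你好吗？我很好。
```

The example labels here are written by hand for illustration; they are not actual model output.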

Model Overview

Training data

A combination of the following three datasets:

  • News articles from People's Daily, 2014. Reference

Model Performance

  • Validated on the MSRA training dataset. Reference
  • Metrics Report:
                        precision  recall  f1-score  support
    C_COMMA                  0.67    0.59      0.63    91566
    C_DUNHAO                 0.50    0.37      0.42    21013
    C_EXLAMATIONMARK         0.23    0.06      0.09      399
    C_PERIOD                 0.84    0.99      0.91    44258
    C_QUESTIONMARK           0.00    1.00      0.00        0

    micro avg                0.71    0.67      0.69   157236
    macro avg                0.45    0.60      0.41   157236
    weighted avg             0.69    0.67      0.68   157236
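
The macro and weighted averages in the report can be reproduced from the per-class rows, which helps when reading them. The sketch below assumes the report follows the usual convention: the macro average is the unweighted mean over all five classes, and the weighted average is the support-weighted mean. The per-class numbers are copied from the report; the helper names are illustrative.

```python
# Per-class rows from the metrics report:
# (precision, recall, f1-score, support)
rows = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}

SUPPORT = 3  # index of the support column


def macro_avg(rows, col):
    """Unweighted mean of one metric column over all classes."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric column weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


macro = [macro_avg(rows, c) for c in range(3)]
weighted = [weighted_avg(rows, c) for c in range(3)]

print("macro avg    " + " ".join(f"{v:.2f}" for v in macro))
# prints: macro avg    0.45 0.60 0.41
print("weighted avg " + " ".join(f"{v:.2f}" for v in weighted))
# prints: weighted avg 0.69 0.67 0.68
```

Under this reading, C_QUESTIONMARK has zero support in the validation set, so its recall of 1.00 raises the macro-average recall to 0.60 but has no effect on the weighted average. The weighted row is therefore the more representative summary here.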