# Punctuator for Simplified Chinese

This model is a `DistilBertForTokenClassification` model fine-tuned to add punctuation to plain Simplified Chinese text. It is fine-tuned from a distilled version of `bert-base-chinese`.

## Usage

```python
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
```
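
Loading the model only gives per-token label predictions, so turning them into punctuated text takes one more step. The following is a minimal sketch, not the author's reference pipeline. It assumes that the label names match the rows of the metrics table, that unpunctuated positions carry some other label such as `O`, and that each label marks the punctuation placed after its character. The helpers `insert_punctuation` and `predict_labels` are hypothetical names introduced here; check `model.config.id2label` before relying on the mapping.

```python
# Assumed mapping from label names (taken from the metrics table) to marks.
# Any label not listed here, e.g. an assumed "O", inserts nothing.
LABEL_TO_MARK = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def insert_punctuation(chars, labels):
    """Append the mark for each character's label right after that character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(ch + LABEL_TO_MARK.get(lab, "") for ch, lab in zip(chars, labels))


def predict_labels(text, model, tokenizer):
    """Return one predicted label per character of `text` (hypothetical helper)."""
    import torch  # imported lazily so insert_punctuation works without torch

    chars = list(text)
    enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

    labels = ["O"] * len(chars)  # default for characters that produce no token
    seen = set()
    for pos, word_id in enumerate(enc.word_ids(0)):
        # Skip special tokens and keep only the first sub-token of each character.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels[word_id] = model.config.id2label[pred_ids[pos]]
    return labels
```

With the `model` and `tokenizer` loaded above, `insert_punctuation(list(text), predict_labels(text, model, tokenizer))` would return the punctuated string. Inputs longer than the model's maximum sequence length are truncated in this sketch, so long documents would need to be split into chunks first.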

## Model Overview

### Training data
A combination of the following three datasets:

- News articles of People's Daily 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)

### Model Performance
- Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
- Metrics Report:
    |                  | precision | recall | f1-score | support |
    |:----------------:|:---------:|:------:|:--------:|:-------:|
    |      C_COMMA     |    0.67   |  0.59  |   0.63   |  91566  |
    |     C_DUNHAO     |    0.50   |  0.37  |   0.42   |  21013  |
    | C_EXLAMATIONMARK |    0.23   |  0.06  |   0.09   |   399   |
    |     C_PERIOD     |    0.84   |  0.99  |   0.91   |  44258  |
    |  C_QUESTIONMARK  |    0.00   |  1.00  |   0.00   |    0    |
    |     micro avg    |    0.71   |  0.67  |   0.69   |  157236 |
    |     macro avg    |    0.45   |  0.60  |   0.41   |  157236 |
    |   weighted avg   |    0.69   |  0.67  |   0.68   |  157236 |
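
To make the averaged rows easier to interpret, the short sketch below recomputes the macro and weighted averages from the per-class rows of the table. The numbers are copied from the table; the function names are illustrative only, and the micro average is left out because it depends on raw counts that the table does not report.

```python
# Per-class rows copied from the metrics table: (precision, recall, f1, support).
ROWS = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}
SUPPORT = 3  # index of the support column in each row


def macro_avg(rows, col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean over classes weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_avg(ROWS, col):.2f} weighted={weighted_avg(ROWS, col):.2f}")
```

The recomputed values agree with the table's macro and weighted rows after rounding. `C_QUESTIONMARK` has zero support, so it contributes nothing to the weighted averages but counts as a full class in the macro averages. Its reported recall of 1.00 presumably reflects a zero-division convention rather than real performance, which makes the weighted averages the more informative summary here.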