# Punctuator for Simplified Chinese

This model is a `DistilBertForTokenClassification` model fine-tuned to add punctuation to plain text in Simplified Chinese. Its base is a distilled version of `bert-base-chinese`.

## Usage

```python
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Load the fine-tuned token-classification model and its matching tokenizer
# from the Hugging Face Hub.
model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
```
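
Once the model has assigned one label per token, the labels still have to be turned back into punctuated text. The following is a minimal sketch of that post-processing step only, not the author's implementation. It assumes that each label describes the punctuation that follows its token, that unpunctuated tokens carry an `O` label, and that the label names from the metrics table map to the punctuation characters shown; `LABEL_TO_PUNCT` and `restore_punctuation` are hypothetical names. In practice the labels would presumably come from taking the argmax over the model's logits and mapping the ids through `model.config.id2label`.

```python
# Assumed mapping from label names to punctuation characters.
# Label spellings follow the metrics table; "O" (no punctuation) is an assumption.
LABEL_TO_PUNCT = {
    "O": "",
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def restore_punctuation(tokens, labels):
    """Append the punctuation implied by each label after its token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pieces = []
    for token, label in zip(tokens, labels):
        pieces.append(token)
        # Unknown labels are treated as "no punctuation".
        pieces.append(LABEL_TO_PUNCT.get(label, ""))
    return "".join(pieces)


# Hand-written labels for illustration; these are not real model output.
tokens = list("你好今天天气怎么样")
labels = ["O", "C_COMMA", "O", "O", "O", "O", "O", "O", "C_QUESTIONMARK"]
print(restore_punctuation(tokens, labels))
```

Keeping this step separate from the model call makes it easy to test the label-to-text logic without downloading the checkpoint.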

## Model Overview

### Training data

Combination of the following three datasets:

- News articles from People's Daily 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)

### Model Performance

- Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
- Metrics report:

| label            | precision | recall | f1-score | support |
|:----------------:|:---------:|:------:|:--------:|:-------:|
| C_COMMA          | 0.67      | 0.59   | 0.63     | 91566   |
| C_DUNHAO         | 0.50      | 0.37   | 0.42     | 21013   |
| C_EXLAMATIONMARK | 0.23      | 0.06   | 0.09     | 399     |
| C_PERIOD         | 0.84      | 0.99   | 0.91     | 44258   |
| C_QUESTIONMARK   | 0.00      | 1.00   | 0.00     | 0       |
| micro avg        | 0.71      | 0.67   | 0.69     | 157236  |
| macro avg        | 0.45      | 0.60   | 0.41     | 157236  |
| weighted avg     | 0.69      | 0.67   | 0.68     | 157236  |
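
The macro and weighted averages can be recomputed from the per-class rows, which helps when reading the report. The sketch below copies the rounded per-class values from the table (so results agree only to two decimals) and uses the hypothetical helper names `macro_avg` and `weighted_avg`. Note that `C_QUESTIONMARK` has a support of 0: its recall is undefined, and the reported 1.00 is presumably a reporting convention. That row counts fully in the unweighted macro average but contributes nothing to the support-weighted average.

```python
# Per-class rows copied from the metrics table: (precision, recall, f1, support).
ROWS = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}
SUPPORT = 3  # index of the support column in each row


def macro_avg(rows, idx):
    """Unweighted mean of one metric over all classes."""
    return sum(row[idx] for row in rows.values()) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric weighted by each class's support."""
    total = sum(row[SUPPORT] for row in rows.values())
    return sum(row[idx] * row[SUPPORT] for row in rows.values()) / total


for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_avg(ROWS, idx):.2f} weighted={weighted_avg(ROWS, idx):.2f}")
```

Because commas and periods dominate the support counts, the weighted averages track those two classes closely, while the macro averages are pulled around by the rare exclamation and question mark classes.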