# Punctuator for Simplified Chinese
This model is a fine-tuned `DistilBertForTokenClassification` that adds punctuation to plain Simplified Chinese text. It was fine-tuned from a distilled version of `bert-base-chinese`.
## Usage
```python
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Load the fine-tuned token-classification model and its matching tokenizer from the Hub.
model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
```
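Once the model has produced one label per character, the labels still have to be turned back into punctuated text. The sketch below illustrates that last step only, under two assumptions that are not stated in this card: that each label marks the punctuation following its character, and that unpunctuated positions carry a label such as `O`. The label names are taken from the metrics table; the helper `restore_punctuation` and the mapping `LABEL_TO_MARK` are hypothetical. In practice the labels would likely come from taking the argmax over the model's logits and looking each index up in `model.config.id2label`, which should be checked for the actual label set.

```python
# Assumed mapping from the label names in the metrics table to punctuation marks.
# Any other label (for example a hypothetical "O") is treated as "no punctuation".
LABEL_TO_MARK = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",  # spelling follows the label name in the metrics table
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def restore_punctuation(chars, labels):
    """Append the mark for each character's label right after that character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    pieces = []
    for ch, label in zip(chars, labels):
        pieces.append(ch)
        pieces.append(LABEL_TO_MARK.get(label, ""))
    return "".join(pieces)


# Hand-written labels standing in for model predictions.
chars = list("你好吗我很好")
labels = ["O", "O", "C_QUESTIONMARK", "O", "O", "C_PERIOD"]
print(restore_punctuation(chars, labels))  # prints 你好吗？我很好。
```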
## Model Overview
### Training data
A combination of the following three datasets:
- News articles from People's Daily, 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)
### Model Performance
- Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
- Metrics Report:
| | precision | recall | f1-score | support |
|:----------------:|:---------:|:------:|:--------:|:-------:|
| C_COMMA | 0.67 | 0.59 | 0.63 | 91566 |
| C_DUNHAO | 0.50 | 0.37 | 0.42 | 21013 |
| C_EXLAMATIONMARK | 0.23 | 0.06 | 0.09 | 399 |
| C_PERIOD | 0.84 | 0.99 | 0.91 | 44258 |
| C_QUESTIONMARK | 0.00 | 1.00 | 0.00 | 0 |
| micro avg | 0.71 | 0.67 | 0.69 | 157236 |
| macro avg | 0.45 | 0.60 | 0.41 | 157236 |
| weighted avg | 0.69 | 0.67 | 0.68 | 157236 |
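
The macro and weighted averages in the report can be reproduced from the per-class rows. The sketch below is a sanity check, not the original evaluation script: it copies the rounded per-class values from the table and recomputes both averages. The micro average is left out because it depends on raw prediction counts that the table does not give. The names `rows`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# Per-class rows copied from the metrics table: (precision, recall, f1-score, support).
rows = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}


def macro_avg(rows, idx):
    """Unweighted mean of one metric over all classes."""
    return sum(r[idx] for r in rows.values()) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric over all classes, weighted by class support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[idx] * r[3] for r in rows.values()) / total


for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:9s} macro={macro_avg(rows, idx):.2f} weighted={weighted_avg(rows, idx):.2f}")
```

Because the macro average gives every class equal weight, `C_QUESTIONMARK` (support 0, precision and F1 of 0.00) pulls macro precision and F1 well below the weighted figures, which are dominated by `C_COMMA` and `C_PERIOD`.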