# Punctuator for Simplified Chinese
This model is a `DistilBertForTokenClassification` fine-tuned to add punctuation to plain text in Simplified Chinese. It is fine-tuned from a distilled version of `bert-base-chinese`.
## Usage
```python
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Load the fine-tuned token-classification model and its matching tokenizer from the Hub.
model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
```
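Once the model has produced one label per token, the labels still have to be turned back into punctuated text. The snippet below is a minimal sketch of that step, not the author's own post-processing. The label names are taken from the metrics table in this README; the mapping to punctuation characters, the `"O"` label for "no punctuation", the convention that a label marks the punctuation *following* its token, and the helper name `insert_punctuation` are all assumptions. Check `model.config.id2label` for the labels the model actually uses.

```python
# Assumed mapping from label names (as listed in the metrics table) to
# punctuation characters. Verify against model.config.id2label.
LABEL_TO_PUNCT = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def insert_punctuation(tokens, labels):
    """Append the punctuation predicted for each token, if any.

    Hypothetical helper: assumes each label describes the punctuation
    that follows its token, and that any unknown label (e.g. "O")
    means "no punctuation".
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pieces = []
    for token, label in zip(tokens, labels):
        pieces.append(token)
        pieces.append(LABEL_TO_PUNCT.get(label, ""))
    return "".join(pieces)


# Hand-written labels standing in for real model predictions.
tokens = list("今天天气很好我们去公园")
labels = ["O"] * len(tokens)
labels[5] = "C_COMMA"    # after "好"
labels[-1] = "C_PERIOD"  # after "园"
print(insert_punctuation(tokens, labels))  # 今天天气很好，我们去公园。
```

With the real model, the label sequence would typically come from taking the `argmax` over the model's output logits and mapping each index through `model.config.id2label`, after dropping special tokens such as `[CLS]` and `[SEP]`.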
## Model Overview
### Training data
A combination of the following three datasets:
- News articles of People's Daily 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)
### Model Performance
- Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
- Metrics Report:
| | precision | recall | f1-score | support |
|:----------------:|:---------:|:------:|:--------:|:-------:|
| C_COMMA | 0.67 | 0.59 | 0.63 | 91566 |
| C_DUNHAO | 0.50 | 0.37 | 0.42 | 21013 |
| C_EXLAMATIONMARK | 0.23 | 0.06 | 0.09 | 399 |
| C_PERIOD | 0.84 | 0.99 | 0.91 | 44258 |
| C_QUESTIONMARK | 0.00 | 1.00 | 0.00 | 0 |
| micro avg | 0.71 | 0.67 | 0.69 | 157236 |
| macro avg | 0.45 | 0.60 | 0.41 | 157236 |
| weighted avg | 0.69 | 0.67 | 0.68 | 157236 |
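
To make the averaged rows easier to interpret, the sketch below recomputes the macro and weighted averages from the per-class rows of the table. It is an illustration only: the numbers are copied from the table (already rounded to two decimals), and the helper names `macro_average` and `weighted_average` are introduced here. The micro average is omitted because it cannot be recovered from the rounded per-class values.

```python
# Per-class rows copied from the metrics table:
# label -> (precision, recall, f1-score, support)
rows = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}
SUPPORT = 3  # index of the support column


def macro_average(rows, column):
    """Unweighted mean over classes: every class counts equally."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_average(rows, column):
    """Mean over classes, weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[column] * r[SUPPORT] for r in rows.values()) / total


for i, name in enumerate(("precision", "recall", "f1-score")):
    print(
        f"{name}: macro={macro_average(rows, i):.2f} "
        f"weighted={weighted_average(rows, i):.2f}"
    )
```

Because the macro average treats every class equally, the `C_QUESTIONMARK` row (support 0, recall reported as 1.00) raises the macro recall to 0.60; averaged over the four classes that actually occur in the validation data it would be about 0.50. The weighted average is unaffected by that row, since its weight is zero.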