	# Punctuator for Simplified Chinese

	This model is a `DistilBertForTokenClassification` model fine-tuned to add punctuation to plain Simplified Chinese text. The base checkpoint is a distilled version of `bert-base-chinese`.

	## Usage

	```python
	from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

	model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
	tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
	```
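The snippet above only loads the weights. Below is a minimal sketch of how per-character predictions could be turned back into punctuated text. It rests on several assumptions that the card does not confirm: the mapping from label names to punctuation marks (`LABEL_TO_PUNCT`), the use of `"O"` for "no punctuation", the convention that a label attaches to the character immediately before the mark, and that `model.config.id2label` carries the label names shown in the metrics table. The helper names `restore_punctuation` and `predict_labels` are hypothetical.

```python
# Assumed mapping from label names (as listed in the metrics table) to marks.
LABEL_TO_PUNCT = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def restore_punctuation(chars, labels):
    """Append the mark for each label after its character; other labels add nothing."""
    return "".join(ch + LABEL_TO_PUNCT.get(lab, "") for ch, lab in zip(chars, labels))


def predict_labels(text, model, tokenizer):
    """Return (chars, labels) with one predicted label name per non-space character."""
    import torch  # imported lazily so restore_punctuation works without PyTorch

    chars = [ch for ch in text if not ch.isspace()]
    enc = tokenizer(chars, is_split_into_words=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()

    labels = ["O"] * len(chars)  # "O" (no punctuation) is an assumed default
    seen = set()
    for pos, word_id in enumerate(enc.word_ids(0)):
        # Keep only the first sub-token of each character; skip special tokens.
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            labels[word_id] = model.config.id2label[pred_ids[pos]]
    return chars, labels


# Usage with the model and tokenizer loaded above (output depends on the model):
# chars, labels = predict_labels("今天天气很好我们去公园", model, tokenizer)
# print(restore_punctuation(chars, labels))
```

Keeping the label-to-text step separate from the model call makes it easy to check the reassembly logic with hand-written labels before running inference.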

	## Model Overview

	### Training data
	A combination of the following three datasets:

	- News articles from People's Daily, 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)

	### Model Performance
	- Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
	- Metrics report:

	| | precision | recall | f1-score | support |
	|:----------------:|:---------:|:------:|:--------:|:-------:|
	| C_COMMA | 0.67 | 0.59 | 0.63 | 91566 |
	| C_DUNHAO | 0.50 | 0.37 | 0.42 | 21013 |
	| C_EXLAMATIONMARK | 0.23 | 0.06 | 0.09 | 399 |
	| C_PERIOD | 0.84 | 0.99 | 0.91 | 44258 |
	| C_QUESTIONMARK | 0.00 | 1.00 | 0.00 | 0 |
	| micro avg | 0.71 | 0.67 | 0.69 | 157236 |
	| macro avg | 0.45 | 0.60 | 0.41 | 157236 |
	| weighted avg | 0.69 | 0.67 | 0.68 | 157236 |
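The macro and weighted averages in the report can be reproduced from the per-class rows, which helps when reading the table. The sketch below assumes the usual definitions: the macro average is the unweighted mean over classes, and the weighted average is the mean weighted by support. The helper names `macro_avg` and `weighted_avg` are hypothetical. The micro average is not recomputed because it needs raw prediction counts, which the card does not give.

```python
# Per-class rows copied from the metrics report: (precision, recall, f1, support).
rows = {
    "C_COMMA": (0.67, 0.59, 0.63, 91566),
    "C_DUNHAO": (0.50, 0.37, 0.42, 21013),
    "C_EXLAMATIONMARK": (0.23, 0.06, 0.09, 399),
    "C_PERIOD": (0.84, 0.99, 0.91, 44258),
    "C_QUESTIONMARK": (0.00, 1.00, 0.00, 0),
}
SUPPORT = 3  # index of the support column in each row


def macro_avg(rows, col):
    """Unweighted mean of one metric column over all classes."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(name, round(macro_avg(rows, col), 2), round(weighted_avg(rows, col), 2))
```

Rounded to two decimals, the results match the `macro avg` and `weighted avg` rows. Because `C_QUESTIONMARK` has a support of 0, it contributes nothing to the weighted average but counts as a full class in the macro average, which is one reason the macro f1-score (0.41) sits well below the weighted one (0.68).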