license: apache-2.0
language:
  - ko
pipeline_tag: text-classification

formal_classifier

formal classifier or honorific classifier

Korean formal (jondaetmal) vs. informal (banmal) speech classifier

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

# Informal sentence: "The professor told us to bring the materials last time, remember?"
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
# Judging by the label examples in this README, LABEL_0 is informal and LABEL_1 is formal.
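
The pipeline returns generic label names. A minimal sketch of mapping them to readable names is shown here; the mapping (0 = informal, 1 = formal) is inferred from the labeled examples in this README rather than stated by the model config, and `LABEL_NAMES` and `readable` are hypothetical names introduced for illustration.

```python
# Assumed mapping, inferred from the label examples in this README:
# 0 = informal (banmal), 1 = formal (jondaetmal).
LABEL_NAMES = {"LABEL_0": "informal", "LABEL_1": "formal"}


def readable(prediction: dict) -> dict:
    """Return a copy of one pipeline result with a human-readable label."""
    return {
        "label": LABEL_NAMES[prediction["label"]],
        "score": prediction["score"],
    }


# Result copied from the usage example above.
raw = [{"label": "LABEL_0", "score": 0.9999139308929443}]
print([readable(p) for p in raw])
```

The same function can be applied to each element of the list that `formal_classifier(...)` returns.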

Dataset sources

Smilegate speech style dataset (korean SmileStyle Dataset)

: https://github.com/smilegate-ai/korean_smile_style_dataset

AI Hub Emotional Dialogue Corpus

: https://www.aihub.or.kr/

Dataset download (the AI Hub corpus can only be downloaded manually from its site)

wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
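
This README does not describe how the TSV was turned into labeled sentences. As a rough sketch, assuming the file has one column of formal sentences and one of informal sentences, labeled pairs could be built as follows. The column names `formal` and `informal` are assumptions (check the file's header row), and `labeled_sentences` is a hypothetical helper.

```python
import csv
import io

# Assumed column names; verify against the header of smilestyle_dataset.tsv.
FORMAL_COLUMN = "formal"
INFORMAL_COLUMN = "informal"


def labeled_sentences(tsv_file):
    """Yield (sentence, label) pairs: 1 for formal text, 0 for informal."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column, label in ((FORMAL_COLUMN, 1), (INFORMAL_COLUMN, 0)):
            sentence = (row.get(column) or "").strip()
            if sentence:  # skip empty cells
                yield sentence, label


# Tiny in-memory stand-in for the downloaded file.
sample = io.StringIO("formal\tinformal\n안녕하세요\t안녕\n\t뭐해\n")
print(list(labeled_sentences(sample)))
```

For the real file, pass `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.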

Development environment

Python 3.9
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1

Model used

beomi/kcbert-base


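Examples

| sentence | English gloss | label (0 = informal, 1 = formal) |
|---|---|---|
| 공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아 | Even though I study hard, my grades don't reflect the effort | 0 |
| 아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요 | I hope the text message you send your son helps restore your relationship | 1 |
| 참 열심히 사신 보람이 있으시네요 | Your years of hard work have clearly paid off | 1 |
| 나도 스시 좋아함 이번 달부터 영국 갈 듯 | I like sushi too; looks like I'm going to the UK starting this month | 0 |
| 본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어 | It's tough because the division head keeps giving me work I can't do | 0 |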

Distribution

| label | train | test |
|---|---|---|
| 0 | 133,430 | 34,908 |
| 1 | 112,828 | 29,839 |
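
To make the class balance and the split size explicit, the counts above can be turned into totals and shares. This is plain arithmetic on the published numbers; the variable names are illustrative.

```python
# Sentence counts copied from the distribution table.
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}

# Total sentences per split and the share of label 0 in each split.
totals = {split: sum(by_label.values()) for split, by_label in counts.items()}
label0_share = {split: counts[split][0] / totals[split] for split in counts}

# Fraction of all sentences that ended up in the training split.
train_fraction = totals["train"] / sum(totals.values())

for split in counts:
    print(f"{split}: {totals[split]:,} sentences, label 0 share {label0_share[split]:.1%}")
print(f"train fraction of all sentences: {train_fraction:.1%}")
```

Label 0 makes up roughly 54% of both splits, and the training split holds about 79% of all sentences, so the two classes are fairly balanced and the split is close to 80/20.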

Results

저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : formal speech (probability 99.19%)
저번에 교수님께서 자료 가져오라했는데 기억나? : informal speech (probability 92.86%)
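
This README does not show how the percentages were computed. One common way to get such a line, sketched here under that caveat, is to apply a softmax to the model's two output logits and report the larger probability. The logits in the example are made up, the label order (0 = informal, 1 = formal) is inferred from the example table, and `softmax` and `describe` are hypothetical helpers.

```python
import math

# Assumed label order, inferred from the example table: 0 = informal, 1 = formal.
LABEL_NAMES = ("informal", "formal")


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(sentence, logits):
    """Format one prediction as '<sentence> : <label> speech (probability NN.NN%)'."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return f"{sentence} : {LABEL_NAMES[best]} speech (probability {probs[best]:.2%})"


# Made-up logits for illustration; real values come from the model's output.
print(describe("기억나?", [2.0, -1.0]))
```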

Citation

@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}