metadata

language:
  - ko
tags:
  - classification
license: mit
datasets:
  - nsmc
widget:
  - text: 불후의 명작입니다! 이렇게 감동적인 내용은 처음이에요
    example_title: Positive
  - text: 시간이 정말 아깝습니다. 10점 만점에 1점도 아까워요..
    example_title: Negative
metrics:
  - accuracy
  - f1
  - precision
  - recall- accuracy

Sentiment Binary Classification (fine-tuning with KoELECTRA-Small-v3 model and Naver Sentiment Movie Corpus dataset)

Usage (Amazon SageMaker inference applicable)

It uses the interface of the SageMaker Inference Toolkit as is, so it can be easily deployed to SageMaker Endpoint.

inference_nsmc.py

import json
import sys
import logging
import torch
from torch import nn
from transformers import ElectraConfig
from transformers import ElectraModel, AutoTokenizer, ElectraTokenizer, ElectraForSequenceClassification

logging.basicConfig(
    level=logging.INFO, 
    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(filename='tmp.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

max_seq_length = 128
classes = ['Neg', 'Pos']

tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/koelectra-small-v3-nsmc")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def model_fn(model_path=None):
    ####
    # If you have your own trained model
    # Huggingface pre-trained model: 'monologg/koelectra-small-v3-discriminator'
    ####
    #config = ElectraConfig.from_json_file(f'{model_path}/config.json')
    #model = ElectraForSequenceClassification.from_pretrained(f'{model_path}/model.pth', config=config)
    
    # Download model from the Huggingface hub
    model = ElectraForSequenceClassification.from_pretrained('daekeun-ml/koelectra-small-v3-nsmc')   
    model.to(device)
    return model


def input_fn(input_data, content_type="application/jsonlines"): 
    data_str = input_data.decode("utf-8")
    jsonlines = data_str.split("\n")
    transformed_inputs = []

    for jsonline in jsonlines:
        text = json.loads(jsonline)["text"][0]
        logger.info("input text: {}".format(text))          
        encode_plus_token = tokenizer.encode_plus(
            text,
            max_length=max_seq_length,
            add_special_tokens=True,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        transformed_inputs.append(encode_plus_token)
        
    return transformed_inputs


def predict_fn(transformed_inputs, model):
    predicted_classes = []
    
    for data in transformed_inputs:
        data = data.to(device)
        output = model(**data)

        softmax_fn = nn.Softmax(dim=1)
        softmax_output = softmax_fn(output[0])
        _, prediction = torch.max(softmax_output, dim=1)

        predicted_class_idx = prediction.item()
        predicted_class = classes[predicted_class_idx]
        score = softmax_output[0][predicted_class_idx]
        logger.info("predicted_class: {}".format(predicted_class))

        prediction_dict = {}
        prediction_dict["predicted_label"] = predicted_class
        prediction_dict['score'] = score.cpu().detach().numpy().tolist()

        jsonline = json.dumps(prediction_dict)
        logger.info("jsonline: {}".format(jsonline))        
        predicted_classes.append(jsonline)

    predicted_classes_jsonlines = "\n".join(predicted_classes)
    return predicted_classes_jsonlines


def output_fn(outputs, accept="application/jsonlines"):
    return outputs, accept

test.py

>>> from inference_nsmc import model_fn, input_fn, predict_fn, output_fn
>>> with open('samples/nsmc.txt', mode='rb') as file:
>>>     model_input_data = file.read()
>>> model = model_fn()
>>> transformed_inputs = input_fn(model_input_data)
>>> predicted_classes_jsonlines = predict_fn(transformed_inputs, model)
>>> model_outputs = output_fn(predicted_classes_jsonlines)
>>> print(model_outputs[0])    
   
[{inference_nsmc.py:47} INFO - input text: 이 영화는 최고의 영화입니다
[{inference_nsmc.py:47} INFO - input text: 최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다
[{inference_nsmc.py:77} INFO - predicted_class: Pos
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Pos", "score": 0.9619030952453613}
[{inference_nsmc.py:77} INFO - predicted_class: Neg
[{inference_nsmc.py:84} INFO - jsonline: {"predicted_label": "Neg", "score": 0.9994170665740967}
{"predicted_label": "Pos", "score": 0.9619030952453613}
{"predicted_label": "Neg", "score": 0.9994170665740967}

Sample data (samples/nsmc.txt)

{"text": ["이 영화는 최고의 영화입니다"]}
{"text": ["최악이에요. 배우의 연기력도 좋지 않고 내용도 너무 허접합니다"]}

References

KoELECTRA: https://github.com/monologg/KoELECTRA
Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc