|
--- |
|
language: |
|
- en |
|
- ko |
|
license: llama3 |
|
library_name: transformers |
|
tags: |
|
- ko |
|
- eval |
|
- llm-eval |
|
base_model: |
|
- meta-llama/Meta-Llama-3-8B-Instruct |
|
datasets: |
|
- nayohan/feedback-collection-ko |
|
- nayohan/feedback-collection-ko-chat |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# **Introduction** |
|
This model was trained by translating the [prometheus-eval/Feedback-Collection](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) dataset into Korean and fine-tuning the llama3-8b-it model on it.
|
Train Dataset: [nayohan/feedback-collection-ko](https://huggingface.co/datasets/nayohan/feedback-collection-ko) |
|
|
|
### **Loading the Model** |
|
|
|
Use the following Python code to load the model: |
|
|
|
```python |
|
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
model_name = "nayohan/llama3-8b-it-prometheus-ko" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_name, |
|
device_map="auto", |
|
torch_dtype=torch.bfloat16 |
|
) |
|
``` |
|
|
|
### **Generating Text** |
|
The system prompt is fixed. Set the score rubric to match your task, then fill in orig_instruction, orig_response, and orig_reference_answer with the content you want to evaluate.
|
```python |
|
system_prompt = """###Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given. |
|
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general. |
|
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric. |
|
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\" |
|
4. Please do not generate any other opening, closing, and explanations.""" |
|
|
|
sample = { |
|
'orig_instruction': "λλ μ²¨λ¨ κΈ°μ νλ‘μ νΈλ₯Ό μ§ννλ νμ μλ€. κ·Έλ¬λ μ΅κ·Ό νλ‘μ νΈ λ°©ν₯μ λκ³ νμλ€ μ¬μ΄μ μ§μμ μΈ κ°λ±μ΄ λ°μνκ³ μλ€. ν κ·Έλ£Ήμ κΈμ§μ μ΄κ³ μννμ§λ§ μ μ¬μ μΌλ‘ κ²μμ λ°κΏ μ μλ μ κ·Όλ²μ κ°λ ₯νκ² μΉνΈνκ³ μλ€. λμ‘°μ μΌλ‘, λ€λ₯Έ κ·Έλ£Ήμ λ³΄λ€ μΈ‘μ λκ³ λ μμ νλ©° μμ¦λ μ λ΅μ μ νΈνλ€. κ²°κ³Όμ μΌλ‘ μ°λ¦¬ νμ λΆμ΄λμ΄ μ§μ μ μ΄λ£° μ μλ€. μ°λ¦¬μ λνλ₯Ό μ€μ¬νκ³ ν΄κ²°μ μ΄λμ΄λΌ μ μλ AI λͺ¨λΈμ΄ νμνλ€. μ΄λ¬ν μν©μ λμνμ¬ AI λͺ¨λΈμ 무μμ λ§ν΄μΌ νλκ°?",
|
'orig_response': "κ·Έλ¬λκΉ νλ‘μ νΈ λ°©ν₯μ ν©μκ° μ λλ νμ μλ κ±° μλμΌ? λ€λ€ μ λ§λλ‘ λ°°μμΌ ν κ² κ°λ€μ. μ΄μ©λ©΄ λμ μ λμ§κ³ μ΄λ μͺ½μ΄ μΉλ¦¬νλμ§ λ΄μΌ ν κ² κ°μμ. κ·Έλ κ² νλ©΄ λΌμμ΄ μκ³ λͺ¨λκ° μΌν°λ‘ λμκ° μ μμ΅λλ€. μννλ μμ νλ μκ΄μμ΄μ. νλλ₯Ό 골λΌμ κ·Έλ₯ κ°μΈμ. κ²λ€κ°, λͺ¨λ κ²μ΄ 무λμ§λ©΄ μλ‘ λΉλνκ³ λμ΄κ° μ μμ΅λλ€. μλλ©΄ λ μ’μ κ²μ, μ΄λ€ κ·Έλ£Ήμ μμ΄λμ΄κ° λ λμμ§ λ³΄κΈ° μν κ²½μμ΄ μ μ λΌ? ν¨λ°°μλ μ°μΉμλ₯Ό μν΄ μ μ¬μ μ¬μΌ ν΄μ.",
|
'orig_reference_answer': "μ΄ νμ λͺ¨λ μ¬λλ€μ΄ νλ‘μ νΈμ μ΄μ μ μ΄κ³ μ±κ³΅νκΈ°λ₯Ό μνλ€λ κ²μ λΆλͺνλ©°, μ΄λ λͺ¨λ ν΄κ²°μ νλ₯ν μΆλ°μ μ΄λ€. λν κ°λ±μ μνκ³Ό νμ μ λν μλ‘ λ€λ₯Έ κ΄μ μμ λ°μνλ€λ κ²λ λΆλͺν©λλ€. λ λ€ νλ‘μ νΈμ μ±κ³΅μ μ€μν κ³ λ € μ¬νμλλ€. λ μ κ·Όλ² λͺ¨λμμ μ ν¨ν μ μ μΈμ νλ κ²μΌλ‘ μμνκ² μ΅λλ€. κΈμ§μ μΈ μ κ·Όλ²μ μΉνΈνλ νμ λμ 보μκ³Ό νκΈ°μ μΈ νμ μ μ μ¬λ ₯μ μν΄ μ£Όλλλ©°, μ΄λ λͺ¨λ μ²¨λ¨ νλ‘μ νΈμμ νλ₯νκ³ νμμ μλλ€.",
|
'orig_criteria':'λͺ¨νμ λνμμ κ°λ± ν΄κ²°μ μΌλ§λ ν¨κ³Όμ μΌλ‘ μ²λ¦¬νλκ°?', |
|
'orig_score1_description':'λͺ¨λΈμ κ°λ±μ΄λ μ€ν΄λ₯Ό κ°μ€μμΌ λ¬Έμ λ₯Ό μ€μ¬νκ±°λ ν΄κ²°ν μ μλ λ₯λ ₯μ 보μ΄μ§ μλλ€.', |
|
'orig_score2_description':'μ΄ λͺ¨λΈμ κ°λ±μ λν μΈμμ΄ μμ§λ§ μ΄λ₯Ό ν΄κ²°νλ €λ μλλ ν¨κ³Όκ° μκ±°λ μλͺ»λ μ§μΉ¨μ κ°μ§κ³ μλ€.', |
|
'orig_score3_description':'μ΄ λͺ¨λΈμ κ°λ±μ μ λΉν μ²λ¦¬νμ¬ μΌλΆ μ±κ³΅μ μΈ ν΄κ²° μ μ μ 보μ¬μ£Όμ§λ§ λ μΌκ΄μ±μ΄ μμ μ μλ€.', |
|
'orig_score4_description':'μ΄ λͺ¨λΈμ κ°λ±μ μ μ²λ¦¬νμ¬ κΈ΄μ₯μ νμ°μν€κ³ ν΄κ²°μ ν¨κ³Όμ μΌλ‘ μλ΄νμ§λ§ λ―ΈμΈν λ―ΈλλΌμ΄ μμ΅λλ€.', |
|
'orig_score5_description':'μ΄ λͺ¨λΈμ κ°λ±μ νλ₯νκ² κ΄λ¦¬νκ³ , μ§μμ μΌλ‘ κΈ΄μ₯μ νμ°μν€λ©°, λνλ₯Ό ννμΌλ‘ μλ΄νκ³ κΈμ μ μΈ λν νκ²½μ μ‘°μ±νλ€.', |
|
'orig_feedback': 'μ 곡λ μλ΅μ λΉλ©΄ν λ¬Έμ λ₯Ό μ‘°μ νκ±°λ ν΄κ²°νλ λ₯λ ₯μ 보μ¬μ£Όμ§ μλλ€. λμ νμ μ°λ €λ₯Ό μ¬μννκ³ μ μ¬μ μΈ κ²°κ³Όμ λν κ³ λ € μμ΄ λμ μ λμ§κ±°λ λνλ₯Ό κ°μ΅νλ κ²κ³Ό κ°μ λΉκ±΄μ€μ μ루μμ μ μνλ€. λν μλ΅μ μν©μ΄ μλͺ»λλ©΄ ν ꡬμ±μλ€μ΄ μλ‘λ₯Ό λΉλν΄μΌ νλ€λ κ²μ μμνλ€. κ°λ±μ λμ± μνμν¨λ€. 건μ€μ μΈ λνλ₯Ό μ₯λ €νκ±°λ λ μ κ·Όλ² μ¬μ΄μ μ€κ° μ§μ μ μ°Ύλ κ²μ μ€μμ±μ μΈμ νμ§ μλλ€. λ°λΌμ μ 체 μ μλ 1μ΄λ€.',
|
'orig_score': 1, |
|
} |
|
|
|
instruction = f"""###The instruction to evaluate: {sample['orig_instruction']} |
|
###Response to evaluate: {sample['orig_response']} |
|
###Reference Answer (Score 5): {sample['orig_reference_answer']} |
|
###Score Rubrics: [{sample['orig_criteria']}] |
|
Score 1: {sample['orig_score1_description']} |
|
Score 2: {sample['orig_score2_description']} |
|
Score 3: {sample['orig_score3_description']} |
|
Score 4: {sample['orig_score4_description']} |
|
Score 5: {sample['orig_score5_description']} |
|
###Feedback:""" |
|
|
|
# For training, the assistant target was formatted as follows:
|
# output = f"""{sample['orig_feedback']} |
|
# [RESULT] {sample['orig_score']}""" |
|
|
|
conversation = [ |
|
{"role": "system", "content": system_prompt}, |
|
{"role": "user", "content": instruction}, |
|
# {"role": "assistant", "content": output} |
|
] |
|
|
|
input_ids = tokenizer.apply_chat_template( |
|
conversation, |
|
tokenize=True, |
|
add_generation_prompt=True, |
|
return_tensors='pt' |
|
).to("cuda") |
|
|
|
output = model.generate(input_ids, max_new_tokens=512) |
|
output_text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True) |
|
print(output_text) |
|
``` |
|
If you don't have a reference answer, the model can still evaluate: it scores orig_response against orig_instruction alone. Use the following template in that case.
|
```python |
|
instruction = f"""###The instruction to evaluate: {sample['orig_instruction']} |
|
###Response to evaluate: {sample['orig_response']} |
|
###Score Rubrics: [{sample['orig_criteria']}] |
|
Score 1: {sample['orig_score1_description']} |
|
Score 2: {sample['orig_score2_description']} |
|
Score 3: {sample['orig_score3_description']} |
|
Score 4: {sample['orig_score4_description']} |
|
Score 5: {sample['orig_score5_description']} |
|
###Feedback:""" |
|
``` |
|
Because the model was trained on feedback that was itself sometimes truncated, the generated feedback may occasionally be cut off as well (see the retry sketch after the score-extraction helper below).
|
``` |
|
# Result with orig_reference_answer |
|
# OUTPUT: μ΄ λμμ κ°λ± ν΄κ²°μ λν μ΄ν΄κ° λΆμ‘±ν¨μ 보μ¬μ€λ€. λμ μ λμ§κ±°λ κ²½μμ μ μνλ κ²κ³Ό κ°μ΄ μ 곡λ μ μμ λ¬Έμ μ 볡μ‘μ±μ 무μνκ³ ν λ΄μ λ€μν κ΄μ μ 무μνλ€. μλ΅μ λ μ κ·Όλ²μ μ μ¬μ κ°μΉλ₯Ό μΈμ νμ§ μμΌλ©° ν ꡬμ±μ κ°μ μ΄ν΄μ μ‘΄μ€μ μ΄μ§νμ§λ μλλ€. λν μλ΅μ νμ μ΄μ κ³Ό νλ‘μ νΈμ λν νμ μ μΈμ νμ§ μλλ€. λ°λΌμ μ 체 μ μλ 1μ΄λ€. |
|
[RESULT] 1 |
|
# Result without orig_reference_answer |
|
# OUTPUT: λμμ κ°λ± ν΄κ²°μ λν μ΄ν΄λ₯Ό λνλ΄μ§ μλλ€. AI λͺ¨λΈμ κ°λ±μ ν΄κ²°ν기보λ€λ κ°λ±μ μνμν€λ κ²μ μ μνλ©°, μ΄λ μ μ 루λΈλ¦μ λ°λΌ μꡬ μ¬νμ μ΄κΈλλ€. λμ μ λμ§κ³ κ²½μμ μ μνλ κ²μ ν ꡬμ±μ κ°μ κΈ΄μ₯μ νμ°μν€λ λ° λμμ΄ λμ§ μκ³ μ€νλ € λ λ§μ κ°λ±μ μ΄λ°ν μ μλ€. λν, ν ꡬμ±μμ΄ λ λμ μμ΄λμ΄λ₯Ό κ°λ κ²μ΄ μλλΌ "λ λμ" μμ΄λμ΄λ₯Ό κ°λλ€λ κ²μ μμνλ κ²μ ν ꡬμ±μ κ°μ νν©μ μ΄μ§νμ§ μλλ€. λ°λΌμ μ 체 μ μλ 1μ΄λ€.
|
[RESULT] 1 |
|
``` |
|
If you only need the numeric score from the evaluation, you can use the following extract_score function.
|
```python |
|
import re |
|
def extract_score(text):
    """Extract the integer score following the [RESULT] tag; return 0 if none is found."""
    match = re.search(r'\[RESULT\]\s+([0-5])', text)
    return int(match.group(1)) if match else 0
|
|
|
predict_score = extract_score(output_text) |
|
print(predict_score) # 1 |
|
``` |
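
If generation is ever cut off before the `[RESULT]` tag, `extract_score` returns 0. A simple guard is to regenerate with a larger token budget in that case. Below is a minimal sketch reusing the `model`, `tokenizer`, `conversation`, and `extract_score` defined above; the helper name and the retry budgets are illustrative assumptions, not part of the original workflow:

```python
def generate_with_retry(conversation, budgets=(512, 1024)):
    """Generate an evaluation, retrying with a larger max_new_tokens
    if no [RESULT] score appears in the output."""
    input_ids = tokenizer.apply_chat_template(
        conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors='pt'
    ).to("cuda")
    text, score = "", 0
    for max_new_tokens in budgets:
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
        text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
        score = extract_score(text)  # 0 means the [RESULT] tag was missing
        if score:
            break
    return text, score

feedback, predict_score = generate_with_retry(conversation)
```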
|
|
|
### **Heatmap Visualize** |
|
For [eng->eng], we randomly sampled 200 examples from the training data, extracted scores from the model-generated outputs, and compared them against the gold scores.

Since the training and test data are not separated in this setting, the result only shows how well the model fit its training data.

For [ko->ko], we sampled 200 examples from this [testset](https://huggingface.co/datasets/nayohan/feedback-collection-ko-chat/viewer/default/test); llama3-8b-it-prometheus-ko was trained on the train split only.
|
|
|
- prometheus-7b-v1.0 (English train -> English inference) # 3 samples failed to produce a score, total 197

- llama3-8b-it-prometheus-ko (Korean train -> Korean inference) # total 200
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6152b4b9ecf3ca6ab820e325/ssZRGTysyiOZD4ttNOD4s.png) |
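
As a rough illustration, the comparison above can be visualized as a gold-vs-predicted confusion-matrix heatmap. The sketch below uses scikit-learn and seaborn; the library choices and the placeholder score lists are assumptions for illustration, not necessarily how the figure above was produced:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Placeholder values for illustration; in practice, collect gold scores from the
# sampled evalset and predictions via extract_score(output_text) for each item.
gold_scores = [5, 1, 3, 4, 2]
pred_scores = [5, 1, 2, 4, 2]

labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(gold_scores, pred_scores, labels=labels)  # rows: gold, cols: predicted

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted score')
plt.ylabel('Gold score')
plt.show()
```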
|
|
|
### **Citation** |
|
```bibtex |
|
@misc{kim2023prometheus, |
|
title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models}, |
|
author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo}, |
|
year={2023}, |
|
eprint={2310.08491}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
Our training code can be found here: [TBD]