---
language:
- en
- ko
license: llama3
library_name: transformers
tags:
- ko
- eval
- llm-eval
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
datasets:
- nayohan/feedback-collection-ko
- nayohan/feedback-collection-ko-chat
pipeline_tag: text-generation
---
# **Introduction**
This model was fine-tuned from the llama3-8b-it model on a Korean translation of the [prometheus-eval/Feedback-Collection](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) dataset.
Train Dataset: [nayohan/feedback-collection-ko](https://huggingface.co/datasets/nayohan/feedback-collection-ko)
### **Loading the Model**
Use the following Python code to load the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nayohan/llama3-8b-it-prometheus-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```
### **Generating Text**
The system prompt is fixed. Set the score rubric according to your task, then fill in `orig_instruction`, `orig_response`, and `orig_reference_answer` with the content to evaluate.
```python
system_prompt = """###Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations."""
sample = {
    'orig_instruction': "I am part of a team working on a cutting-edge technology project. Recently, however, there has been an ongoing conflict among team members over the project's direction. One group strongly advocates a radical, risky, but potentially game-changing approach. In contrast, the other group prefers a more measured, safer, and proven strategy. As a result, our team is divided and cannot make progress. We need an AI model that can mediate our conversation and lead us to a resolution. What should the AI model say in response to this situation?",
    'orig_response': "So the team can't even agree on the project direction? Everyone will just have to learn to live with it. Maybe flip a coin and see which side wins. That way there is no argument and everyone can get back to work. Risky or safe does not matter; just pick one and go with it. Besides, if everything collapses you can blame each other and move on. Or better yet, how about a competition to see which group's idea is better? The losers buy lunch for the winners.",
    'orig_reference_answer': "It is clear that everyone on this team is passionate about the project and wants it to succeed, which is a great starting point for any resolution. It is also clear that the conflict stems from differing perspectives on risk and innovation; both are important considerations for the project's success. I would start by acknowledging the valid points in both approaches. The team advocating the radical approach is driven by the potential for reward and breakthrough innovation, which is admirable and essential in any cutting-edge project.",
    'orig_criteria': 'How effectively does the model handle conflict resolution in conversations?',
    'orig_score1_description': 'The model shows no ability to mediate or resolve conflicts or misunderstandings, and instead aggravates them.',
    'orig_score2_description': 'The model recognizes the conflict but its attempts to resolve it are ineffective or misguided.',
    'orig_score3_description': 'The model handles the conflict moderately well, showing some successful resolution tactics, but may lack consistency.',
    'orig_score4_description': 'The model handles the conflict well, defusing tension and guiding toward a resolution effectively, with only subtle slip-ups.',
    'orig_score5_description': 'The model manages the conflict excellently, consistently defusing tension, steering the dialogue toward compromise, and fostering a positive conversational environment.',
    'orig_feedback': "The given response shows no ability to mediate or resolve the issue at hand. Instead, it downplays the team's concerns and proposes non-constructive solutions, such as flipping a coin or holding a competition, without considering the potential consequences. The response also implies that team members should blame each other if things go wrong, which aggravates the conflict further. It fails to recognize the importance of encouraging constructive dialogue or finding middle ground between the two approaches. So the overall score is 1.",
    'orig_score': 1,
}
instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
###Response to evaluate: {sample['orig_response']}
###Reference Answer (Score 5): {sample['orig_reference_answer']}
###Score Rubrics: [{sample['orig_criteria']}]
Score 1: {sample['orig_score1_description']}
Score 2: {sample['orig_score2_description']}
Score 3: {sample['orig_score3_description']}
Score 4: {sample['orig_score4_description']}
Score 5: {sample['orig_score5_description']}
###Feedback:"""
# for training
# output = f"""{sample['orig_feedback']}
# [RESULT] {sample['orig_score']}"""
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": instruction},
    # {"role": "assistant", "content": output}
]
input_ids = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt'
).to("cuda")
output = model.generate(input_ids, max_new_tokens=512)
output_text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(output_text)
```
The model also works without a reference answer: it then evaluates `orig_response` against `orig_instruction` and the score rubric alone. Use the following template in that case.
```python
instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
###Response to evaluate: {sample['orig_response']}
###Score Rubrics: [{sample['orig_criteria']}]
Score 1: {sample['orig_score1_description']}
Score 2: {sample['orig_score2_description']}
Score 3: {sample['orig_score3_description']}
Score 4: {sample['orig_score4_description']}
Score 5: {sample['orig_score5_description']}
###Feedback:"""
```
Because the model was trained on truncated feedback, the generated feedback may itself occasionally be cut off.
```
# Result with orig_reference_answer
# OUTPUT: This response shows a lack of understanding of conflict resolution. By suggesting measures such as flipping a coin or proposing a competition, it ignores the complexity of the problem at hand and dismisses the diverse perspectives within the team. The response does not acknowledge the potential value of either approach, nor does it promote mutual understanding and respect among team members. It also fails to recognize the team's passion and commitment to the project. So the overall score is 1.
[RESULT] 1
# Result without orig_reference_answer
# OUTPUT: The response shows no understanding of conflict resolution. The AI model suggests escalating the conflict rather than resolving it, which runs against the requirements of the score rubric. Flipping a coin or proposing a competition does not help defuse the tension among team members and may instead spark further conflict. Moreover, implying that one group simply has the "better" idea does not promote harmony within the team. So the overall score is 1.
[RESULT] 1
```
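As a simple guard against such truncation (our suggestion, not part of the original card), you can check for the `[RESULT]` tag before parsing; if it is missing, the generation was likely cut off and can be retried with a larger `max_new_tokens`:

```python
def is_truncated(text):
    # A complete evaluation ends with "[RESULT] <score>"; if the tag is
    # missing, the feedback was likely cut off by the token budget.
    return '[RESULT]' not in text

print(is_truncated('Good feedback. [RESULT] 4'))  # False
print(is_truncated('The response fails to'))      # True
```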
If you just want to get a score from the evaluation, you can use the following extract_score function.
```python
import re

def extract_score(text):
    pattern = re.compile(r'\[RESULT\]\s+([0-5])')
    match = pattern.search(text)
    if match:
        return int(match.group(1))
    return 0

predict_score = extract_score(output_text)
print(predict_score)  # 1
```
### **Heatmap Visualize**
[eng->eng] We randomly sampled a 200-example eval set from the [training data](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection), extracted scores from the model-generated outputs, and compared them with the gold scores. Since the training and test data are not separated here, this only shows how well the model fit its training data.
[ko->ko] We sampled a 200-example eval set from this [test set](https://huggingface.co/datasets/nayohan/feedback-collection-ko-chat/viewer/default/test); llama3-8b-it-prometheus-ko was trained only on the train split.
- prometheus-7b-v1.0 (english train -> english inference) # 3 outputs had no parsable score, so 197 in total
- llama3-8b-it-prometheus-ko (korean train -> korean inference) # 200 in total
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6152b4b9ecf3ca6ab820e325/ssZRGTysyiOZD4ttNOD4s.png)
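The heatmap above is essentially a confusion matrix of gold versus predicted scores. A minimal sketch of how such a matrix can be computed (an assumed workflow; `gold` and `pred` below are toy values, not the card's actual results):

```python
from collections import Counter

def score_matrix(gold_scores, pred_scores):
    # Rows = gold score 1-5, columns = predicted score 1-5;
    # the diagonal counts exact agreements.
    counts = Counter(zip(gold_scores, pred_scores))
    return [[counts[(g, p)] for p in range(1, 6)] for g in range(1, 6)]

gold = [1, 2, 3, 4, 5, 5]
pred = [1, 2, 3, 5, 5, 5]
matrix = score_matrix(gold, pred)
print(matrix[4][4])  # examples where both gold and prediction are 5 -> 2
```

Plotting `matrix` with any heatmap tool (e.g. `matplotlib.pyplot.imshow`) reproduces the style of figure shown above.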
### **Citation**
```bibtex
@misc{kim2023prometheus,
title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
year={2023},
eprint={2310.08491},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Our training code can be found here: [TBD]