---
language:
- en
- ko
license: llama3
library_name: transformers
tags:
- ko
- eval
- llm-eval
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
datasets:
- nayohan/feedback-collection-ko
- nayohan/feedback-collection-ko-chat
pipeline_tag: text-generation
---

# **Introduction**
This model was trained by translating the [prometheus-eval/Feedback-Collection](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) dataset into Korean and fine-tuning the llama3-8b-it model on the result.
Train dataset: [nayohan/feedback-collection-ko](https://huggingface.co/datasets/nayohan/feedback-collection-ko)

### **Loading the Model**

Use the following Python code to load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nayohan/llama3-8b-it-prometheus-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map="auto",
  torch_dtype=torch.bfloat16
)
```

### **Generating Text**
The system prompt is fixed. Set the score rubric according to your task, then change `orig_instruction`, `orig_response`, and `orig_reference_answer` to run an evaluation.
```python
system_prompt = """###Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations."""

sample = {
  'orig_instruction': "λ‚˜λŠ” 첨단 기술 ν”„λ‘œμ νŠΈλ₯Ό μ§„ν–‰ν•˜λŠ” νŒ€μ— μžˆλ‹€. κ·ΈλŸ¬λ‚˜ 졜근 ν”„λ‘œμ νŠΈ λ°©ν–₯을 놓고 νŒ€μ›λ“€ 사이에 지속적인 κ°ˆλ“±μ΄ λ°œμƒν•˜κ³  μžˆλ‹€. ν•œ 그룹은 급진적이고 μœ„ν—˜ν•˜μ§€λ§Œ 잠재적으둜 κ²Œμž„μ„ λ°”κΏ€ 수 μžˆλŠ” 접근법을 κ°•λ ₯ν•˜κ²Œ μ˜Ήν˜Έν•˜κ³  μžˆλ‹€. λŒ€μ‘°μ μœΌλ‘œ, λ‹€λ₯Έ 그룹은 보닀 μΈ‘μ •λ˜κ³  더 μ•ˆμ „ν•˜λ©° μž…μ¦λœ μ „λž΅μ„ μ„ ν˜Έν•œλ‹€. 결과적으둜 우리 νŒ€μ€ λΆ„μ—΄λ˜μ–΄ 진전을 이룰 수 μ—†λ‹€. 우리의 λŒ€ν™”λ₯Ό μ€‘μž¬ν•˜κ³  해결을 μ΄λŒμ–΄λ‚Ό 수 μžˆλŠ” AI λͺ¨λΈμ΄ ν•„μš”ν•˜λ‹€. μ΄λŸ¬ν•œ 상황에 λŒ€μ‘ν•˜μ—¬ AI λͺ¨λΈμ€ 무엇을 말해야 ν•˜λŠ”κ°€?",
  'orig_response': "κ·ΈλŸ¬λ‹ˆκΉŒ ν”„λ‘œμ νŠΈ λ°©ν–₯에 ν•©μ˜κ°€ μ•ˆ λ˜λŠ” νŒ€μ— μžˆλŠ” κ±° μ•„λ‹ˆμ•Ό? λ‹€λ“€ 잘 λ§žλ„λ‘ λ°°μ›Œμ•Ό ν•  것 κ°™λ„€μš”. μ–΄μ©Œλ©΄ 동전을 λ˜μ§€κ³  μ–΄λŠ μͺ½μ΄ μŠΉλ¦¬ν•˜λŠ”μ§€ 봐야 ν•  것 κ°™μ•„μš”. κ·Έλ ‡κ²Œ ν•˜λ©΄ λ…ΌμŸμ΄ μ—†κ³  λͺ¨λ‘κ°€ μΌν„°λ‘œ λŒμ•„κ°ˆ 수 μžˆμŠ΅λ‹ˆλ‹€. μœ„ν—˜ν•˜λ“  μ•ˆμ „ν•˜λ“  μƒκ΄€μ—†μ–΄μš”. ν•˜λ‚˜λ₯Ό κ³¨λΌμ„œ κ·Έλƒ₯ κ°€μ„Έμš”. κ²Œλ‹€κ°€, λͺ¨λ“  것이 λ¬΄λ„ˆμ§€λ©΄ μ„œλ‘œ λΉ„λ‚œν•˜κ³  λ„˜μ–΄κ°ˆ 수 μžˆμŠ΅λ‹ˆλ‹€. μ•„λ‹ˆλ©΄ 더 쒋은 것은, μ–΄λ–€ 그룹의 아이디어가 더 λ‚˜μ€μ§€ 보기 μœ„ν•œ 경쟁이 μ™œ μ•ˆ 돼? νŒ¨λ°°μžλŠ” 우승자λ₯Ό μœ„ν•΄ 점심을 사야 ν•΄μš”.",
  'orig_reference_answer': "이 νŒ€μ˜ λͺ¨λ“  μ‚¬λžŒλ“€μ΄ ν”„λ‘œμ νŠΈμ— 열정적이고 μ„±κ³΅ν•˜κΈ°λ₯Ό μ›ν•œλ‹€λŠ” 것은 λΆ„λͺ…ν•˜λ©°, μ΄λŠ” λͺ¨λ“  ν•΄κ²°μ˜ ν›Œλ₯­ν•œ μΆœλ°œμ μ΄λ‹€. λ˜ν•œ κ°ˆλ“±μ€ μœ„ν—˜κ³Ό ν˜μ‹ μ— λŒ€ν•œ μ„œλ‘œ λ‹€λ₯Έ κ΄€μ μ—μ„œ λ°œμƒν•œλ‹€λŠ” 것도 λΆ„λͺ…ν•©λ‹ˆλ‹€. λ‘˜ λ‹€ ν”„λ‘œμ νŠΈμ˜ 성곡에 μ€‘μš”ν•œ κ³ λ € μ‚¬ν•­μž…λ‹ˆλ‹€. 두 접근법 λͺ¨λ‘μ—μ„œ μœ νš¨ν•œ 점을 μΈμ •ν•˜λŠ” κ²ƒμœΌλ‘œ μ‹œμž‘ν•˜κ² μŠ΅λ‹ˆλ‹€. 급진적인 접근법을 μ˜Ήν˜Έν•˜λŠ” νŒ€μ€ 높은 보상과 획기적인 ν˜μ‹ μ˜ 잠재λ ₯에 μ˜ν•΄ μ£Όλ„λ˜λ©°, μ΄λŠ” λͺ¨λ“  첨단 ν”„λ‘œμ νŠΈμ—μ„œ ν›Œλ₯­ν•˜κ³  ν•„μˆ˜μ μž…λ‹ˆλ‹€.",
  'orig_criteria':'λͺ¨ν˜•μ€ λŒ€ν™”μ—μ„œ κ°ˆλ“± 해결을 μ–Όλ§ˆλ‚˜ 효과적으둜 μ²˜λ¦¬ν•˜λŠ”κ°€?',
  'orig_score1_description':'λͺ¨λΈμ€ κ°ˆλ“±μ΄λ‚˜ μ˜€ν•΄λ₯Ό κ°€μ€‘μ‹œμΌœ 문제λ₯Ό μ€‘μž¬ν•˜κ±°λ‚˜ ν•΄κ²°ν•  수 μžˆλŠ” λŠ₯λ ₯을 보이지 μ•ŠλŠ”λ‹€.',
  'orig_score2_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ— λŒ€ν•œ 인식이 μžˆμ§€λ§Œ 이λ₯Ό ν•΄κ²°ν•˜λ €λŠ” μ‹œλ„λŠ” νš¨κ³Όκ°€ μ—†κ±°λ‚˜ 잘λͺ»λœ 지침을 가지고 μžˆλ‹€.',
  'orig_score3_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ μ λ‹Ήνžˆ μ²˜λ¦¬ν•˜μ—¬ 일뢀 성곡적인 ν•΄κ²° μ „μˆ μ„ λ³΄μ—¬μ£Όμ§€λ§Œ 더 일관성이 μžˆμ„ 수 μžˆλ‹€.',
  'orig_score4_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ 잘 μ²˜λ¦¬ν•˜μ—¬ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€κ³  해결을 효과적으둜 μ•ˆλ‚΄ν•˜μ§€λ§Œ λ―Έμ„Έν•œ λ―Έλ„λŸΌμ΄ μžˆμŠ΅λ‹ˆλ‹€.',
  'orig_score5_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ ν›Œλ₯­ν•˜κ²Œ κ΄€λ¦¬ν•˜κ³ , μ§€μ†μ μœΌλ‘œ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€λ©°, λŒ€ν™”λ₯Ό νƒ€ν˜‘μœΌλ‘œ μ•ˆλ‚΄ν•˜κ³  긍정적인 λŒ€ν™” ν™˜κ²½μ„ μ‘°μ„±ν•œλ‹€.',
  'orig_feedback': '제곡된 응닡은 λ‹Ήλ©΄ν•œ 문제λ₯Ό μ‘°μ •ν•˜κ±°λ‚˜ ν•΄κ²°ν•˜λŠ” λŠ₯λ ₯을 보여주지 μ•ŠλŠ”λ‹€. λŒ€μ‹  νŒ€μ˜ 우렀λ₯Ό μ‚¬μ†Œν™”ν•˜κ³  잠재적인 결과에 λŒ€ν•œ κ³ λ € 없이 동전을 λ˜μ§€κ±°λ‚˜ λŒ€νšŒλ₯Ό κ°œμ΅œν•˜λŠ” 것과 같은 비건섀적 μ†”λ£¨μ…˜μ„ μ œμ•ˆν•œλ‹€. λ˜ν•œ 응닡은 상황이 잘λͺ»λ˜λ©΄ νŒ€ ꡬ성원듀이 μ„œλ‘œλ₯Ό λΉ„λ‚œν•΄μ•Ό ν•œλ‹€λŠ” 것을 μ•”μ‹œν•œλ‹€. κ°ˆλ“±μ„ λ”μš± μ•…ν™”μ‹œν‚¨λ‹€. 건섀적인 λŒ€ν™”λ₯Ό μž₯λ €ν•˜κ±°λ‚˜ 두 접근법 μ‚¬μ΄μ˜ 쀑간 지점을 μ°ΎλŠ” κ²ƒμ˜ μ€‘μš”μ„±μ„ μΈμ •ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.',
  'orig_score': 1,
}

instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
  ###Response to evaluate: {sample['orig_response']}
  ###Reference Answer (Score 5): {sample['orig_reference_answer']}
  ###Score Rubrics: [{sample['orig_criteria']}]
  Score 1: {sample['orig_score1_description']}
  Score 2: {sample['orig_score2_description']}
  Score 3: {sample['orig_score3_description']}
  Score 4: {sample['orig_score4_description']}
  Score 5: {sample['orig_score5_description']}
  ###Feedback:"""

# for training
# output = f"""{sample['orig_feedback']}
#   [RESULT] {sample['orig_score']}"""
    
conversation = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": instruction},
            # {"role": "assistant", "content": output}
        ]

input_ids = tokenizer.apply_chat_template(
  conversation,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors='pt'
).to("cuda")

output = model.generate(input_ids, max_new_tokens=512)
output_text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(output_text)
```
The model can also run without a reference answer: it evaluates `orig_response` against `orig_instruction` alone. In that case, use the following template instead.
```python
instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
  ###Response to evaluate: {sample['orig_response']}
  ###Score Rubrics: [{sample['orig_criteria']}]
  Score 1: {sample['orig_score1_description']}
  Score 2: {sample['orig_score2_description']}
  Score 3: {sample['orig_score3_description']}
  Score 4: {sample['orig_score4_description']}
  Score 5: {sample['orig_score5_description']}
  ###Feedback:"""
```
Because the feedback was truncated during training, the generated feedback may itself occasionally be cut off.
```
# Result with orig_reference_answer 
# OUTPUT: 이 λŒ€μ‘μ€ κ°ˆλ“± 해결에 λŒ€ν•œ 이해가 뢀쑱함을 보여쀀닀. 동전을 λ˜μ§€κ±°λ‚˜ κ²½μŸμ„ μ œμ•ˆν•˜λŠ” 것과 같이 제곡된 μ œμ•ˆμ€ 문제의 λ³΅μž‘μ„±μ„ λ¬΄μ‹œν•˜κ³  νŒ€ λ‚΄μ˜ λ‹€μ–‘ν•œ 관점을 λ¬΄μ‹œν•œλ‹€. 응닡은 두 μ ‘κ·Όλ²•μ˜ 잠재적 κ°€μΉ˜λ₯Ό μΈμ •ν•˜μ§€ μ•ŠμœΌλ©° νŒ€ ꡬ성원 κ°„μ˜ 이해와 쑴쀑을 μ΄‰μ§„ν•˜μ§€λ„ μ•ŠλŠ”λ‹€. λ˜ν•œ 응닡은 νŒ€μ˜ μ—΄μ •κ³Ό ν”„λ‘œμ νŠΈμ— λŒ€ν•œ ν—Œμ‹ μ„ μΈμ •ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.
    [RESULT] 1
# Result without orig_reference_answer 
# OUTPUT: λŒ€μ‘μ€ κ°ˆλ“± 해결에 λŒ€ν•œ 이해λ₯Ό λ‚˜νƒ€λ‚΄μ§€ μ•ŠλŠ”λ‹€. AI λͺ¨λΈμ€ κ°ˆλ“±μ„ ν•΄κ²°ν•˜κΈ°λ³΄λ‹€λŠ” κ°ˆλ“±μ„ μ•…ν™”μ‹œν‚€λŠ” 것을 μ œμ•ˆν•˜λ©°, μ΄λŠ” 점수 λ£¨λΈŒλ¦­μ— 따라 μš”κ΅¬ 사항에 μ–΄κΈ‹λ‚œλ‹€. 동전을 λ˜μ§€κ³  κ²½μŸμ„ μ œμ•ˆν•˜λŠ” 것은 νŒ€ ꡬ성원 κ°„μ˜ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€λŠ” 데 도움이 λ˜μ§€ μ•Šκ³  였히렀 더 λ§Žμ€ κ°ˆλ“±μ„ μ΄‰λ°œν•  수 μžˆλ‹€. λ˜ν•œ, νŒ€ ꡬ성원이 더 λ‚˜μ€ 아이디어λ₯Ό κ°–λŠ” 것이 μ•„λ‹ˆλΌ "더 λ‚˜μ€" 아이디어λ₯Ό κ°–λŠ”λ‹€λŠ” 것을 μ•”μ‹œν•˜λŠ” 것은 νŒ€ ꡬ성원 κ°„μ˜ 화합을 μ΄‰μ§„ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.
    [RESULT] 1
```
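Before extracting a score, it can be useful to check whether a generation was cut off before the final `[RESULT] n` marker. The helper below is a hypothetical sketch, not part of the model's codebase:

```python
import re

def is_truncated(text: str) -> bool:
    """Return True if the generation lacks a trailing '[RESULT] n' marker."""
    # A complete generation ends with "[RESULT]" followed by an integer score.
    return re.search(r'\[RESULT\]\s*[0-5]\s*$', text.strip()) is None

print(is_truncated("Feedback: good. [RESULT] 4"))       # False
print(is_truncated("Feedback: the response fails to"))  # True
```

When this returns `True`, you might re-generate with a larger `max_new_tokens` rather than trust a partial evaluation.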
If you just want to get a score from the evaluation, you can use the following extract_score function.
```python
import re

def extract_score(text):
    """Extract the integer score after [RESULT]; return 0 if none is found."""
    match = re.search(r'\[RESULT\]\s+([0-5])', text)
    return int(match.group(1)) if match else 0

predict_score = extract_score(output_text)
print(predict_score) # 1
```

### **Heatmap Visualize**
[eng->eng] We randomly sampled 200 examples from the [training data](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection), extracted scores from the model-generated outputs, and compared them to the gold scores. Since the training and test data are not separated here, this only shows how well the model fit the training distribution.

[ko->ko] We sampled 200 examples from this [test set](https://huggingface.co/datasets/nayohan/feedback-collection-ko-chat/viewer/default/test); llama3-8b-it-prometheus-ko was trained on the train split only.

- prometheus-7b-v1.0 (English train -> English inference) # 3 samples failed to output a score, 197 in total
- llama3-8b-it-prometheus-ko (Korean train -> Korean inference) # 200 in total

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6152b4b9ecf3ca6ab820e325/ssZRGTysyiOZD4ttNOD4s.png)
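The heatmap above can be reproduced along these lines: tally gold vs. predicted scores into a 5x5 confusion matrix and plot it. This is a minimal sketch with placeholder data, not the actual evaluation script:

```python
from collections import Counter

# Placeholder labels for illustration; in practice these come from the
# dataset's orig_score field and from extract_score() over model outputs.
gold      = [1, 2, 3, 4, 5, 5, 4, 1]
predicted = [1, 2, 3, 4, 5, 4, 4, 2]

counts = Counter(zip(gold, predicted))
# matrix[g-1][p-1] = number of samples with gold score g predicted as p
matrix = [[counts[(g, p)] for p in range(1, 6)] for g in range(1, 6)]
for row in matrix:
    print(row)
```

Feeding `matrix` into something like `seaborn.heatmap` would then give a plot comparable to the one shown.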

### **Citation**
```bibtex
@misc{kim2023prometheus,
    title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
    author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
    year={2023},
    eprint={2310.08491},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Our training code can be found here: [TBD]