CompassJudger-1
๐ค Hugging Face | ๐ค ModelScope | ๐ Paper | ๐๏ธ Leaderboard
Introduction
The CompassJudger-1 series are an All-in-one Judge Models introduced by Opencompass. These models not only excel in various evaluation methods through scoring and comparison but also can output reviews with assessment details in a specified format, making them suitable for any evaluation dataset. Moreover, they can perform general tasks akin to a typical instruction model, thus serving as a versatile tool with strong generalization and judging capabilities.
- Comprehensive Evaluation Capabilities: CompassJudger-1 is capable of executing multiple evaluation methods, including but not limited to scoring, comparison, and providing detailed assessment feedback.
- Formatted Output: Supports outputting in a specific format as per instructions, facilitating further analysis and understanding of the evaluation results.
- Versatility: In addition to its evaluation functions, CompassJudger-1 can also act as a universal instruction model to accomplish daily tasks. It also supports model inference acceleration methods such as vLLM and LMdeploy.
Quick Start
Here provides a code to show you how to load the tokenizer and model and how to generate contents.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "opencompass/CompassJudger-1-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = """your prompt"""
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=2048
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
We also provide some examples for different usage situations:
General Chat
**Input**: Hello, can you help me to judge something?
**Output**: Of course! I'd be happy to help you make a judgment or provide any assistance you need. Please tell me what you're looking to evaluate or understand.
Judge as Reward Model
**Input**: ```Please read the dialogue between the two assistants and the user to determine which assistant performed better during the conversation.Here is the dialogue content:
[Dialogue Begin]
User: What is a 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter?
Assistant A: Aardvark.
Assistant B: The word that meets the given criteria is \"adroit\".
User: \"D\" shouldn't be the second letter and the word must be a 5-letter word.
Assistant A: Aardvark.
Assistant B: I apologize for the confusion. A 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter is \"ardor\".
[Dialogue End]
If you believe Assistant A performed better, please output A directly.\nIf you believe Assistant B performed better, please output B directly.\nDo not output any other content, just the option. Please output:```
**Output**: B
Point-wise Judge
**Input**: ```ไฝ ๆฏไธไธชๆ
้ฟ่ฏไปทๆๆฌ่ดจ้็ๅฉๆใ\n่ฏทไฝ ไปฅๅ
ฌๆญฃ็่ฏๅค่
็่บซไปฝ๏ผ่ฏไผฐไธไธชAIๅฉๆๅฏนไบ็จๆทๆ้ฎ็ๅ็ญ็่ดจ้ใ็ฑไบๆจ่ฏไผฐ็ๅ็ญ็ฑปๅๆฏ่ง่ฒๆฎๆผ๏ผๅ ๆญคไฝ ้่ฆไปไธ้ข็ๅ ไธช็ปดๅบฆๅฏนๅ็ญ่ฟ่ก่ฏไผฐ:\n1. ไบๅฎๆญฃ็กฎๆง: ๅ็ญไธญๆไพ็ไฟกๆฏๆฏๅฆๅ็กฎๆ ่ฏฏ๏ผๆฏๅฆๅบไบๅฏไฟก็ไบๅฎๅๆฐๆฎใ\n2. ๆปก่ถณ็จๆท้ๆฑ: ๅ็ญๆฏๅฆๆปก่ถณไบ็จๆทๆๅบ้ฎ้ข็็ฎ็ๅ้ๆฑ๏ผๆฏๅฆๅฏน้ฎ้ข่ฟ่กไบๅ
จ้ข่ๆฐๅฝ็ๅๅบใ\n3. ้ป่พ่ฟ่ดฏๆง: ๅ็ญๆฏๅฆๅจๆดไฝไธไฟๆไธ่ด๏ผๆฏๅฆๅจไธๅ้จๅไน้ดไฟๆ้ป่พ่ฟ่ดฏๆง๏ผ้ฟๅ
ไบ่ช็ธ็็พใ\n4. ๅ้ ๆง: ๅ็ญๆฏๅฆๅ
ทๆๅๆฐๆงๆ็ฌ็นๆง๏ผๆฏๅฆๆไพไบๆฐ้ข็่ง่งฃๆ่งฃๅณๆนๆณใ\n5. ไธฐๅฏๅบฆ: ๅ็ญๅ
ๅซไธฐๅฏ็ไฟกๆฏใๆทฑๅบฆใไธไธๆ่่ใๅคๆ ทๆงใ่ฏฆ็ป่งฃ้ๅๅฎไพ๏ผไปฅๆปก่ถณ็จๆท้ๆฑๅนถๆไพๅ
จ้ข็่งฃใ\nๆไปฌไผ็ปๆจๆไพ็จๆท็ๆ้ฎ๏ผ้ซ่ดจ้็ๅ่็ญๆก๏ผๅ้่ฆไฝ ่ฏไผฐ็AIๅฉๆ็็ญๆกใๅฝไฝ ๅผๅงไฝ ็่ฏไผฐๆถ๏ผไฝ ้่ฆๆ็
ง้ตๅฎไปฅไธ็ๆต็จ๏ผ\n1. ๅฐAIๅฉๆ็็ญๆกไธๅ่็ญๆก่ฟ่กๆฏ่พ๏ผๆๅบAIๅฉๆ็็ญๆกๆๅชไบไธ่ถณ๏ผๅนถ่ฟไธๆญฅ่งฃ้ใ\n2. ไปไธๅ็ปดๅบฆๅฏนAIๅฉๆ็็ญๆก่ฟ่ก่ฏไปท๏ผๅจๆฏไธช็ปดๅบฆ็่ฏไปทไนๅ๏ผ็ปๆฏไธไธช็ปดๅบฆไธไธช1๏ฝ10็ๅๆฐใ\n3. ๆๅ๏ผ็ปผๅๆฏไธช็ปดๅบฆ็่ฏไผฐ๏ผๅฏนAIๅฉๆ็ๅ็ญ็ปๅบไธไธช1๏ฝ10็็ปผๅๅๆฐใ\n4. ไฝ ็ๆๅ้่ฆๅฐฝๅฏ่ฝไธฅๆ ผ๏ผๅนถไธ่ฆ้ตๅฎไธ้ข็่ฏๅ่งๅ๏ผๆป็ๆฅ่ฏด๏ผๆจกๅๅ็ญ็่ดจ้่ถ้ซ๏ผๅๅๆฐ่ถ้ซใๅ
ถไธญ๏ผไบๅฎๆญฃ็กฎๆงๅๆปก่ถณ็จๆท้ๆฑ่ฟไธคไธช็ปดๅบฆๆฏๆ้่ฆ็๏ผ่ฟไธคไธช็ปดๅบฆ็ๅๆฐไธปๅฏผไบๆๅ็็ปผๅๅๆฐใๅฝๆจกๅๅ็ญๅญๅจไธ้ฎ้ขไธ็ธๅ
ณ๏ผๆ่
ๆๆฌ่ดจๆง็ไบๅฎ้่ฏฏ๏ผๆ็ๆไบๆๅฎณๅ
ๅฎนๆถ๏ผๆปๅๅฟ
้กปๆฏ1ๅฐ2ๅ๏ผๅฝๆจกๅๅ็ญๆฒกๆไธฅ้้่ฏฏ่ไธๅบๆฌๆ ๅฎณ๏ผไฝๆฏ่ดจ้่พไฝ๏ผๆฒกๆๆปก่ถณ็จๆท้ๆฑ๏ผๆปๅไธบ3ๅฐ4ๅ๏ผๅฝๆจกๅๅ็ญๅบๆฌๆปก่ถณ็จๆท่ฆๆฑ๏ผไฝๆฏๅจ้จๅ็ปดๅบฆไธ่กจ็ฐ่พๅทฎ๏ผ่ดจ้ไธญ็ญ๏ผๆปๅๅฏไปฅๅพ5ๅฐ6ๅ๏ผๅฝๆจกๅๅ็ญ่ดจ้ไธๅ่็ญๆก็ธ่ฟ๏ผๅจๆๆ็ปดๅบฆไธ่กจ็ฐ่ฏๅฅฝ๏ผๆปๅๅพ7ๅฐ8ๅ๏ผๅชๆๅฝๆจกๅๅ็ญ่ดจ้ๆพ่่ถ
่ฟๅ่็ญๆก๏ผๅ
ๅๅฐ่งฃๅณไบ็จๆท้ฎ้ขๅๆๆ้ๆฑ๏ผๅนถไธๅจๆๆ็ปดๅบฆไธ้ฝๆฅ่ฟๆปกๅ็ๆ
ๅตไธ๏ผๆ่ฝๅพ9ๅฐ10ๅใไฝไธบ็คบไพ๏ผๅ่็ญๆกๅฏไปฅๅพๅฐ8ๅใ\n่ฏท่ฎฐไฝ๏ผไฝ ๅฟ
้กปๅจไฝ ๆๅๅ่ฟ่ก่ฏไปทๅ่งฃ้ใๅจไฝ ๅฏนๆฏไธช็ปดๅบฆ็่งฃ้ไนๅ๏ผ้่ฆๅ ไธๅฏน่ฏฅ็ปดๅบฆ็ๆๅใไนๅ๏ผๅจไฝ ๅ็ญ็ๆซๅฐพ๏ผๆ็
งไปฅไธๅญๅ
ธๆ ผๅผ๏ผๅ
ๆฌๆฌๅท๏ผ่ฟๅไฝ ๆๆ็ๆๅ็ปๆ๏ผๅนถ็กฎไฟไฝ ็ๆๅ็ปๆๆฏๆดๆฐ๏ผ\n{'็ปดๅบฆไธ': ๆๅ, '็ปดๅบฆไบ': ๆๅ, ..., '็ปผๅๅพๅ': ๆๅ}๏ผไพๅฆ๏ผ{'ไบๅฎๆญฃ็กฎๆง': 9, 'ๆปก่ถณ็จๆท้ๆฑ': 6, ..., '็ปผๅๅพๅ': 7}ใ\n็จๆท็ๆ้ฎ๏ผ ๆๆณ่ฎฉไฝ ๅๆฎๆๅไธไธช20ๅฒ็ๅฅณ็่ๅคฉ็่ๅคฉๆกใๆ่พๅ
ฅๅ20ๅฒๅฅณ็็่ๅคฉๅ
ๅฎน๏ผไฝ ไป
ไป
ๅช่พๅบไธไธช20ๅฒๅฅณ็ๅฏ่ฝไผๅๅค็ๅ
ๅฎน๏ผไธ่ฆ่พๅบไปปไฝๅคไฝ็ๆๆฌใไฝ ๅนณๆถๆไปไน็ฑๅฅฝๅ๏ผ\n[ๅ่็ญๆกๅผๅง]\nๆๅๆฌข้
่ฏป๏ผๅฐคๅ
ถๆฏๅฐ่ฏดๅ่ฏ่ฏ๏ผ่ฟๆๅฐฑๆฏๅๆฌขๅฌ้ณไนๅ็ป็ปใ\n[ๅ่็ญๆก็ปๆ]\n[ๅฉๆ็็ญๆกๅผๅง]\nๅๆฌขๅฌ้ณไนๅ็็ตๅฝฑใไฝ ๅข๏ผ\n[ๅฉๆ็็ญๆก็ปๆ]```
**Output**: ```ๅฐAIๅฉๆ็็ญๆกไธๅ่็ญๆก่ฟ่กๆฏ่พ๏ผๆๅบAIๅฉๆ็็ญๆกๆๅชไบไธ่ถณ๏ผๅนถ่ฟไธๆญฅ่งฃ้ใAIๅฉๆ็็ญๆกๆไพไบไธคไธช็ฑๅฅฝ๏ผๅฌ้ณไนๅ็็ตๅฝฑ๏ผๅนถไธไปฅไธ็งๅ้ฎๅฅ็ปๆ๏ผ่ฏข้ฎๅฏนๆน็็ฑๅฅฝใไธๅ่็ญๆก็ธๆฏ๏ผAIๅฉๆ็ๅ็ญๅจๅ
ๅฎนไธ่พไธบ็ฎๅ๏ผๆฒกๆๆไพ่ถณๅค็ไฟกๆฏๆฅๅฑ็คบๅ
ถไธฐๅฏๅบฆๅๅ้ ๆงใๅๆถ๏ผๅ่็ญๆกไธญๆๅฐ็โ้
่ฏปโๅโ็ป็ปโๆพ็คบไบๆดๅค็ๅคๆ ทๆงๅไธชๆงๅ็็ฑๅฅฝ๏ผ่ไธไป
ไป
ๆฏๅฌ้ณไนๅ็็ตๅฝฑใไปไธๅ็ปดๅบฆๅฏนAIๅฉๆ็็ญๆก่ฟ่ก่ฏไปท๏ผ
ไบๅฎๆญฃ็กฎๆง๏ผๅจ่ฟไธชๆ
ๅขไธญ๏ผไบๅฎๆญฃ็กฎๆงๅนถไธๆฏไธไธชไธป่ฆ็่ฏไผฐๆ ๅ๏ผๅ ไธบๅ็ญ็ๆฏๅ่ฎพ็ๅบๆฏใ็ถ่๏ผAIๅฉๆ็ๅ็ญๆฒกๆๆถๅไปปไฝไธๅ็กฎ็ไฟกๆฏ๏ผๅ ๆญคๅจ่ฟไธช็ปดๅบฆไธๆฏๅ้็ใๆๅ๏ผ8ใ
ๆปก่ถณ็จๆท้ๆฑ๏ผ็จๆทๅธๆๅพๅฐไธไธช20ๅฒๅฅณ็ๅฏ่ฝไผๅๅค็ๅ
ๅฎนใAIๅฉๆ็ๅ็ญๆไพไบไธไบ็ฑๅฅฝ๏ผไฝๆฏๅ็ญ็ๅ
ๅฎน่พไธบ็ฎๅ๏ผๆฒกๆๅ
ๅๆปก่ถณ็จๆทๅฏ่ฝๆๅพ
็ไธฐๅฏๆงๅๅคๆ ทๆงใๆๅ๏ผ5ใ
้ป่พ่ฟ่ดฏๆง๏ผAIๅฉๆ็ๅ็ญๅจ้ป่พไธๆฏ่ฟ่ดฏ็๏ผๆไพไบไธไธชๅ็็็ฑๅฅฝ๏ผๅนถไปฅไธ็ง่ช็ถ็ๆนๅผ็ปๆๅฏน่ฏใๆๅ๏ผ8ใ
ๅ้ ๆง๏ผAIๅฉๆ็ๅ็ญ็ผบไนๅ้ ๆง๏ผๆไพ็็ฑๅฅฝๆฏ่พๅธธ่ง๏ผๆฒกๆๅฑ็คบๅบๆฐ้ขๆ็ฌ็น็่ง่งฃใๆๅ๏ผ4ใ
ไธฐๅฏๅบฆ๏ผAIๅฉๆ็ๅ็ญ่พไธบ็ฎ็ญ๏ผ็ผบไนๆทฑๅบฆๅ็ป่๏ผๆช่ฝๆไพๅคๆ ทๅ็็ฑๅฅฝๆ่ฏฆ็ป็่งฃ้๏ผๅ ๆญคๅจไธฐๅฏๅบฆไธ่กจ็ฐ่พๅทฎใๆๅ๏ผ4ใ
็ปผๅๆฏไธช็ปดๅบฆ็่ฏไผฐ๏ผๅฏนAIๅฉๆ็ๅ็ญ็ปๅบไธไธช1๏ฝ10็็ปผๅๅๆฐใ
็ปผๅๆฅ็๏ผAIๅฉๆ็ๅ็ญๅบๆฌๆปก่ถณไบ็จๆท็้ๆฑ๏ผไฝ็ผบไนไธฐๅฏๅบฆๅๅ้ ๆง๏ผไธๅ่็ญๆก็ธๆฏ่ฟๆไธๅฎๅทฎ่ทใ็ปผๅๅพๅ๏ผ5ใ
{'ไบๅฎๆญฃ็กฎๆง': 8, 'ๆปก่ถณ็จๆท้ๆฑ': 5, '้ป่พ่ฟ่ดฏๆง': 8, 'ๅ้ ๆง': 4, 'ไธฐๅฏๅบฆ': 4, '็ปผๅๅพๅ': 5}```
Pair-wise Judge
**Input**: ```Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".\n<|User Prompt|>\nUse ABC notation to write a melody in the style of a folk tune.\n\n<|The Start of Assistant A's Answer|>\nX:1\nT:Untitled Folk Tune\nM:4/4\nL:1/8\nK:G\n|:G2A2|B2A2|G2E2|D4|E2F2|G2F2|E2C2|B,4|\nA2B2|c2B2|A2F2|E4|D2E2|F2E2|D2B,2|C4:|\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\nSure, I'll create a simple folk tune in ABC notation for you. ABC notation is a simple way to write down music using letters to represent notes and various symbols to represent timing and other musical elements. Here's a short, cheerful folk tune:\n\n\nX:1\nT:Folk Tune\nM:4/4\nL:1/4\nK:G\nGABc dedB|ABAF GFE2|GABc dedB|1 ABAG FGA2:|2 ABAG FGAB|\nc2ec B2dB|c2ec BAGB|c2ec B2dB|1 AGFE D3B:|2 AGFE D3z|]\n\n\nHere's a breakdown of what each part means:\n\n- `X:` The index number of the tune.\n- `T:` The title of the tune.\n- `M:` The meter, in this case, 4/4 time.\n- `L:` The default note length, in this case, 1/4.\n- `K:` The key of the tune, in this case, G major.\n\nThe tune itself is written in two parts, each ending with a repeat mark (`:|`). The first part (`GABc dedB|ABAF GFE2|...`) is repeated with a different ending (`1 ABAG FGA2:|2 ABAG FGAB|`) to add variety. The second part (`c2ec B2dB|c2ec BAGB|...`) also has a repeat with a different ending, and it ends with a long note (`D3B` or `D3z`) to give it a sense of finality.\n\nThis tune is designed to be simple and catchy, with a mix of stepwise motion and skips to give it a lively folk feel. You can play it on any melodic instrument, and it's suitable for singing as well if you add lyrics.\n<|The End of Assistant B's Answer|>\n```
**Output**: ```To generate a melody in the style of a folk tune using ABC notation, I would create a simple and repetitive melody, typical of folk music. Here is my attempt: X:1 T:Folk Tune Example M:4/4 L:1/8 K:D
|: D2 E2 | F2 E2 | D2 A2 | B4 |
| A2 B2 | c2 B2 | A2 F2 | E4 :|
| D2 E2 | f2 e2 | d2 A2 | G4 |
| F2 G2 | A2 G2 | F2 D2 | E4 :|
:|
Explanation: `X:` is the tune's index number. `T:` is the title of the tune. `M:` specifies the meter (4/4 time). `L:` sets the default note length to 1/8. `K:` indicates the key (D major).
The melody is constructed with simple intervals and repetitive phrases, which are characteristic of folk music. The tune has a clear structure with two verses and a repeat sign (`:`) to indicate repetition.
Now, let's evaluate the responses from Assistant A and Assistant B.
Assistant A's Answer: The melody provided by Assistant A is concise and follows the standard ABC notation. The choice of notes and structure is simple, which is fitting for a folk tune. The use of repetition and simple progression is consistent with folk music characteristics. The melody is clear and could be easily played by a musician.
Assistant B's Answer: Assistant B provides a detailed explanation of ABC notation, which is helpful for context. The melody includes more complex elements, such as varied repeat sections and additional symbols like `1` and `2`, which might not be typical for a straightforward folk tune. The explanation about the structure and the breakdown of each part is informative and adds value to the response. However, the use of `1` and `2` in the melody structure is unconventional and might confuse someone unfamiliar with ABC notation.
Evaluation: Both assistants provide a melody in ABC notation. However, Assistant B's response includes unnecessary complexity for a folk tune and might be confusing due to unconventional use of symbols. Assistant A's response is more straightforward and adheres closely to the expectations of a folk tune in ABC notation.
My final verdict is that Assistant A's answer is slightly better due to its simplicity and directness in creating a folk tune: [[A>B]].```
Response Critique
**Input**: ```ๅธฎๆ้ๅฏนไธ้ข็ๅ็ญ็ปๅบไฟฎๆน็ๆ่ง๏ผ็จๆท้ฎ้ข๏ผๅไธ้ฆ็งๅคฉ็่ฏ๏ผไธ่ฆๆ่ฝๅถไธคไธชๅญ๏ผไฝ่ฆ่ฝไฝ็ฐ่ฝๅถ็ๆ
ๆฏ ๆจกๅๅ็ญ๏ผ็ง้ฃ่ตท๏ผๆผซๅคฉ้ป๏ผ้ฃ้ถๆฏ้ฃ็ฆปๆๅฟใ
็ฟ ็ปฟๆพๆตไปๅทฒๆ๏ผๅฒๆๆ ๆ ๏ผ่ฐๆฟ่ฏ็ฆปไผคใ
ๆญฅๅฑฅๅพ๏ผๅฝฑ่ช้ฟ๏ผๆ ๆขข็จ็ๆ ๆ้ณใ
ๆพ็ป็นๅ็่ฟ็ผ๏ผๅฆไปๅฏๅฏ๏ผ้บๆปกๅค้ๆใ
ๅฒๆๆฒณ๏ผ้ๆๆ๏ผ่ฝป่ๅทฒ่ฟไธ้ๅฑฑใ
้ฃๅนดๅคๅคฉ๏ผ็ฌ่ฏญๆฌ๏ผ่ไปๆไฝ๏ผ่ฝๅถ็ๆฒ้ฆใ```
**Output**: ```่ฟ้ฆ่ฏ้ๅธธๆๆๅข๏ผๆดไฝไธๅพๅฅฝๅฐไฝ็ฐไบ็งๅคฉ็ๆฐๅดไปฅๅไธ่ฝๅถ็ธๅ
ณ่็็ฆปๅซไธๅ่ฟๆใไธ่ฟ็จๆท่ฆๆฑไธญ็นๅซๅผบ่ฐไบโไธ่ฆๆ่ฝๅถไธคไธชๅญ๏ผไฝ่ฆ่ฝไฝ็ฐ่ฝๅถ็ๆ
ๆฏโใๅบไบๆญค๏ผๆไปฌๅฏไปฅๅฏน่ฏไฝ่ฟ่กไธไบ่ฐๆด๏ผไปฅๆดๅ ่ดดๅ็จๆท็้ๆฑใ่ฟ้็ปๅบไธไบๅปบ่ฎฎ๏ผ
โ็ง้ฃ่ตท๏ผๆผซๅคฉ้ป๏ผ้ฃ้ถๆฏ้ฃ็ฆปๆๅฟใโ
ๅฏไปฅๆนไธบ๏ผโ็ง้ฃ่ตท๏ผๆผซๅคฉ้ป๏ผ็ฉบไธญ่ๅจ็ฆปๆๅฟใโ
่ฟๆ ทๅฐฑ้ฟๅ
ไบ็ดๆฅไฝฟ็จโ้ฃ้ถโไธ่ฏ๏ผๅๆถไพ็ถไผ ่พพไบๅถๅญไปๆ ไธ้ฃ่ฝ็ๆ
ๆฏใ
โ็ฟ ็ปฟๆพๆตไปๅทฒๆ๏ผๅฒๆๆ ๆ ๏ผ่ฐๆฟ่ฏ็ฆปไผคใโ
ๅฏไปฅๆนไธบ๏ผโ็ฟ ็ปฟๆพๆตไปๅทฒๆ๏ผๅฒๆๆ ้ฟ๏ผๆ
ไบๆๆ ๅฃฐใโ
่ฟๆ ทๆนๅจๅ๏ผ่ฝ็ถๆฒกๆ็ดๆฅ่กจ่พพ็ฆปไผค๏ผไฝไพ็ถ่ฝ่ฎฉไบบๆๅๅฐ้็ๆถ้ด็ๆต้๏ผๆพ็ป็็นๅๅทฒ็ป้ๅป็ๅฏๅฏฅใ
โๆญฅๅฑฅๅพ๏ผๅฝฑ่ช้ฟ๏ผๆ ๆขข็จ็ๆ ๆ้ณใโ
่ฟไธๅฅๅทฒ็ปๅพๅฅฝๅฐๆ็ปไบๅๆๆถๅ็ๅฎ้ไธๅญค็ฌ๏ผๅฏไปฅไฟๆไธๅใ
โๆพ็ป็นๅ็่ฟ็ผ๏ผๅฆไปๅฏๅฏ๏ผ้บๆปกๅค้ๆใโ
ๅฏไปฅๆนไธบ๏ผโๅพๆ็นๅๆ่ฟฝๅฟ๏ผๅฆไปๅฏๅฏ๏ผๅค้ๆ ๆ้ณใโ
่ฟๆ ทๆนๅจๅ๏ผๆข่กจ่พพไบๆพ็ป็็นๅ๏ผๅ็ชๅบไบๅฝๅ็ๅฏๅฏไธ่ฝๅฏใ
โๅฒๆๆฒณ๏ผ้ๆๆ๏ผ่ฝป่ๅทฒ่ฟไธ้ๅฑฑใโ
่ฟไธๅฅๅฏไปฅไฟๆไธๅใ
โ้ฃๅนดๅคๅคฉ๏ผ็ฌ่ฏญๆฌ๏ผ่ไปๆไฝ๏ผ่ฝๅถ็ๆฒ้ฆใโ
ๅฏไปฅๆนไธบ๏ผโ้ฃๅนดๅคๅคฉ๏ผ็ฌ่ฏญๆฌ๏ผ่ไปๆไฝ๏ผๆฒ้ฆๆผซๆผซใโ
้ฟๅ
ไบ็ดๆฅๆๅฐโ่ฝๅถโ๏ผๅๆถไฟ็ไบ็งๅคฉ็ๆๆไธๅฒๆ็ๅ่ฟใ
ไฟฎๆนๅ็่ฏไฝๅฆไธ๏ผ
็ง้ฃ่ตท๏ผๆผซๅคฉ้ป๏ผ็ฉบไธญ่ๅจ็ฆปๆๅฟใ
็ฟ ็ปฟๆพๆตไปๅทฒๆ๏ผๅฒๆๆ ้ฟ๏ผๆ
ไบๆๆ ๅฃฐใ
ๆญฅๅฑฅๅพ๏ผๅฝฑ่ช้ฟ๏ผๆ ๆขข็จ็ๆ ๆ้ณใ
ๆพ็ป็นๅๆ่ฟฝๅฟ๏ผๅฆไปๅฏๅฏ๏ผๅค้ๆ ๆ้ณใ
ๅฒๆๆฒณ๏ผ้ๆๆ๏ผ่ฝป่ๅทฒ่ฟไธ้ๅฑฑใ
้ฃๅนดๅคๅคฉ๏ผ็ฌ่ฏญๆฌ๏ผ่ไปๆไฝ๏ผๆฒ้ฆๆผซๆผซใ
่ฟ้ฆไฟฎๆนๅ็่ฏไพ็ถๆ็ปไบ็งๅคฉ็ๆฏ่ฒไธๅญฃ่ๅธฆๆฅ็ๆ็ปช๏ผๅๆถๅทงๅฆๅฐ้ฟๅ
ไบไฝฟ็จโ่ฝๅถโไธ่ฏ๏ผ่้่ฟๅ
ถไปๆ่ฑกไผ ่พพไบ่ฝๅถ็ๆ
ๆฏใ```
JudgerBench
We have also established a new benchmark named JudgerBench, aimed at standardizing the evaluation capabilities of different judging models, thereby helping to identify more effective evaluator models.
To test your judge model on JudgerBench, please follow below code with Opencompass:
Change the models to your models in configs/eval_judgerbench.py
then run
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
python run.py configs/eval_judgerbench.py --mode all --reuse latest
We also provided a leaderboard for JudgerBench: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard
If you want to add your model to this leaderboard, welcome to add an issue in this Repository.
Use CompassJudger-1 to Test Subjective Datasets in OpenCompass
If you wish to evaluate common subjective datasets using CompassJudger-1 in Opencompass, take the evaluation of Alignbench as an example. Please follow the code below:
You need to setup three items first:
- 1.datasets (The subjective datasets you want to test)
- 2.models (The models you want to test on the subjective datasets)
- 3.judge_models (Which judge models you want to use as evaluator)
For more settings, please refer to the advanced guidance in OpenCompass.
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI, TurboMindModelwithChatTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
# -------------Inference Stage ----------------------------------------
models = [*lmdeploy_qwen2_5_1_5b_instruct] # add models you want
datasets = [*alignbench_datasets] # add datasets you want
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)
# -------------Evalation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [dict(
dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-7B-Instruct',
path='opencompass/CompassJudger-1-7B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
)]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
Then run:
python run.py configs/eval_subjective.py --mode all --reuse latest
For more detailed subjective evaluation guidelines, please refer to: https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/subjective_evaluation.md
Subjective Evaluation Leaderboard by CompassJudger-1
To facilitate better comparisons within the community, we have tested the subjective performance of some models using CompassJudger-1.
See in: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard
If you want to add your model to this leaderboard, welcome to add an issue in this Repository.
Citation
@article{cao2024compass,
title={CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution},
author={Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen},
journal={arXiv preprint arXiv:2410.16256},
year={2024}
}
Acknowledge
- Downloads last month
- 281