
JinaJudge: Proxy Judgement for Russian LLM Arena

Description

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the Russian LLM Arena, enabling faster and more cost-effective evaluation of language models. While its focus is Russian LLM evaluation, it can also be used for English-centric models.


Model Details

This is an iterative update of the kaleinaNyan/jina-v3-rullmarena-judge-300924 model:

  • Increased the amount of training data (modestly, approximately 1.5x).
  • Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
  • Updated the validation set as well to exclude such errors.
  • Kept the test set unchanged (it contained no such faulty judgements).

Evaluation

Validation was based on existing judgements from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training, as sketched below.
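For illustration, the simplification step can be thought of as collapsing fine-grained arena verdicts into three classes. The label set and mapping below are assumptions for illustration, not the actual preprocessing code:

# Hypothetical mapping from fine-grained arena-style verdicts to the
# three classes used by this model (the real label set may differ).
VERDICT_TO_CLASS = {
    "A>>B": 0, "A>B": 0,   # A wins
    "A=B": 1,              # tie
    "B>A": 2, "B>>A": 2,   # B wins
}

def simplify(verdict: str) -> int:
    return VERDICT_TO_CLASS[verdict]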

NOTE: values in parentheses show the change relative to the previous model.

Models evaluated:

  • gemma-2-9b-it-sppo-iter3
  • glm-4-9b-chat
  • gpt-3.5-turbo-1106
  • mistral-7b-instruct-v0.3
  • storm-7b

Validation Performance (old validation set):

  • Accuracy: 79.97% (-0.78)
  • Precision: 78.25% (-0.31)
  • Recall: 78.25% (-1.23)
  • F1-score: 78.25% (-0.75)

NOTE: the cause of the drop (the subset of corrected judgements or something else) will be investigated and reported later.

Validation Performance (new validation set):

  • Accuracy: 83.59% (+2.48)
  • Precision: 80.97% (+2.14)
  • Recall: 80.97% (+1.22)
  • F1-score: 80.97% (+1.77)

For the test phase, new GPT-4 judgements were generated for the kolibri-mistral-0427-upd model.

Test Performance:

  • Accuracy: 85.09% (+2.37)
  • Precision: 83.20% (+3.09)
  • Recall: 83.20% (+0.78)
  • F1-score: 83.20% (+2.02)

Usage Example

from transformers import AutoModel

# trust_remote_code is required: the judge head lives in the
# repository's custom modeling code.
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model takes a list of formatted examples and returns one prediction
# per example; argmax selects the predicted class. Cast to a plain int
# so it can be used as a dict key below.
judgement = int(jina([example])[0].argmax())

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
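Because the model accepts a list of formatted examples, several pairs can be judged in one call. A minimal batched sketch, assuming the forward call returns one prediction row per input (as in the single-example call above):

pairs = [
    ("prompt 1", "answer a1", "answer b1"),
    ("prompt 2", "answer a2", "answer b2"),
]
examples = [
    prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
    for p, a, b in pairs
]
# One forward pass over the batch; argmax per row picks the class.
judgements = [int(out.argmax()) for out in jina(examples)]
print([judgement_map[j] for j in judgements])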

Generated ranking

The ranking was obtained using a modified version of the Russian LLM Arena code, with all judgements regenerated by the jina-judge model. Regenerating the whole leaderboard takes about 16 minutes on an RTX 3090, or roughly 23 seconds per model (43 models × 23 s ≈ 16.5 minutes). A rough sketch of the regeneration loop follows.
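The sketch below is illustrative only: the file layout and the load_answers helper are hypothetical stand-ins for the modified arena code, and gpt-3.5-turbo-0125 is assumed to be the fixed baseline (its score is pinned at 50.0 in the table below).

import json

def load_answers(path):
    # Hypothetical helper: one JSON object per line,
    # e.g. {"question": ..., "answer": ...}
    with open(path) as f:
        return [json.loads(line) for line in f]

baseline = load_answers("answers/gpt-3.5-turbo-0125.jsonl")
candidate = load_answers("answers/kolibri-mistral-0427-upd.jsonl")

judgements = []
for base, cand in zip(baseline, candidate):
    example = prompt_template.format(
        user_prompt=base["question"],
        assistant_a=base["answer"],  # assistant A: baseline
        assistant_b=cand["answer"],  # assistant B: candidate
    )
    judgements.append(int(jina([example])[0].argmax()))
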

| Model | Score | 95% CI | Average #Tokens |
|---|---:|---|---:|
| gpt-4-1106-preview | 82.8 | (-2.2, 2.3) | 541 |
| gpt-4o-mini | 75.3 | (-2.5, 2.9) | 448 |
| qwen-2.5-72b-it | 73.1 | (-3.4, 3.1) | 557 |
| gemma-2-9b-it-sppo-iter3 | 70.6 | (-3.9, 2.8) | 509 |
| gemma-2-27b-it | 68.7 | (-2.8, 3.8) | 472 |
| t-lite-instruct-0.1 | 67.5 | (-3.8, 3.8) | 810 |
| gemma-2-9b-it | 67.0 | (-3.7, 3.3) | 459 |
| suzume-llama-3-8B-multilingual-orpo-borda-half | 62.4 | (-3.5, 3.7) | 682 |
| glm-4-9b-chat | 61.5 | (-3.7, 3.0) | 568 |
| phi-3-medium-4k-instruct | 60.4 | (-3.5, 3.7) | 566 |
| sfr-iterative-dpo-llama-3-8b-r | 57.2 | (-3.9, 2.2) | 516 |
| c4ai-command-r-v01 | 55.0 | (-3.9, 3.1) | 529 |
| suzume-llama-3-8b-multilingual | 51.9 | (-2.8, 3.7) | 641 |
| mistral-nemo-instruct-2407 | 51.9 | (-3.8, 3.7) | 403 |
| yandex_gpt_pro | 50.3 | (-3.4, 3.1) | 345 |
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
| hermes-2-theta-llama-3-8b | 49.3 | (-3.4, 3.9) | 485 |
| starling-lm-7b-beta | 48.3 | (-3.8, 4.0) | 629 |
| llama-3-8b-saiga-suzume-ties | 47.9 | (-3.9, 5.0) | 763 |
| llama-3-smaug-8b | 47.6 | (-3.6, 3.1) | 524 |
| vikhr-it-5.4-fp16-orpo-v2 | 46.8 | (-2.5, 2.7) | 379 |
| aya-23-8b | 46.1 | (-3.9, 3.9) | 554 |
| saiga_llama3_8b_v6 | 44.8 | (-3.4, 3.3) | 471 |
| qwen2-7b-instruct | 43.6 | (-3.0, 2.7) | 340 |
| vikhr-it-5.2-fp16-cp | 43.6 | (-4.1, 3.3) | 543 |
| openchat-3.5-0106 | 42.8 | (-3.9, 3.3) | 492 |
| kolibri-mistral-0427-upd | 42.3 | (-4.2, 3.2) | 551 |
| paralex-llama-3-8b-sft | 41.8 | (-3.2, 3.7) | 688 |
| llama-3-instruct-8b-sppo-iter3 | 41.7 | (-3.4, 3.3) | 502 |
| gpt-3.5-turbo-1106 | 41.5 | (-2.9, 2.1) | 191 |
| mistral-7b-instruct-v0.3 | 41.1 | (-4.3, 3.5) | 469 |
| gigachat_pro | 40.9 | (-3.4, 3.6) | 294 |
| openchat-3.6-8b-20240522 | 39.1 | (-3.2, 4.1) | 428 |
| vikhr-it-5.3-fp16-32k | 38.8 | (-3.5, 3.3) | 519 |
| hermes-2-pro-llama-3-8b | 38.4 | (-3.2, 3.1) | 463 |
| kolibri-vikhr-mistral-0427 | 34.5 | (-2.9, 3.5) | 489 |
| vikhr-it-5.3-fp16 | 33.5 | (-3.5, 3.8) | 523 |
| llama-3-instruct-8b-simpo | 32.7 | (-3.9, 3.6) | 417 |
| meta-llama-3-8b-instruct | 32.1 | (-3.4, 3.3) | 450 |
| neural-chat-7b-v3-3 | 25.9 | (-2.7, 3.6) | 927 |
| gigachat_lite | 25.4 | (-2.8, 2.5) | 276 |
| snorkel-mistral-pairrm-dpo | 10.3 | (-2.0, 2.3) | 773 |
| storm-7b | 3.7 | (-1.3, 1.6) | 419 |
