
JinaJudge: Proxy Judgement for Russian LLM Arena

Description

This model was trained to replicate the judgement patterns of GPT-4-1106-Preview in the Russian LLM Arena, enabling faster and more cost-effective evaluation of language models. While the focus is on Russian LLM evaluation, it can also be applied to English-centric models.


Model Details

  • Architecture: Utilizes a jina-embeddings-v3 encoder for feature extraction, followed by 4 transformer-decoder blocks.
  • Data Source: The training data was collected from the Russian LLM Arena. Data contradictions were filtered, and transitive examples were added for better generalization.
  • Judgement Classes: Though the original arena uses five judgement categories (A>>B, A>B, A=B, B>A, B>>A), the model consolidates them into three simplified classes (see the mapping sketch after this list):
    • A > B
    • A = B
    • B > A
  • Training: The model underwent full-weight fine-tuning with the Adam optimizer over 30 epochs. A maximum sequence length of 4096 was set, and the best weights were chosen based on final performance.
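A minimal sketch of the class consolidation described above. The raw label strings and the helper name are illustrative assumptions, not part of the released training code.

# Illustrative mapping from the arena's five judgement categories to the
# three classes the model predicts. Label strings are assumed, not official.
FIVE_TO_THREE = {
    "A>>B": 0,  # A > B
    "A>B":  0,  # A > B
    "A=B":  1,  # A = B
    "B>A":  2,  # B > A
    "B>>A": 2,  # B > A
}

def consolidate(label: str) -> int:
    """Map a five-way arena judgement to the three training classes."""
    return FIVE_TO_THREE[label]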

Evaluation

Validation was based on judgements already available from the Russian LLM Arena, filtered and simplified to match the three-class structure used in training.

Models evaluated:

  • gemma-2-9b-it-sppo-iter3
  • glm-4-9b-chat
  • gpt-3.5-turbo-1106
  • mistral-7b-instruct-v0.3
  • storm-7b

Validation Performance:

  • Accuracy: 78.09%
  • Precision: 75.82%
  • Recall: 76.77%
  • F1-score: 76.27%

For the test phase, new judgements were generated using GPT-4 for the kolibri-mistral-0427-upd model.

Test Performance:

  • Accuracy: 80.07%
  • Precision: 76.68%
  • Recall: 77.73%
  • F1-score: 77.08%
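For reference, metrics of this kind for a three-class task can be computed as in the sketch below. The macro averaging and the toy label arrays are assumptions for illustration, not the card's actual evaluation pipeline.

# Illustrative only: accuracy / precision / recall / F1 for the three-class
# judgement task, computed with scikit-learn. Macro averaging is assumed.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 0, 2]   # reference judgements (0: A>B, 1: A=B, 2: B>A)
y_pred = [0, 1, 1, 0, 2]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.2%} p={precision:.2%} r={recall:.2%} f1={f1:.2%}")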

Error Analysis

Upon reviewing erroneous predictions, the following observations were made:

  1. Preference for English: In some cases, the model prefers English responses over superior Russian ones.
  2. Difficulty with Paraphrasing: The model occasionally struggles with distinguishing between paraphrased responses.
  3. Ambiguous Prompts: A significant portion of the errors arises from prompts in the Russian LLM Arena that don't allow for deterministic judgements, leading to noise in the evaluation data.

While there is potential to improve alignment between this model and GPT-4, achieving an accuracy beyond 85% is unlikely due to the inherent noise in the benchmarks.


Usage Example

from transformers import AutoModel

# trust_remote_code is required because the model ships custom modeling code
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model takes a list of formatted examples and returns one score vector per
# example; argmax over the three classes gives the predicted judgement.
judgement = jina([example])[0].argmax().item()

judgement_map = {
  0: "A is better than B",
  1: "A == B",
  2: "B is better than A"
}

print(judgement_map[judgement])
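Since the model is called on a list of formatted examples, several prompt/response pairs can be scored in one call. The continuation below is a minimal sketch under that assumption, reusing example and judgement_map from the snippet above.

# Batched usage sketch: score several formatted pairs in one forward call.
# Assumes one score vector is returned per input example, as above.
examples = [example, example]  # replace with your own formatted pairs
outputs = jina(examples)
for i, row in enumerate(outputs):
    print(i, judgement_map[row.argmax().item()])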