Phudge-3. Phi-3 as Scalable Judge

A robust production grade and scalable SOTA (4 Benchmarks) model for Relative and Absolute grading of LLM (as well human) responses.

Given a question and it's response, it can judge the quality of response from a scale of 1-5. It is trained to be used in Absolute (1 Question - 1 Answer) but can be used as Relative task too. It is supposed to work on Reference free settings too. So you can use it as following:

Question + Response to evaluate
Question + Response to evaluate + Custom Rubric (scoring criteria for your business use case)
Question + Response to evaluate + Custom Rubric + Reference Answer (A high Quality Answer which serves as the base)

Model adapted from https://github.com/deshwalmahesh/PHUDGE to make it compatible with HuggingFace Hub.

Example usage

from transformers import AutoTokenizer, Phi3ForSequenceClassification
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("vicgalle/Phudge-3", trust_remote_code=True)
model = Phi3ForSequenceClassification.from_pretrained("vicgalle/Phudge-3", trust_remote_code=True, 
                                                      torch_dtype=torch.bfloat16, device_map="cuda")

def predict(model, tokenizer, test_data, MAX_LENGTH=1656, BATCH_SIZE=1):
    results = []
    with torch.no_grad():
            
        batches = [test_data[i:i + BATCH_SIZE] for i in range(0, len(test_data), BATCH_SIZE)]
        for batch in batches:
            inputs = tokenizer(batch, truncation= True, max_length=MAX_LENGTH, padding="max_length", 
                        return_tensors = "pt").to(model.device)

            logits = model(**inputs).logits.cpu().to(torch.float32)
            scores = np.clip(logits.numpy(), 1,5).tolist()
            results.extend(scores)

    return results

TEXT = """<|system|>
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.<|end|>

<|user|>
###The instruction to evaluate:
I'm working on a project that involves creating a user-friendly chatbot for a digital library. The users should be able to ask the chatbot for book recommendations based on their preferences. However, users' preferences can be vague or ambiguous. For instance, a user might say "I want a book like Harry Potter but different", "I liked the character development in Pride and Prejudice, suggest something similar", or "Do you have a book that's exciting and thought-provoking but not too difficult to read?". How can the chatbot handle such ambiguous or vague inputs to provide accurate book recommendations?

###Response to evaluate:
To handle ambiguous or vague inputs, the chatbot should be able to interpret the user's preferences based on context and keywords. For example, if a user wants a book like Harry Potter but different, the chatbot could suggest fantasy books with different plots or characters. If a user likes character development in Pride and Prejudice, the chatbot could recommend novels with similar themes or well-developed characters.

In some cases, the chatbot might misunderstand the user's intent or need to ask for clarification. For instance, if a user asks for a book that's exciting and thought-provoking but not too difficult to read, the chatbot may suggest a thriller novel, but it could also ask the user for more details about their preferences.

Overall, the chatbot should aim to provide accurate book recommendations based on the user's input, but it might occasionally misinterpret their preferences or need to ask for more information to provide a suitable suggestion.

###Reference Answer (Score 5):
To handle ambiguous or vague inputs, the chatbot should be designed with a high level of natural language understanding and processing. This includes the ability to interpret semantics, context, and sentiment in user input.

1. **Contextual Understanding:** The chatbot should be able to understand and relate to the context provided by the user. For instance, if a user says they want a book like Harry Potter but different, the chatbot could interpret this as the user wanting a book in the fantasy genre but with a different storyline or writing style. 

2. **Semantics Interpretation:** If a user mentions they enjoyed the character development in 'Pride and Prejudice', the bot should understand that the user is likely interested in novels with well-rounded, evolving characters and perhaps, a focus on interpersonal relationships and societal norms.

3. **Sentiment Analysis:** The chatbot should be able to detect sentiment in the user's input. If a user asks for something 'exciting and thought-provoking but not too difficult to read', the chatbot should understand that the user likely wants a book that is engaging and intellectually stimulating, but not too complex or dense in writing style.

In cases where the user's input is too vague or ambiguous, the chatbot should be programmed to ask follow-up questions in a natural and conversational manner. For example, if a user says they want a book like Harry Potter but different, the chatbot could ask, "Could you please specify in what way you'd like it to be different? Are you looking for a different genre, writing style, or narrative structure?"

By adopting these strategies, the chatbot can effectively handle ambiguity and vagueness in user input, provide accurate responses, and ensure a pleasant user experience.

###Score Rubrics:
[How well does the model handle ambiguity and vagueness in the user's input?]
Score 1: The model cannot handle ambiguous or vague inputs, often providing responses that are irrelevant or nonsensical.
Score 2: The model struggles with ambiguous or vague inputs, providing accurate responses some of the time, but often misinterpreting the user's intent.
Score 3: The model generally handles ambiguous or vague inputs well, but occasionally misinterprets the user's intent or asks for clarification.
Score 4: The model handles ambiguous or vague inputs well, usually providing accurate responses and occasionally asking for clarification when necessary.
Score 5: The model expertly handles ambiguous or vague inputs, always interpreting the user's intent correctly and asking for clarification in a natural and conversational manner when necessary.<|end|>

<|assistant|>"""


predict(model, tokenizer, [TEXT])

The previous inference should return the following numerical grade:

[[2.90625]]