Medium model with surprising results
I am loading the medium model as follows:
phi_medium_model = AutoModelForCausalLM.from_pretrained(
    "EmergentMethods/Phi-3-medium-128k-instruct-graph",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="eager",
)
phi_medium_tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-medium-128k-instruct-graph")
pipe_medium = pipeline(
    "text-generation",
    model=phi_medium_model,
    tokenizer=phi_medium_tokenizer,
    max_new_tokens=200,
)
To perform inference, I am using the following function:
def phi_kg_extraction(sentence: str, model: str):
    messages = [
        {"role": "system", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.
The User provides text in the format:
-------Text begin-------
<User provided text>
-------Text end-------
The Assistant follows the following steps before replying to the User:
1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:
"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]
where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type> is a very descriptive categorization of the entity.
2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:
"edges":[{"source": <entity 1>, "target": <entity 2>, "relation": <relationship>}, ...]
The <entity N> must correspond to the "id" of an entity in the "nodes" list.
The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:
{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"source":{"type":"string"},"target":{"type":"string"},"relation":{"type":"string"}},"required":["source","target","relation"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
"""},
        {"role": "user", "content": f"""
-------Text begin-------
{sentence}
-------Text end-------
"""},
    ]
    generation_args = {
        "return_full_text": False,
        "temperature": 0.1,
        "do_sample": False,
    }
    print(f"Processing the sentence: {sentence}")
    if "mini" in model:
        output = pipe_mini(messages, **generation_args)
    elif "medium" in model:
        output = pipe_medium(messages, **generation_args)
    ents, rels = convert_phi_kg(output[0]['generated_text'])
    return ents, rels
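(For context, convert_phi_kg simply parses the model's JSON string into entity and relation lists; a simplified sketch, since the actual helper does a bit more error handling:)

import json

def convert_phi_kg(generated_text: str):
    # Simplified sketch: parse the model's JSON output and split it
    # into the "nodes" and "edges" lists.
    graph = json.loads(generated_text)
    return graph.get("nodes", []), graph.get("edges", [])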
However, the results I am obtaining are odd. Instead of nodes and edges, the model outputs unstructured text, as shown below:
Input Sentence: pulmonary vessels are normal
Output: The provided text indicates that the pulmonary vessels are normal. This suggests that there are no abnormalities or issues detected in the blood vessels of the lungs, which are responsible for carrying oxygen-rich blood from the lungs to the heart and oxygen-poor blood from the heart to the lungs. A normal pulmonary vessel condition is essential for efficient gas exchange and overall respiratory health.
What am I doing wrong?
Hello, thanks for the report.
I believe this model was trained without a system message, so try combining your current "system" and "user" messages into a single "user" message.
You can see an example of how we do that in the model card; let me know whether it works after changing your messages list.
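Something like this, as a minimal sketch that just folds your existing instruction text into the single user turn:

# Sketch: system_prompt is assumed to hold the instruction block you
# currently pass as the "system" message.
messages = [
    {"role": "user", "content": f"""{system_prompt}
-------Text begin-------
{sentence}
-------Text end-------
"""},
]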
Thank you, I just tried it but the outcome is the same. What I find odd is that the mini version works properly with an analogous setup.
Thanks. @wagnercosta, do you have any sense of why the medium model might not be generating a graph properly?
Hey Jose,
This model doesn't support a system prompt. However, the example code on the model card works correctly; below is a run using the same input you mentioned.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from dotenv import load_dotenv
load_dotenv()
torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    "EmergentMethods/Phi-3-medium-128k-instruct-graph",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-medium-128k-instruct-graph")
messages = [
    {"role": "user", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.
The User provides text in the format:
-------Text begin-------
<User provided text>
-------Text end-------
The Assistant follows the following steps before replying to the User:
1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:
"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]
where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type> is a very descriptive categorization of the entity.
2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:
"edges":[{"from": <entity 1>, "to": <entity 2>, "label": <relationship>}, ...]
The <entity N> must correspond to the "id" of an entity in the "nodes" list.
The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:
{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"from":{"type":"string"},"to":{"type":"string"},"label":{"type":"string"}},"required":["from","to","label"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
Input:
-------Text begin-------
pulmonary vessels are normal
-------Text end-------
"""}
]
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
The output:
{
    "nodes": [
        {
            "id": "pulmonary vessels",
            "type": "anatomical structure",
            "detailed_type": "blood vessels in the lungs"
        },
        {
            "id": "normal",
            "type": "condition",
            "detailed_type": "healthy state"
        }
    ],
    "edges": [
        {
            "from": "pulmonary vessels",
            "to": "normal",
            "label": "are in"
        }
    ]
}
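If you want to guard against malformed generations downstream, you can also validate the parsed output against the same JSON schema used in the prompt. A quick sketch, assuming the jsonschema package is installed and schema holds that schema as a Python dict:

import json
from jsonschema import validate

# schema is assumed to be the JSON schema dict embedded in the prompt above.
graph = json.loads(output[0]["generated_text"])
validate(instance=graph, schema=schema)  # raises ValidationError if malformed
nodes, edges = graph["nodes"], graph["edges"]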
Thank you. Just out of curiosity, why does the mini instruct-graph model use a system prompt while the medium one uses a user prompt?
Each model has its own chat format. In the case of Phi-3-medium, Microsoft released it with only the user/assistant format (no system role), which is not the case for Phi-3-mini. You can see the difference in the two links below:
With system / user / assistant:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#chat-format
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/tokenizer_config.json#L119 (chat template)
With user / assistant:
https://huggingface.co/microsoft/Phi-3-medium-128k-instruct#chat-format
https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/blob/main/tokenizer_config.json#L119 (chat template)
This comes down to the chat format Microsoft chose when fine-tuning each model for instruction following.
In some cases, they release updates that change the chat format to include the system as well (they did this with some versions of Phi-3).
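If you want to see the difference yourself, you can render the same messages through each tokenizer's chat template and compare the resulting prompt strings. A small sketch (the message contents are just placeholders):

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are an expert graph extractor."},
    {"role": "user", "content": "pulmonary vessels are normal"},
]

for name in ["microsoft/Phi-3-mini-4k-instruct", "microsoft/Phi-3-medium-128k-instruct"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"--- {name} ---")
    try:
        # Render to a string (no tokenization) to inspect how, or whether,
        # the system turn appears in the final prompt.
        print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
    except Exception as exc:
        print(f"system role rejected by this chat template: {exc}")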