Cannot use the transformers library to run inference on the model

#40
by manishbaral - opened

I was using the transformers library but was unable to run inference with the model.
I followed the example there, but inference still did not work.
I tried AutoModelForCausalLM as well as pipeline and AutoModel.

Hi. Yes, we are working on adding our new model to the transformers library. Until it's merged, please use the model class we shared with the model checkpoint. The provided example should work as long as the model and the class files (which are downloaded automatically) end up in the same directory.

I saw that, downloaded the model, and tried inference, but it did not work.

The model loads without any error, but during inference it keeps processing indefinitely without throwing an error.
I loaded the model like this:

from transformers import AutoModel, AutoTokenizer
# configuration_phimoe.py and modeling_phimoe.py live in the local "Microsoft" directory
from Microsoft.configuration_phimoe import PhiMoEConfig
from Microsoft.modeling_phimoe import PhiMoEModel

# register the custom classes so AutoModel can resolve them
AutoModel.register(PhiMoEConfig, PhiMoEModel)

model_directory = "/home/models/Microsoft"
config = PhiMoEConfig.from_pretrained(model_directory)
model = PhiMoEModel.from_pretrained(model_directory, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_directory)

input_text = "This is a test sentence."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state)

But the model loads on the CPU.
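Note that nothing in the snippet above moves anything off the CPU. A minimal sketch that moves the already-loaded model and inputs onto the GPU, assuming a CUDA device is available and the weights fit in VRAM:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                                 # move the PhiMoE weights onto the GPU
inputs = {k: v.to(device) for k, v in inputs.items()}    # move the tokenized inputs as well

with torch.no_grad():                                    # inference only, no gradients needed
    outputs = model(**inputs)
print(outputs.last_hidden_state)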

What happens if you try this?


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 

model = AutoModelForCausalLM.from_pretrained( 
    "microsoft/Phi-3.5-MoE-instruct",  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct") 

messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, 
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

generation_args = { 
    "max_new_tokens": 500, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

I am able to load the model on the GPU, but the problem is that I have 23 GB of VRAM, and when I load the model using the approach above, CUDA goes out of memory.
From what I've found, MoE refers to Mixture of Experts, in which we only load certain checkpoints, but the question is: how can we know which checkpoint holds which experts?

MoE models need all the weights in memory, but they only use a fraction of them per token. For a model this big, you probably need 40 GB of memory to run it in 4-bit precision.

Here is how you can load it in 4-bit precision.

model = AutoModelForCausalLM.from_pretrained( 
    "microsoft/Phi-3.5-MoE-instruct",  
    device_map="cuda",  
    load_in_4bit=True,  
    trust_remote_code=True,  
) 
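The load_in_4bit flag is handled by bitsandbytes under the hood; an equivalent, slightly more explicit sketch using BitsAndBytesConfig (assuming bitsandbytes is installed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the weights to 4-bit on load
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    device_map="cuda",
    quantization_config=bnb_config,
    trust_remote_code=True,
)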

Oh, I see. So even for 4-bit precision I need 40 GB of memory; that's huge. So the feasible approach would be to use a smaller-parameter GGUF with llama.cpp?

I'm not sure it is supported in llama.cpp yet: https://github.com/ggerganov/llama.cpp/issues/9168

If you are on a Mac with M1/M2/M3 silicon, you could try one of the MLX models: https://huggingface.co/mlx-community/Phi-3.5-MoE-instruct-4bit

If you quantize it, you can use vLLM to run it on GPUs.
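For the MLX route, a minimal sketch using the mlx-lm package (Apple-silicon only; assumes mlx-lm is installed and uses the 4-bit community checkpoint linked above):

from mlx_lm import load, generate

# load the 4-bit community conversion of Phi-3.5-MoE
model, tokenizer = load("mlx-community/Phi-3.5-MoE-instruct-4bit")
response = generate(model, tokenizer, prompt="Solve 2x + 3 = 7.", max_tokens=200)
print(response)

For the vLLM route, a sketch assuming you already have a quantized checkpoint on disk (the path below is hypothetical):

from vllm import LLM, SamplingParams

llm = LLM(model="path/to/quantized-Phi-3.5-MoE-instruct", trust_remote_code=True)  # hypothetical path
params = SamplingParams(temperature=0.0, max_tokens=500)
outputs = llm.generate(["Solve 2x + 3 = 7."], params)
print(outputs[0].outputs[0].text)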

I am using mistral.rs, and I want to know how much VRAM is required to run the model with mistral.rs. So this command here

./mistralrs-server --isq Q4K -i plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe

quantizes the model, so does that mean I have to load the whole model in 4 bits (which requires around 40 GB of VRAM) and then use the quantized version, which uses less VRAM?

You should ask on the mistral.rs repo about that.
