Code example request with vllm

#1
by ShiningJazz - opened

Can anyone give me some example code for using this model with the vLLM library?
I'm a newbie to LLMs and the vLLM library.

In particular, I want to know what quantization method string should be passed as the quantization parameter:
model = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=4, quantization=)

ShiningJazz changed discussion title from Example request with vllm to Code example request with vllm
Neural Magic org

You can just run with:

from vllm import LLM

# quantization is inferred from the checkpoint, so no extra argument is needed
model = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=4)
output = model.generate("Hello my name is")
print(output[0].outputs[0].text)
Neural Magic org
edited Jul 12

You need not specify the quantization argument since it will be inferred from the checkpoint.
You could use the following code snippet:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a plain string; add_generation_prompt=True
# appends the assistant header so the model starts a fresh reply.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=4)

outputs = llm.generate(prompts, sampling_params)

# generate() returns one RequestOutput per prompt; take its first completion
generated_text = outputs[0].outputs[0].text
print(generated_text)
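
As a side note, recent vLLM releases also expose an LLM.chat() helper that applies the model's chat template for you, so the tokenizer step can be skipped. A minimal sketch, assuming a vLLM version new enough to ship chat() (it was added after this thread):

from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# chat() applies the chat template internally (assumes a recent vLLM release)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)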
abhinavnmagic changed discussion status to closed
