This is the Llama-3.2-3B-Instruct model, converted to the OpenVINO™ IR (Intermediate Representation) format with weights compressed to INT8 by NNCF.
Weight compression was performed using nncf.compress_weights with the following parameters:
For more information on quantization, check the OpenVINO model optimization guide.
The provided OpenVINO™ IR model is compatible with:
Prompt template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{input}<|eot_id|>
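As a minimal sketch of how the template above might be filled in before passing text to the model, the snippet below substitutes a user message for the {input} placeholder. The names PROMPT_TEMPLATE and build_prompt are hypothetical helpers introduced here for illustration, and the sketch assumes the line breaks shown in the template are literal newlines.

```python
# Hypothetical helper: fills the prompt template from this model card.
# Assumption: the line breaks in the template are literal newline characters.
PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
    "You are a helpful assistant.<|eot_id|>\n"
    "<|start_header_id|>user<|end_header_id|>\n"
    "{input}<|eot_id|>"
)


def build_prompt(user_input: str) -> str:
    """Return the template with the {input} placeholder replaced.

    str.replace is used instead of str.format so that curly braces
    inside the user's message are left untouched.
    """
    return PROMPT_TEMPLATE.replace("{input}", user_input)


if __name__ == "__main__":
    print(build_prompt("What is OpenVINO?"))
```

The resulting string can be used in place of the bare question in the inference examples. In practice, the tokenizer's own chat template is often a safer way to build prompts, since it is maintained alongside the model.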
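To run inference with Optimum Intel, first install the required packages:
pip install "optimum[openvino]"
Then run model inference:
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

# Load the tokenizer and the INT8 OpenVINO model from the Hugging Face Hub
model_id = "srang992/Llama-3.2-3B-Instruct-ov-INT8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Tokenize the prompt; max_length caps prompt plus generated tokens at 200
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)

# Decode the first (and only) sequence in the batch
text = tokenizer.batch_decode(outputs)[0]
print(text)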
For more examples and possible optimizations, refer to the OpenVINO Large Language Model Inference Guide.
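To run inference with OpenVINO GenAI, first install the required packages:
pip install openvino-genai huggingface_hub
Download the model from the Hugging Face Hub:
import huggingface_hub as hf_hub

model_id = "srang992/Llama-3.2-3B-Instruct-ov-INT8"
model_path = "Llama-3.2-3B-Instruct-ov-INT8"  # local directory for the downloaded files
hf_hub.snapshot_download(model_id, local_dir=model_path)
Then run model inference:
import openvino_genai as ov_genai

device = "CPU"  # target inference device
pipe = ov_genai.LLMPipeline(model_path, device)
print(pipe.generate("What is OpenVINO?", max_length=200))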
Base model: meta-llama/Llama-3.2-3B-Instruct