WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.

#25
by kmukeshreddy - opened

I am getting "WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu." while loading the Mixtral to text-genetation pipeline.

You don't have enough GPU memory. Consider renting a GPU, or loading the model in a more memory-efficient way (e.g. in 4-bit).

I second what @cekal said: you probably don't have enough GPU RAM to fit the model. Try loading it in lower precision (e.g. float16 or load_in_4bit), or use the serialized 4-bit checkpoint here: https://huggingface.co/ybelkada/Mixtral-8x7B-Instruct-v0.1-bnb-4bit
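For reference, a minimal sketch of loading the model in 4-bit with bitsandbytes (the quantization settings below are illustrative, not the only valid ones; bitsandbytes and accelerate need to be installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit NF4 quantization with float16 compute (illustrative settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)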

Hi @ybelkada

Any idea what the minimum system requirements are to run this model (e.g. GPU memory)? I am trying to run the Python code below using Streamlit, and I get the above error (or warning, I should say):

import streamlit as st
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch

token = ""

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})

template = """
You are an intelligent chatbot that gives out useful information to humans.
You return the responses in sentences with arrows at the start of each sentence
{query}
"""

prompt = PromptTemplate(template=template, input_variables=["query"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

print(llm_chain.invoke('What are the 3 causes of glacier meltdowns?'))
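If it helps with debugging, here is a small sketch (assuming the pipeline object from the code above) to confirm which modules ended up offloaded to CPU or disk, which is what triggers the warning:

# hf_device_map is set because the model was loaded with device_map="auto";
# entries mapped to "cpu" or "disk" are the ones behind the offload warning
for name, device in pipeline.model.hf_device_map.items():
    if device in ("cpu", "disk"):
        print(f"offloaded: {name} -> {device}")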

I have an error, or something I don't understand. What should I do? Thanks.

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 23.97it/s]
Some parameters are on the meta device device because they were offloaded to the cpu and disk.
Setting pad_token_id to eos_token_id:128009 for open-end generation.

The code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "/MyPath/Meta-Llama-3.1-8B-Instruct"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Set the padding token to be the same as the EOS token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Define the messages
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Prepare input IDs and attention mask
inputs = tokenizer(
    [msg["content"] for msg in messages],
    return_tensors="pt",
    padding=True,
    truncation=True,
)

# Ensure inputs are moved to the correct device
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Set terminators
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Generate text
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # Add attention mask
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode and print the response
response = outputs[0][input_ids.shape[-1]:]
print("Generated Response:", tokenizer.decode(response, skip_special_tokens=True))

# Additional debugging output
print("Inputs:")
print(inputs)
print("Generated Output IDs:")
print(outputs)
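One side note, separate from the offload warning (which again just means the model did not fully fit on the GPU and was partly placed on CPU/disk): Llama 3.1 Instruct models generally expect the chat template rather than the raw message strings. A minimal sketch, reusing the tokenizer, model, messages and terminators from the code above, might look like this:

# Build a single prompt with the model's chat template (sketch)
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))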
