Help Needed!! Text Generation Taking Too Long
Hi, I am new to NLP and am still learning. I am using a GCP VM (e2-highmem-4: 4 vCPUs, 32 GB RAM) to load the model and run it. Here is the code I have written:
```python
import torch
import transformers
from transformers import AutoTokenizer, pipeline

# Load the MPT-7B-Instruct config (trust_remote_code is required for its custom architecture)
config = transformers.AutoConfig.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    trust_remote_code=True,
)
# config.attn_config['attn_impl'] = 'flash'

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    cache_dir="./cache",
)

# MPT uses the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b", cache_dir="./cache")

text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
text_gen(text_inputs="what is 2+2?")
```
The code is taking way too long to generate text. Am I doing something wrong, or is there any way to make things faster?
Also, when creating the pipeline, I am getting the following warning:

`The model 'MPTForCausalLM' is not supported for text-generation`

I saw in another discussion that this shouldn't be a problem because the architecture is custom. Is that correct?
Hi @debajyoti111, could you try removing the line `torch_dtype=torch.bfloat16`? I'm seeing in another post that on some CPU machines this causes the model to run very slowly. Removing that line will fall back to the default `torch.float32` weights and math.
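For example, a minimal sketch of the same loading call with the dtype override removed (the model then loads in float32 by default):

```python
# Sketch: identical to the original call, minus torch_dtype, so the weights load as float32 on CPU
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    config=config,
    trust_remote_code=True,
    cache_dir="./cache",
)
```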
The `not supported for text-generation` warning can be ignored.
Also, taking a step back to separate the system from the code: can you confirm whether MPT is faster or slower than other HF models like OPT-6.7B when you run generation? In general, running LLMs on CPUs is going to be very slow without a custom framework like GGML. Right now we are focused mainly on GPU inference, which should be quite fast when using the `attn_impl: triton` backend.
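For reference, here is a rough sketch of a GPU setup with the Triton attention backend, assuming a CUDA device is available (the `attn_config['attn_impl']` and `init_device` fields follow the pattern shown in the MPT model card):

```python
import torch
import transformers

name = 'mosaicml/mpt-7b-instruct'

# Switch the attention implementation to Triton and initialize weights directly on the GPU
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # bfloat16 is fine on GPU, unlike the CPU case above
    trust_remote_code=True,
)
```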
Let me know if the generation speed gets better!
Closing as stale.
Also wanted to note that we added support for `device_map` and faster KV caching in this PR: https://huggingface.co/mosaicml/mpt-7b-instruct/discussions/41
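As a minimal sketch of how `device_map` is typically passed through `from_pretrained` (this assumes the `accelerate` package is installed; see the linked PR for what the MPT code itself supports):

```python
# Sketch: let accelerate decide device placement for the loaded weights
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='auto',  # requires `pip install accelerate`
)
```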