Finetuned Falcon-40B is not working with pipeline (text-generation)
Hi,
First of all, thanks for the great job! I really love the Falcon models, and on my task Falcon performs better than Llama 2 70B.
I have finetuned falcon-40b (not the instruct variant) on my task using QLoRA and PEFT.
I am now in the process of deploying it with AWS SageMaker. There are several problems, but I would like to focus on one you might be able to help me with.
When I load the model straight from the Hub, create a pipeline, and run a query, I get a response in 150 seconds. It works great!
model_name = "tiiuae/falcon-40b"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
trust_remote_code=True,
device_map="auto",
)
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")
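For reference, this is roughly how I call the pipeline; the prompt and the generation parameters below are only placeholders to illustrate the call:

# placeholder prompt and generation settings, just to show how the pipeline is invoked
output = generator(
    "Summarize the following text: ...",
    max_new_tokens=200,
    do_sample=False,
)
print(output[0]["generated_text"])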
The problem appears when I try to use my finetuned model with the pipeline.
I tried two options:
- Passing the PEFT model to the pipeline
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

PEFT_MODEL = 'models/falcon40_ft_sft'

# 4-bit NF4 quantization with double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the base model referenced by the adapter config
config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
tokenizer.pad_token = tokenizer.eos_token

# wrap the base model with the finetuned LoRA adapters
model = PeftModel.from_pretrained(model, PEFT_MODEL)

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")
I get an error: The model 'PeftModel' is not supported for text-generation. Supported models are [...]
In addition, the same query takes about twice as long to answer.
- Merging the PEFT model
Then I tried to merge the adapters into the base model using:
# `model` here is the PeftModel loaded in the previous snippet
merged_model = model.merge_and_unload()
merged_model.save_pretrained('models/merged_ft_sft_falcon40')
tokenizer.save_pretrained('models/merged_ft_sft_falcon40')
I copied all the configuration files into that directory, including the remote-code file that defines the RWForCausalLM class, as well as the config.json with the correct auto_map entries.
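Roughly, the copy step looks like this (the source path and file names below are illustrative; use whatever remote-code files ship with the base checkpoint):

import shutil

SRC = 'path/to/local/tiiuae-falcon-40b/snapshot'   # wherever the Hub files were downloaded
DST = 'models/merged_ft_sft_falcon40'

# remote-code files that define RWConfig / RWForCausalLM (names are illustrative)
for fname in ['configuration_RW.py', 'modelling_RW.py']:
    shutil.copy(f'{SRC}/{fname}', DST)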
When I run
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

MERGED_MODEL = 'models/merged_ft_sft_falcon40'

# same quantization config as before: 4-bit NF4, double quantization, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the merged checkpoint from the local directory
model = AutoModelForCausalLM.from_pretrained(
    MERGED_MODEL,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL)

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device_map="auto")
I get an error: The model 'RWForCausalLM' is not supported for text-generation. Supported models are [...]
Again, the same query takes about twice as long to answer.
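For reference, this is roughly how I compare response times across the three setups; the prompt and generation settings are placeholders, and the same ones are used in every case:

import time

prompt = "Summarize the following text: ..."  # placeholder prompt

start = time.perf_counter()
output = generator(prompt, max_new_tokens=200, do_sample=False)
print(f"response time: {time.perf_counter() - start:.1f}s")
print(output[0]["generated_text"])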
My question is: how can I make my finetuned model benefit from all of the pipeline features? Why doesn't it behave the same as the Hub model, given that I have all the files in place (my assumption is that only the weight files have changed slightly)?