Custom handler.py and requirements.txt
Does anyone have an example handler.py and requirements.txt?
We keep getting:
RuntimeError: The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3
This is the current handler.py we are testing, which is failing:
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path="", force_cpu: bool = False):
        import torch
        from transformers import pipeline
        from transformers import AutoTokenizer
        from transformers import AutoModelForCausalLM

        if force_cpu:
            # Monkey-patch CUDA detection so everything loads on the CPU.
            torch.cuda.is_available = _force_not_available
            self.generate_text = pipeline(
                model=path,
                torch_dtype=torch.bfloat16,
                trust_remote_code=True,
                # low_cpu_mem_usage=True,
            )
        else:
            self.tokenizer = AutoTokenizer.from_pretrained(
                path, padding_side="left")
            self.model = AutoModelForCausalLM.from_pretrained(
                path,
                torch_dtype=torch.float16,
                trust_remote_code=True,
                load_in_8bit=True,
                device_map="auto",
                low_cpu_mem_usage=True,
            )
            from instruct_pipeline import InstructionTextGenerationPipeline
            self.generate_text = InstructionTextGenerationPipeline(
                model=self.model,
                tokenizer=self.tokenizer,
            )

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # process input
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None)

        # pass inputs with all kwargs in data
        if parameters is not None:
            output = self.generate_text(inputs, **parameters)
        else:
            output = self.generate_text(inputs)

        # postprocess the prediction: the pipeline returns a list of dicts,
        # so attach the GPU stats to the first record.
        # report_gpu_usage() is defined elsewhere in our module.
        gpu_info = report_gpu_usage()
        output[0]["gpu_info"] = gpu_info
        return output
class BlockTimer(object):
    def __enter__(self):
        import time
        self.start = time.perf_counter()
        return self

    def __exit__(self, typ, value, traceback):
        import time
        self.duration = time.perf_counter() - self.start


def _force_not_available() -> bool:
    return False
def test() -> None:
    import textwrap

    with BlockTimer() as timer:
        print("Model load")
        handler = EndpointHandler(path="databricks/dolly-v2-7b", force_cpu=False)
    print(f"Elapsed: {round(timer.duration, 2)}")
    print()

    # parameters for text generation
    parameters: Dict[str, Any] = {
        "max_new_tokens": 256,
        "min_length": 16,
    }
    payload = {"inputs": wall_of_text(), "parameters": parameters}

    with BlockTimer() as timer:
        print("Inference")
        results = handler(payload)
    print(f"Elapsed: {round(timer.duration, 2)}")
    print()

    for key, value in results[0].items():
        print()
        print(f"=== {key}")
        if key == "gpu_info":
            for line in value.split("\n"):
                if "Default |" in line:
                    print(line)
        else:
            print(textwrap.fill(
                str(value), 140, drop_whitespace=False, replace_whitespace=False))
def wall_of_text() -> str:
    return """
The present invention relates to compositions and methods for the treatment of the
Charcot-Marie-Tooth disease and related disorders. Charcot-Marie-Tooth disease (“CMT
Mining of publicly available data, describing molecular mechanisms and pathological
manifestations of the CMT1A disease, allowed us to prioritize a few functional cellular
modules-transcriptional regulation of PMP22 gene, PMP22 protein folding/degradation,
Schwann cell proliferation and apoptosis, death of neurons, extra-cellular matrix
deposition and remodelling, immune response-as potential legitimate targets for
CMT-relevant therapeutic interventions.
""".replace("\n", " ")
if __name__ == '__main__':
    test()
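For completeness, the requirements.txt we are testing with is roughly the following (a minimal sketch; unpinned, and accelerate plus bitsandbytes are only there because of device_map="auto" and load_in_8bit):

torch
transformers
accelerate
bitsandbytes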
I'm facing the same error on really long input texts, even when I specify max_length=2048 and truncation=True for the tokenizer:
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    max_length=2048,
    truncation=True,
)
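And this is how I check what actually comes out (it prints the number of tokens the model will see):

print(inputs["input_ids"].shape[-1])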
It's weird because the base model accepts 5120 tokens if you look at the config.json file.
It's 2048, actually; see https://huggingface.co/databricks/dolly-v2-12b/discussions/10 for discussion of the issue.
Sorry if I wasn't clear; I was talking about the model's config.json file: https://huggingface.co/databricks/dolly-v2-12b/blob/6d35f0d536712a5fd765b028b1a61af924d3d94b/config.json#L16
It's similar to the one used by the EleutherAI/pythia-12b model, which accepts 5120 tokens in input.
Hm, wouldn't it be https://huggingface.co/databricks/dolly-v2-12b/blob/main/tokenizer_config.json#L5 that matters? I'm not sure
I saw that tokenizer parameter, but it is useless: keeping that number during tokenization means there is effectively no max_length, which can't be right, because you will get an error if you try to feed the EleutherAI base model an input of more than 5120 tokens.
Dolly v2 12B seems to be fine-tuned on 2048-token inputs, so the model now accepts a maximum of 2048 tokens even though the hidden_size layer is still 5120.
The problem I'm trying to understand is why I keep getting an input tensor of 2049 when I specify a max_length of 2048 to my tokenizer. 🤗
Isn't hidden_size just the dimension of the encoding layers? I don't think that's the same thing.
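A quick way to check both numbers without downloading the weights (AutoConfig only fetches config.json):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("databricks/dolly-v2-12b")
print(config.hidden_size)              # 5120 -- the width of each layer
print(config.max_position_embeddings)  # 2048 -- the actual context window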
I think we can fold this into https://huggingface.co/databricks/dolly-v2-12b/discussions/10
So what exactly is the actual fix that I can implement in a handler.py, then? I've never gotten this to work, even setting truncation to 1024 tokens in the tokenizer configuration.
Did you see the discussion in the other thread? I'm not sure how to change your current code, but it explains why you're getting this. You can't even use the full 2048 tokens, because the prompt template and the generated tokens need room too.
Can this be set in the tokenizer for truncation or something? Or how do I go about figuring out the actual tokenized length the model is getting so that I can test things?
You can set the pipeline to truncate, or truncate yourself. The context window is a fixed property of the model, though.
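For example, as far as I can tell truncation and max_length are per-call arguments to the tokenizer, not from_pretrained arguments, so truncating yourself would look something like this (a sketch; pick a limit that leaves room for the prompt and the generated tokens):

ids = tokenizer(prompt, truncation=True, max_length=1024,
                return_tensors="pt")["input_ids"]
print(ids.shape[-1])  # should now be capped at 1024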
I have this in my current code and I'm still getting the 2049-vs-2048 issue:
self.tokenizer = AutoTokenizer.from_pretrained(
    path,
    padding_side="left",
    truncation=True,
    max_length=1024)
self.model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    # load_in_8bit=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
from instruct_pipeline import InstructionTextGenerationPipeline
self.generate_text = InstructionTextGenerationPipeline(
    model=self.model,
    tokenizer=self.tokenizer,
)
What's your input like when this fails and how long is the output? I wouldn't really expect you'd bump up against the context window limit with these settings.
It seems the tokenizer is ignoring the max_length parameter and isn't truncating? The following generates an input_ids size of 1998 for the text below.
self.tokenizer = AutoTokenizer.from_pretrained(
    path,
    padding_side="left",
    truncation=True,
    max_length=1024)
def wall_of_text() -> str:
    return """
Create a ten to fifteen word intriguing headline for the following article.
The present invention relates to compositions and methods for the treatment of the
Charcot-Marie-Tooth disease and related disorders. Charcot-Marie-Tooth disease (“CMT
Mining
of publicly available data, describing molecular mechanisms and pathological
manifestations
of the CMT1A disease, allowed us to prioritize a few functional cellular
modules-transcriptional regulation of PMP22 gene, PMP22 protein folding/degradation,
Schwann cell proliferation and apoptosis, death of neurons, extra-cellular matrix
deposition
and remodelling, immune response-as potential legitimate targets for CMT-relevant
therapeutic interventions. The combined impact of these deregulated functional modules on
onset and progression of pathological manifestations of Charcot-Marie-Tooth justifies a
potential efficacy of combinatorial CMT treatment. International patent application No.
PCT/EP2008/066457 describes a method of identifying drug candidates for the treatment of
the
Charcot-Marie-Tooth disease by building a dynamic model of the pathology and targeting
functional cellular pathways which are relevant in the regulation of CMT disease.
International patent application No. PCT/EP2008/066468 describes compositions for the
treatment of the Charcot-Marie-Tooth disease which comprise at least two compounds
selected
from the group of multiple drug candidates. The purpose of the present invention is to
provide new therapeutic combinations for treating CMT and related disorders. The invention
thus relates to compositions and methods for treating CMT and related disorders,
in particular toxic or traumatic neuropathy and amyotrophic lateral sclerosis,
using particular drug combinations. An object of this invention more specifically
relates to
a composition comprising baclofen, sorbitol and a compound selected from pilocarpine,
methimazole, mifepristone, naltrexone, rapamycin, flurbiprofen and ketoprofen, salts or
prodrugs thereof, for simultaneous, separate or sequential administration to a mammalian
subject. A particular object of the present invention relates to a composition comprising
baclofen, sorbitol and naltrexone, for simultaneous, separate or sequential administration
to a mammalian subject. Another object of the invention relates to a composition
comprising
(a) rapamycin, (b) mifepristone or naltrexone, and (c) a PMP22 modulator, for simultaneous,
separate or sequential administration to a mammalian subject. In a particular embodiment,
the PMP22 modulator is selected from acetazolamide, albuterol, amiloride,
aminoglutethimide,
amiodarone, aztreonam, baclofen, balsalazide, betaine, bethanechol, bicalutamide,
bromocriptine, bumetanide, buspirone, carbachol, carbamazepine, carbimazole, cevimeline,
ciprofloxacin, clonidine, curcumin, cyclosporine A, diazepam, diclofenac, dinoprostone,
disulfiram, D-sorbitol, dutasteride, estradiol, exemestane, felbamate, fenofibrate,
finasteride, flumazenil, flunitrazepam, flurbiprofen, furosemide, gabapentin,
galantamine, haloperidol, ibuprofen, isoproterenol, ketoconazole, ketoprofen, L-carnitine,
liothyronine (T3), lithium, losartan, loxapine, meloxicam, metaproterenol, metaraminol,
metformin, methacholine, methimazole, methylergonovine, metoprolol, metyrapone,
miconazole,
mifepristone, nadolol, naloxone, naltrexone; norfloxacin, pentazocine, phenoxybenzamine,
phenylbutyrate, pilocarpine, pioglitazone, prazosin, propylthiouracil, raloxifene,
rapamycin, rifampin, simvastatin, spironolactone, tacrolimus, tamoxifen, trehalose,
trilostane, valproic acid, salts or prodrugs thereof. 1. A method of improving nerve
regeneration in a human subject suffering from amyotrophic lateral sclerosis,
or a neuropathy selected from an idiopathic neuropathy, diabetic neuropathy,
a toxic neuropathy, a neuropathy induced by a drug treatment, a neuropathy provoked by
HIV,
a neuropathy provoked by radiation, a neuropathy provoked by heavy metals, a neuropathy
provoked by vitamin deficiency states, or a traumatic neuropathy, comprising administering
to the human subject an amount of a composition effective to improve nerve regeneration;
and
wherein the composition comprises baclofen or a pharmaceutically acceptable salt thereof
in
an amount from 1 to 300 mg/kg of the human subject per day; D-sorbitol or a
pharmaceutically
acceptable salt thereof; and naltrexone or a pharmaceutically acceptable salt thereof in
an
amount from 1 to 100 mg/kg of the human subject per day. 2. The method of claim 1,
wherein the composition further comprises a pharmaceutically suitable excipient or
carrier.
3. The method of claim 2, wherein the composition is formulated with a drug eluting
polymer,
a biomolecule, a micelle or liposome-forming lipids or oil in water emulsions,
or pegylated
or solid nanoparticles or microparticles for oral or parenteral or intrathecal
administration. 4. The method of claim 1, wherein the subject suffers from a traumatic
neuropathy arising from brain injury, spinal cord injury, or an injury to peripheral
nerves.
5. The method of claim 1, wherein the D-sorbitol or a pharmaceutically acceptable salt
thereof is D-sorbitol. 6. The method of claim 1, wherein the composition is formulated for
oral administration. 7. The method of claim 6, wherein the composition is a liquid
formulation. 8. The method of claim 1, wherein baclofen or a pharmaceutically acceptable
salt thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone
or a
pharmaceutically acceptable salt thereof are the sole active ingredients. 9. The method of
claim 1, comprising administering to the human subject baclofen or a pharmaceutically
acceptable salt thereof in an amount from 10 to 200 mg/kg of the human subject per day and
naltrexone or a pharmaceutically acceptable salt thereof in an amount from 1 to 50 mg/kg
of
the human subject per day. 10. The method of claim 1, comprising administering to the
human
subject baclofen or a pharmaceutically acceptable salt thereof in an amount from 10 to 200
mg/kg of the human subject per day and naltrexone or a pharmaceutically acceptable salt
thereof in an amount from 1 to 50 mg/kg of the human subject per day. 11. The method of
claim 1, comprising administering to the human subject baclofen or a pharmaceutically
acceptable salt thereof in an amount from 60 mg to 18 mg per day and naltrexone or a
pharmaceutically acceptable salt thereof in an amount from 60 mg to 6 mg per day. 12. The
method of claim 1, comprising administering to the human subject baclofen or a
pharmaceutically acceptable salt thereof in an amount from 60 mg to 12 mg per day and
naltrexone or a pharmaceutically acceptable salt thereof in an amount from 60 mg to 3 mg
per
day. 13. The method of claim 10, wherein baclofen or a pharmaceutically acceptable salt
thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a
pharmaceutically acceptable salt thereof are administered orally to the human subject. 14.
The method of claim 10, wherein baclofen or a pharmaceutically acceptable salt thereof,
D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a
pharmaceutically acceptable salt thereof are administered separately to the human subject.
15. The method of claim 13, wherein baclofen or a pharmaceutically acceptable salt
thereof,
D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a
pharmaceutically acceptable salt thereof are formulated in a liquid formulation. 16. The
method of claim 15, wherein baclofen or a pharmaceutically acceptable salt thereof,
D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a
pharmaceutically acceptable salt thereof are administered to the human subject in divided
doses. 17. The method of claim 15, wherein baclofen or a pharmaceutically acceptable salt
thereof, D-sorbitol or a pharmaceutically acceptable salt thereof, and naltrexone or a
pharmaceutically acceptable salt thereof are administered to the human subject in divided
doses two times daily.
""".replace("\n", " ")
Yeah, what comes out of the tokenizer in this case? What's its length?
1998 for input_ids.
And I haven't counted, but your input is longer than that in tokens, right? It's clearly not limiting to 1024 tokens, though. Honestly, I don't quite know why, but I'm aware that this setting has caused some questions: https://huggingface.co/databricks/dolly-v2-12b/blob/main/tokenizer_config.json#L5 It seems like it should be lower, and we've discussed this elsewhere. But I wonder if you need to set model_max_length to 1024 instead? This is new territory for me, but it's a decent next guess.
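Something like this, untested on my end:

tokenizer = AutoTokenizer.from_pretrained(path, padding_side="left",
                                          model_max_length=1024)
# or, on an already-created tokenizer:
tokenizer.model_max_length = 1024
# truncation still has to be requested at encode time; without an explicit
# max_length it falls back to model_max_length:
ids = tokenizer(text, truncation=True)["input_ids"]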
Yes, the input I'm submitting is 1998 tokens, which, combined with the built-in prompt and the generated output, exceeds the 2048 limit.
I tried that; setting it after creating the object seems to be ignored as well. I'm a bit perplexed trying to figure out how to configure the tokenizer to actually truncate.
I'm trying to decipher the InstructionTextGenerationPipeline this morning to see if I can find a way to properly trim the instruction and context to a length that accounts for the pipeline's injected prompt and the max-new-tokens output count.
So, based on my lack of progress, I'm guessing I'll need to truncate manually to some maximum value below 2048, based on my desired maximum output.
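If it helps anyone else, this is the kind of manual trim I'm planning. trim_prompt is my own hypothetical helper, and prompt_overhead is a guess at what the pipeline's injected template costs that I still need to verify:

def trim_prompt(tokenizer, text: str, max_new_tokens: int,
                context_window: int = 2048, prompt_overhead: int = 64) -> str:
    # Reserve room for the injected instruction template and for generation,
    # then truncate the raw text to whatever budget is left.
    budget = context_window - max_new_tokens - prompt_overhead
    ids = tokenizer(text, truncation=True, max_length=budget)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

I'd then pass the trimmed string as "inputs" in the payload instead of the raw wall of text.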