OSError: Unable to load vocabulary from file
What is the possible cause of the following error?
File "/Users/kn/mylangchainenv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2327, in _from_pretrained
raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
How are you loading the tokenizer?
Are you sure your copy of the files (either in a local dir, or in the HF cache) is accessible and not corrupt? You could try re-downloading the files if there is no other apparent cause.
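One quick way to rule out a corrupted download: a GPT-2 style vocab.json is just a JSON token-to-id map, so it should parse cleanly. Here is a stdlib-only sketch of such a check; the helper name `check_vocab_file` is made up for illustration:

```python
import json
from pathlib import Path

def check_vocab_file(path: str) -> str:
    """Return a rough diagnosis for a tokenizer vocab file (hypothetical helper)."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    try:
        # vocab.json files in the GPT-2 tokenizer format are plain JSON
        json.loads(p.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "corrupted"
    return "ok"
```

If this reports anything other than "ok" for the cached vocab file, deleting that cache entry and re-downloading is the usual fix.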
Hey Owen,
I have the following test script.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token="")
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="")
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
I presume you specified your token in your actual code, or else you'd have a different error.
Was there any error during the download? What is your HF cache path (that is, which file is it reading), and can you delete that part of the cache and try again?
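For reference, the hub cache location can be resolved without importing anything from transformers; this stdlib-only sketch mirrors the lookup order the huggingface_hub library uses (HF_HUB_CACHE, then HF_HOME, then the default under ~/.cache):

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    """Resolve the Hugging Face hub cache directory:
    HF_HUB_CACHE wins, then HF_HOME/hub, then ~/.cache/huggingface/hub."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    if "HF_HOME" in os.environ:
        return Path(os.environ["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"

def model_cache_dir(repo_id: str) -> Path:
    """Cached snapshots for a repo live under models--{org}--{name}."""
    return hf_cache_dir() / ("models--" + repo_id.replace("/", "--"))
```

Deleting the `models--databricks--dbrx-instruct` folder under that directory forces a fresh download on the next `from_pretrained` call.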
Hi @khurramnaseem, we just updated the tokenizer to use the standard GPT2Tokenizer class. Could you try again and let me know if it works?
Hey @abhi-db
Yes! It seems to work now. It asked me to run "pip install accelerate", and after doing so it started downloading the following files.
model.safetensors.index.json: 100%|██████████| 29.3k/29.3k [00:00<00:00, 1.03MB/s]
model-00001-of-00061.safetensors: 100%|██████████| 3.52G/3.52G [10:49<00:00, 5.42MB/s]
model-00002-of-00061.safetensors: 100%|██████████| 4.40G/4.40G [18:16<00:00, 4.02MB/s]
model-00003-of-00061.safetensors:
But that seems like a lot of data; I'm not sure what the purpose of all these files is.
All of the files you are downloading are simply the model weights. More specifically, the files that end in .safetensors are the files that contain the model weights. We also saved our model weights across 61 different files because our model is "sharded" into different pieces. This is normal :)
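The small model.safetensors.index.json file downloaded first is what ties the shards together: its weight_map tells the loader which shard file holds each parameter. A toy sketch of that structure (the tensor and file names here are illustrative, not DBRX's real ones):

```python
# Miniature stand-in for model.safetensors.index.json:
# a weight_map from parameter name to the shard file holding it.
index = {
    "metadata": {"total_size": 24},
    "weight_map": {
        "embed.weight": "model-00001-of-00002.safetensors",
        "layer.0.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}

def tensors_per_shard(weight_map):
    """Group parameter names by the shard file that stores them."""
    shards = {}
    for name, fname in weight_map.items():
        shards.setdefault(fname, []).append(name)
    return shards

print(tensors_per_shard(index["weight_map"]))
```

At load time, transformers reads the real index, fetches each listed shard, and reassembles the full state dict, which is why all 61 files are needed.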
thank you @eitanturok & @abhi-db for all the help.