This runs:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mamba-Codestral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the weights to 4-bit with bitsandbytes at load time
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

# Tokenize a short prompt, move it to GPU 0, and sample up to 100 new tokens
inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 100% 3/3 [01:25<00:00, 28.49s/it]
/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py:452: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn(
Hello world, today IBMAF,
After a great day yesterday I returned to the club with very high expectations. To start the next phase of the day I had the chance to get into the hanger and test it out a bit to prepare for my race to come on Wednesdays. After a few spins around the hanger I decided to pack it up to see what I had accomplished today.
The first thing I did today was work on the plane which I have been doing for
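The UserWarning above comes from bnb_4bit_compute_dtype defaulting to torch.float32. A minimal sketch of how the quantization config could be adjusted so the 4-bit layers compute in half precision and avoid the slow path (choosing float16 to match the input dtype named in the warning is my assumption, not something the output above confirms):

import torch
from transformers import BitsAndBytesConfig

# Same 4-bit load, but have bitsandbytes compute in float16 instead of the float32 default
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: match the fp16 inputs mentioned in the warning
)
# pass this config to AutoModelForCausalLM.from_pretrained() exactly as in the snippet above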
This does not run:
!mistral-chat $HOME/mistral_models/Mamba-Codestral-7B-v0.1 --instruct --max_tokens 256