Apply for community grant: Personal project (gpu and storage)
I'm training a SOTA JA/EN bilingual open model (it already beats the current best JA model, Stability AI's 70B, in both JA fluency and benchmarks). I'm doing a proper training run this week and will be releasing the model, datasets, and code soon, and I figured it might be nice to set up a Space so people can easily try it out (at least at launch)?
Hi @randomfoo, we've assigned t4-small to this Space with a 15 minute sleep time for now, as the Space is not ready yet. Let us know when it's ready so we can change the sleep time to 1 hour or so.
Hiya,
So I switched to Gradio since it seemed easier to set up a chat interface (sort of, but holy crap the docs are bad and it took way too much time to get working). Still, I finally got it working while developing on my local system. A few questions on this Space:
I originally started with Streamlit - can I switch this Space to a Gradio instance? I didn't see that option in Settings. If not, do I need to start a new Space and follow up again? I assume the conversations are attached per Space? I'll probably also rename this Space; just wondering whether that will cause problems.
I'm trying to run a 7B model and it seems to be running out of RAM or VRAM. I tried load_in_4bit with bnb, but it's still not going well. In theory, a Q4 quant should fit in <7GB of VRAM, right?
Is there any way to save the model locally? Every time the Space restarts it pulls the model again - is that right? (I'm just pulling mistral-7b for now.)
Thanks!
I originally started with Streamlit - can I switch this Space to a Gradio instance? I didn't see that option in Settings. If not, do I need to start a new Space and follow up again? I assume the conversations are attached per Space? I'll probably also rename this Space; just wondering whether that will cause problems.
You can change the SDK from streamlit to gradio by updating your README.md. https://huggingface.co/spaces/augmxnt/test7b/blob/d3878745e30ebbebfb3521bbbea4c830d68319e7/README.md?code=true#L6-L7
I think it'll be fine to rename your Space.
I'm trying to run a 7B model and it seems to be running out of RAM or VRAM. I tried load_in_4bit with bnb, but it's still not going well. In theory, a Q4 quant should fit in <7GB of VRAM, right?
I think you should test your demo on your local machine or Google Colab, etc. first. A T4 is usually enough for a 7B model if you load it in 4-bit or 8-bit.
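For reference, a minimal 4-bit loading sketch looks roughly like this (the model ID is just a placeholder and the exact config values are assumptions, not a tested recipe):

# Rough sketch: load a 7B model in 4-bit so it fits on a T4 (~16 GB VRAM).
# Assumes recent transformers + bitsandbytes; the model ID is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)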
Is there any way to save the model locally? Every time the Space restarts it pulls the model again - is that right? (I'm just pulling mistral-7b for now.)
You can attach persistent storage to your Space and set the environment variable HF_HOME=/data. But generally speaking, it's a good idea to debug your demo in your local environment before deploying it to Spaces.
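As a rough illustration of the caching behavior (assuming persistent storage is attached at /data and HF_HOME=/data is set in the Space settings; the repo ID is just an example):

# Sketch: confirm the HF cache lives on the persistent /data disk, so
# restarts reuse the already-downloaded weights instead of re-pulling them.
import os
from huggingface_hub import snapshot_download

print("HF_HOME =", os.environ.get("HF_HOME"))  # should print /data

local_dir = snapshot_download("mistralai/Mistral-7B-v0.1")  # example repo
print("cached at:", local_dir)  # somewhere under /data/hub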
Hi, it took a little longer than expected, but we've launched our model now. Would it be possible to get this Space's spec bumped for a while, with a longer sleep time and maybe some persistent storage? Our model: https://huggingface.co/augmxnt/shisa-7b-v1
(Also, did the original grant expire? It looks like the Space is back on CPU, so it sadly doesn't run.)
Hi @leonardlin , sorry, looks like I missed your comment. I assigned t4-small and set the sleep time to 1 hour.
@leonardlin
Can you update the title of the Space in your README.md as well? https://huggingface.co/spaces/augmxnt/shisa/blob/1a6b000d2ad1ca21ef95caabf91cb94d9fc8c935/README.md?code=true#L2
Ah thanks, just updated the README.md - first time using a Space, so still learning the ropes :)
Thanks!
@leonardlin I just noticed that your Space is not working properly due to CUDA OOM. I've upgraded the hardware to a10g-small for now. But it would be nice if you could look into it.
logs:
===== Application Startup at 2023-12-07 05:07:31 =====
tokenizer_config.json: 100%|██████████| 11.4k/11.4k [00:00<00:00, 37.2MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 101MB/s]
tokenizer.json: 100%|██████████| 6.14M/6.14M [00:00<00:00, 75.5MB/s]
added_tokens.json: 100%|██████████| 1.42k/1.42k [00:00<00:00, 6.24MB/s]
special_tokens_map.json: 100%|██████████| 552/552 [00:00<00:00, 2.32MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
config.json: 100%|██████████| 605/605 [00:00<00:00, 2.68MB/s]
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 63.9MB/s]
model-00001-of-00005.safetensors: 100%|██████████| 3.92G/3.92G [00:10<00:00, 385MB/s]
model-00002-of-00005.safetensors: 100%|██████████| 3.93G/3.93G [00:24<00:00, 160MB/s]
model-00003-of-00005.safetensors: 100%|██████████| 3.93G/3.93G [00:25<00:00, 156MB/s]
model-00004-of-00005.safetensors: 100%|██████████| 3.17G/3.17G [00:19<00:00, 165MB/s]
model-00005-of-00005.safetensors: 100%|██████████| 984M/984M [00:03<00:00, 264MB/s]
Downloading shards: 100%|██████████| 5/5 [01:24<00:00, 16.90s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:47<00:00, 9.58s/it]
generation_config.json: 100%|██████████| 133/133 [00:00<00:00, 531kB/s]
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-11 (generate):
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
return self.sample(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
outputs = self(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
outputs = self.model(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
layer_outputs = decoder_layer(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 639, in forward
hidden_states = self.mlp(hidden_states)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 175, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 386, in forward
state.subB = (outliers * state.SCB.view(-1, 1) / 127.0).t().contiguous().to(A.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 161.56 MiB is free. Process 791650 has 14.42 GiB memory in use. Of the allocated memory 11.95 GiB is allocated by PyTorch, and 2.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-12 (generate):
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
return self.sample(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
outputs = self(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
outputs = self.model(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
layer_outputs = decoder_layer(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 626, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 244, in forward
query_states = self.q_proj(hidden_states)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 421, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (57x2083 and 2080x4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-13 (generate):
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
return self.sample(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
outputs = self(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
outputs = self.model(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
layer_outputs = decoder_layer(
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 639, in forward
hidden_states = self.mlp(hidden_states)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 175, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 386, in forward
state.subB = (outliers * state.SCB.view(-1, 1) / 127.0).t().contiguous().to(A.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 222.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 25.56 MiB is free. Process 791650 has 14.55 GiB memory in use. Of the allocated memory 12.16 GiB is allocated by PyTorch, and 2.25 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi, I'll take a look. I have it with load_in_8bit and it spun up fine originally, so I'm not sure why it ran out. I'll test it locally on my dev box tomorrow just to sanity check the size!
@hysts
OK, figured out the issue: I was testing with Mistral 7B before, but our model uses more memory (because of the tokenizer?) and goes way over. I switched the code to load_in_4bit and it should load in ~5GB of VRAM, although that grows with context... On my local box I was using use_flash_attention_2, which saves some memory, but when I put it in my requirements the build complained about not having torch. Is there a way to stage library installs with Spaces?
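Roughly, the loading code now looks like this (a sketch - the exact quantization config details are approximations):

# Sketch of the combination above: 4-bit quantization plus FlashAttention-2.
# Assumes flash-attn is installed (hence the requirements question) and a
# transformers version that still accepts the use_flash_attention_2 flag.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # fine on an A10G
    ),
    device_map="auto",
    use_flash_attention_2=True,  # newer releases: attn_implementation="flash_attention_2"
)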
@leonardlin
Thanks for looking into this. Hmm, not sure, but maybe you can try adding torch to pre-requirements.txt?
https://huggingface.co/docs/hub/spaces-dependencies#adding-your-own-dependencies
OK, pre-requirements.txt done and FA2 running - that was an adventure. Looks like the EN announcement is starting to percolate through JA LLM Twitter, so it will be good to see the response: https://twitter.com/webbigdata/status/1733044645687595382
BTW, I made some interesting discoveries along the way, in case you're going to put together default docs/templates for deploying LLM demos: the Gradio default chat example uses the streamer in a thread pool, but the streamer is actually not thread-safe and will end up leaking context between sessions. Also, as of 4.3.0, the docs say the examples should be passed as a list, but if you have additional_inputs then that breaks and they have to be a list of lists. I have no idea why.
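One way around the thread-safety issue is to create a fresh streamer per request instead of sharing one; roughly (a simplified sketch, with model and tokenizer assumed to be loaded elsewhere):

# Per-request streamer pattern: a new TextIteratorStreamer for each chat
# call, so concurrent sessions can't read each other's tokens.
from threading import Thread
from transformers import TextIteratorStreamer

def chat_fn(message, history):
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
    ).start()

    partial = ""
    for new_text in streamer:
        partial += new_text
        yield partial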
Also, concurrency_limit seemed to be another heisenbug (e.g. it seemed to work locally but not on the Space), so that was sort of an adventure!
Anyway, thanks again for all the help w/ my first HF Spaces experience!
@leonardlin Thanks for the feedback! I'll share this internally.
Hello @leonardlin , thank you for your feedback. We greatly appreciate hearing from our users.
Also, concurrency_limit seemed to be another heisenbug (e.g. it seemed to work locally but not on the Space),
How do you mean? During my testing with Colab and a Spaces demo, I found that the chat interface respects the queue size.
the docs say the examples should be passed as a list, but if you have additional_inputs, then it breaks and has to be a list of lists.
We can maybe make our documentation clearer by providing examples that include additional inputs.
When calling a function (for text generation in this case) with multiple inputs (text prompt and additional inputs in this case), we need to provide example values for all of them in a list. If we have more than one example, we need to pass in a list of lists of example inputs. You can find more information on this in our Docs (https://www.gradio.app/docs/examples#initialization) and guides (https://www.gradio.app/guides/more-on-examples#providing-examples).
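For example, a minimal sketch with hypothetical additional inputs (the function and example values here are purely illustrative):

# Sketch: with additional_inputs, each example supplies a value for every
# input, so `examples` becomes a list of lists. Values are illustrative only.
import gradio as gr

def generate(message, history, system_prompt, max_new_tokens):
    return f"(echo) {message}"  # stand-in for actual text generation

demo = gr.ChatInterface(
    fn=generate,
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System prompt"),
        gr.Slider(1, 1024, value=256, label="Max new tokens"),
    ],
    examples=[
        ["Hello!", "You are a helpful assistant.", 256],
        ["Tell me about Japan.", "You are a helpful assistant.", 256],
    ],
)

demo.launch()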
Heya @ysharma
How do you mean? During my testing with Colab and a Spaces demo, I found that the chat interface respects the queue size.
Well, heisenbug because per the documentation (and locally) I was able to set concurrency_limit, but on the HF Space it barfed. Since it was a clean rebuild I assumed it would be pulling the same version, but maybe not?
the docs say the examples should be passed as a list, but if you have additional_inputs, then it breaks and has to be a list of lists.
The documentation for https://www.gradio.app/docs/chatinterface#initialization says:
examples: list[str] | None (default: None) - sample inputs for the function; if provided, they appear below the chatbot and can be clicked to populate the chatbot input.
and list[str] does in fact work, but if you also include additional_inputs it needs to be changed to list[list[str]]. If it's supposed to mirror https://www.gradio.app/docs/examples, it might be worth linking there - and maybe there's a better way for the docs to stay in sync if all Examples() are supposed to behave the same way.