extreme slowdown and weird output.
#4 opened by abhimortal6
Tried it in the oobabooga web UI; it's not usable in my case.
3060ti 8GB VRAM, 24GB RAM
The output is weird, and it never returns the code.
Output generated in 27.06 seconds (0.30 tokens/s, 8 tokens, context 63, seed 1191894163)
Output generated in 66.12 seconds (0.47 tokens/s, 31 tokens, context 80, seed 1706855517)
Output generated in 386.01 seconds (0.04 tokens/s, 16 tokens, context 131, seed 1791131008)
Output generated in 50.16 seconds (0.48 tokens/s, 24 tokens, context 118, seed 1161001351)
Output generated in 23.89 seconds (0.04 tokens/s, 1 tokens, context 150, seed 1752912455)
Output generated in 202.32 seconds (0.05 tokens/s, 10 tokens, context 169, seed 1726966570)
This is a quantized 15B model. Also, how did you get it to run?
Sure, the title says so; quality is only marginally decreased, though. To run it in the webui, use the configs from:
https://huggingface.co/ShipItMind/starcoder-gptq-4bit-128g
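(For anyone following along: in text-generation-webui builds of that era, GPTQ checkpoints were usually loaded with explicit bit-width and group-size flags. A minimal sketch, assuming the checkpoint has been placed under models/starcoder-gptq-4bit-128g; the folder name and exact flags are assumptions, not a confirmed recipe from the linked repo.)

# Hypothetical launch line; --wbits/--groupsize mirror the 4-bit, group-size-128
# quantization of the linked checkpoint.
python server.py --model starcoder-gptq-4bit-128g --wbits 4 --groupsize 128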
Hey, I just pushed some new fixes.
Can you give those a try?
Sorry, but where are the updated files? The current repo shows it was last updated a month ago.
The fixes are in the repo: https://github.com/mayank31398/GPTQ-for-SantaCoder
The weights are the same.
OOM
3060ti 8GB VRAM, 24GB RAM
python -m santacoder_inference bigcode/starcoder --wbits 4 --groupsize 128 --load starcoder-GPTQ-4bit-128g/model.pt
Traceback (most recent call last):
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/abhi/Documents/starcoder/GPTQ-for-SantaCoder/santacoder_inference.py", line 96, in <module>
main()
File "/home/abhi/Documents/starcoder/GPTQ-for-SantaCoder/santacoder_inference.py", line 86, in main
model = get_santacoder(args.model, args.load, args.wbits, args.groupsize)
File "/home/abhi/Documents/starcoder/GPTQ-for-SantaCoder/santacoder_inference.py", line 49, in get_santacoder
model = model.cuda()
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 905, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 844, in _apply
self._buffers[key] = fn(buf)
File "/home/abhi/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 905, in <lambda>
return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 7.78 GiB total capacity; 6.68 GiB already allocated; 75.31 MiB free; 6.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
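(Back-of-the-envelope arithmetic, not from the thread: StarCoder has roughly 15.5B parameters, so the 4-bit weights alone come to about 15.5e9 × 0.5 bytes ≈ 7.2 GiB, before quantization scales, activations, and CUDA overhead, which already nearly fills the 7.78 GiB the card reports. The max_split_size_mb hint in the error only mitigates fragmentation; a sketch of trying it anyway, reusing the command above:)

# Rough size of the 4-bit weights alone (ignores scales/zeros and activations).
python -c "print(15.5e9 * 0.5 / 2**30, 'GiB')"   # ~7.2 GiB on a 7.78 GiB card
# Allocator hint from the error message; it reduces fragmentation but cannot
# free up the memory the weights themselves need.
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m santacoder_inference bigcode/starcoder --wbits 4 --groupsize 128 --load starcoder-GPTQ-4bit-128g/model.pt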
Yeah, it's not supposed to work with a 3060 Ti.
Alright, closing.
abhimortal6 changed discussion status to closed