
Any way I can run it on my low/mid-tier HP desktop? Specs attached as a .png. BTW, I know it's probably a long shot.

#18
by vgrowhouse - opened

image1.png

100% stock, no upgraded RAM. Also, if you're reading this: could my old GTS 450 run this?

The simple answer is no; it's like trying to fit a train into a car, or rather onto a bike.
The model is available on HuggingChat, so use it there instead.
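For a rough sense of the scale mismatch, here's a back-of-envelope sketch; the 70B parameter count comes from the model, while the 1 GB of VRAM on a GTS 450 and 8 GB of system RAM are just assumptions about a stock desktop, not the actual specs in the attached .png:

```python
# Rough weight-memory estimate for a 70B-parameter model vs. a stock desktop.
params = 70e9  # 70 billion parameters

# Approximate bytes per parameter at common precisions/quantizations.
bytes_per_param = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt}: ~{params * b / 1e9:.0f} GB just for the weights")

# Assumed hardware (hypothetical stock config, not taken from the attached .png):
gts450_vram_gb = 1
desktop_ram_gb = 8
print(f"available: {gts450_vram_gb} GB VRAM + {desktop_ram_gb} GB system RAM")
# fp16 ~140 GB, q8 ~70 GB, q4 ~35 GB -- none of it comes close to fitting.
```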

WTF, what are you trying to run this on, lol. Bro, you can't even run it on Kaggle or Colab (the best freely available notebooks).

A refurbished Mac Studio M1 Ultra with 128 GB of RAM can be found on eBay for $2.5k-$3k and can run 70B models at Q8 at ~7.5 tokens/sec, which IMO is perfect for chatting (slightly above my reading speed). Up to 8k tokens of context it is still OK at ~5 tokens/sec.
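To see why 128 GB of unified memory is enough, here's a rough footprint sketch; the layer/head counts are the published Llama 3.1 70B configuration as I understand it, an fp16 KV cache is assumed, and runtime/compute buffers are ignored:

```python
# Rough memory footprint for a 70B model at Q8 plus its KV cache.
weights_gb = 70e9 * 1.0 / 1e9  # ~70 GB at ~1 byte/param (Q8)

# Assumed architecture numbers (Llama 3.1 70B uses grouped-query attention):
n_layers, n_kv_heads, head_dim = 80, 8, 128
kv_bytes_per_token = n_layers * n_kv_heads * head_dim * 2 * 2  # K and V, fp16

for ctx in (8_192, 32_768, 65_536):
    kv_gb = ctx * kv_bytes_per_token / 1e9
    print(f"{ctx:>6} tokens: weights ~{weights_gb:.0f} GB "
          f"+ KV cache ~{kv_gb:.1f} GB = ~{weights_gb + kv_gb:.0f} GB of 128 GB")
# Even a 64k context stays under 128 GB (before overhead), hence the Studio copes.
```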

It can also fit a 64k-token context in VRAM if you mess around with iogpu.wired_limit_mb (increasing the maximum VRAM allocation), but with 32k tokens in the context the speed drops to around 2 tokens/sec, which is not good for interactive chat but still usable if you are not in a rush (e.g. ask it to summarize a big document and go for a walk).
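If you go this route, the knob mentioned above is a sysctl on Apple Silicon; a minimal sketch of raising it is below (the 96 GB figure is only an illustrative value for a 128 GB machine, and the setting reverts on reboot):

```python
# Sketch: raise the GPU wired-memory limit on an Apple Silicon Mac so more of
# the unified memory can be pinned as VRAM. Needs sudo; resets after a reboot.
import subprocess

limit_mb = 96 * 1024  # illustrative: leave ~32 GB of a 128 GB machine for macOS
subprocess.run(
    ["sudo", "sysctl", f"iogpu.wired_limit_mb={limit_mb}"],
    check=True,
)
```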

Even better, you can get an M2 or M3 Mac Mini for about 600-800 dollars and use it solely for this purpose.


Yes, a Mac Mini can fit a 70B model in VRAM, but its memory bandwidth and GPU performance don't compare with a Mac Studio with an Ultra chip. Here's a video of someone running a 70B model on a Mac Mini: https://www.youtube.com/watch?v=xyKEQjUzfAk (it works, but very slowly).
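The gap is mostly memory bandwidth: generation on a model this size is roughly bandwidth-bound, so an upper bound on tokens/sec is bandwidth divided by the bytes streamed per token. A crude sketch using approximate published bandwidth figures (treat them as assumptions):

```python
# Crude ceiling: each generated token streams the whole weight set from memory,
# so tokens/sec <= memory bandwidth / model size in bytes.
weights_gb = 70  # ~70B params at Q8, ~1 byte/param

bandwidth_gb_s = {              # approximate published figures (assumptions)
    "M1 Ultra (Mac Studio)": 800,
    "M2 Pro (Mac Mini)": 200,
    "M2 (base Mac Mini)": 100,
}

for chip, bw in bandwidth_gb_s.items():
    print(f"{chip}: <= ~{bw / weights_gb:.1f} tokens/sec")
# M1 Ultra: ~11 tok/s ceiling (vs. ~7.5 measured above); base Mac Mini: ~1.4 tok/s.
```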
