I love the mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf

#16
by johanteekens - opened

mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf is amazing!

Fast, smart, handles logic, writes code and SQL, etc. It's running like crazy on my 2 x RTX 3090s.

llama_print_timings: load time = 731.83 ms
llama_print_timings: sample time = 19.96 ms / 199 runs ( 0.10 ms per token, 9968.94 tokens per second)
llama_print_timings: prompt eval time = 279.92 ms / 17 tokens ( 16.47 ms per token, 60.73 tokens per second)
llama_print_timings: eval time = 6256.11 ms / 198 runs ( 31.60 ms per token, 31.65 tokens per second)
llama_print_timings: total time = 6778.80 ms
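
As a quick sanity check (a minimal sketch, using only the figures from the log above), the reported eval throughput is just generated tokens divided by eval time:

# Recompute the eval throughput from the llama_print_timings numbers above.
eval_time_ms = 6256.11  # "eval time" from the log
eval_tokens = 198       # "runs" from the log

print(f"{eval_tokens / (eval_time_ms / 1000):.2f} tokens per second")  # ~31.65, matches the log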

Did you change any of the config settings? If so, what?

Nothing special.

Hardware: 2 x RTX 3090 on an ASUS ProArt B650-Creator motherboard with an AMD 7800 CPU and 32 GB RAM.
OS: Ubuntu 22.04, Driver Version 535.129.03, CUDA Version 12.2, Docker with CUDA support.
My standard development container: https://github.com/johanteekens/ml-dev-docker (it compiles llama.cpp from GitHub during the image build; see the Dockerfile).
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt  # default prompt-formatting helpers

llm = LlamaCPP(
    model_url=None,  # load from the local file below instead of downloading
    model_path="mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf",
    temperature=0.1,
    max_new_tokens=4800,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 50},  # offload all model layers to the GPUs
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
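
For reference, a minimal way to exercise the configured llm looks like this (the prompt is only an illustrative placeholder, not from the original setup):

# Quick smoke test of the configured LLM (illustrative prompt).
response = llm.complete("Write a SQL query that returns the top 5 customers by total revenue.")
print(response.text)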

Hi, how much memory does it need?
Maybe 30 GB?
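
For a rough back-of-envelope answer (assumptions of mine, not from the thread: the Q5_K_M GGUF file is about 32 GB, and the KV cache plus CUDA buffers add a few more GB at this context size):

# Approximate VRAM needed to fully offload the Q5_K_M model (estimated figures).
model_file_gb = 32.2           # approx. size of mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf
kv_cache_and_buffers_gb = 3.0  # rough allowance for KV cache + CUDA buffers at ~4K context

total_gb = model_file_gb + kv_cache_and_buffers_gb
print(f"~{total_gb:.0f} GB total, within the 48 GB of 2 x RTX 3090")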
