
This repo contains GGUF model files for cross-platform AI inference with the WasmEdge Runtime. Learn more about why and how.

Prerequisites

Install WasmEdge with the GGML plugin.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
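
Before moving on, it can help to confirm that the `wasmedge` binary is actually reachable from the current shell. The snippet below is a minimal sketch, not part of the official instructions: `have_cmd` is a hypothetical helper, and the `$HOME/.wasmedge/env` path assumes the installer used its default location.

```shell
# Succeed if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the installer wrote its environment file to the default location.
if [ -f "$HOME/.wasmedge/env" ]; then
    . "$HOME/.wasmedge/env"
fi

if have_cmd wasmedge; then
    wasmedge --version
else
    echo "wasmedge not found on PATH; open a new shell or re-run the installer"
fi
```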

Download the cross-platform Wasm apps for inference.

curl -LO https://github.com/second-state/llama-utils/raw/main/simple/llama-simple.wasm

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
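
A failed download can leave an HTML error page saved under the `.wasm` name. As a rough sanity check, the sketch below tests whether each file starts with the 4-byte WebAssembly magic number (`\0asm`). The `is_wasm` helper is hypothetical and only inspects the header, so it does not validate the rest of the module.

```shell
# Succeed if the file starts with the Wasm magic number (bytes 00 61 73 6d).
is_wasm() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "0061736d" ]
}

for app in llama-simple.wasm llama-chat.wasm; do
    if [ ! -f "$app" ]; then
        echo "$app: not downloaded yet"
    elif is_wasm "$app"; then
        echo "$app: looks like a Wasm binary"
    else
        echo "$app: not a Wasm binary, try downloading again"
    fi
done
```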

Use the quantized models

The q5_k_m versions are quantized versions of the llama2 models. They are less than half the size of the original models and consume correspondingly less VRAM, but still give high-quality inference results.
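
The size difference can be sanity-checked with a back-of-the-envelope estimate: parameter count times bits per weight, divided by 8. The sketch below assumes roughly 6.74 billion parameters for the 7b model and an average of about 5.5 bits per weight for q5_k_m. Both the `estimate_gb` helper and the 5.5 figure are illustrative assumptions; real files differ somewhat because of metadata and mixed per-tensor quantization.

```shell
# Estimate file size in GB from parameter count (in billions) and bits per weight.
estimate_gb() {
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", params * bits / 8 }'
}

estimate_gb 6.74 16    # f16: 16 bits per weight
estimate_gb 6.74 5.5   # q5_k_m: assumed average of about 5.5 bits per weight
```

With these inputs the estimates come to about 13.5 GB for f16 and about 4.6 GB for q5_k_m.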

Chat with the 7b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm

Generate text with the 7b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-q5_k_m.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '

Chat with the 13b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q5_k_m.gguf llama-chat.wasm

Generate text with the 13b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-q5_k_m.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '

Use the f16 models

The f16 version is the GGUF equivalent of the original llama2 models. It gives the best inference quality but also consumes the most resources, in both VRAM and compute time. The f16 models are also a good basis for fine-tuning.

Chat with the 7b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-f16.gguf llama-chat.wasm

Generate text with the 7b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-f16.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '

Chat with the 13b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-f16.gguf llama-chat.wasm

Generate text with the 13b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-f16.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '

Resource-constrained models

The q2_k versions are the smallest quantized versions of the llama2 models. They can run on devices with only 4GB of RAM, but the inference quality is rather low.
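
One quick way to judge whether a model is a candidate for a small device is to compare its file size against a memory budget. The sketch below does only that: `fits_budget` is a hypothetical helper, and the file size is a lower bound, since the runtime needs additional memory beyond the weights.

```shell
# Succeed if the file's size in bytes is at most the given budget in bytes.
fits_budget() {
    size=$(( $(wc -c < "$1") ))
    [ "$size" -le "$2" ]
}

model=llama-2-7b-chat-q2_k.gguf
budget=$(( 4 * 1024 * 1024 * 1024 ))   # 4 GiB, matching the figure above

if [ ! -f "$model" ]; then
    echo "$model: not found in the current directory"
elif fits_budget "$model" "$budget"; then
    echo "$model: file fits within the budget"
else
    echo "$model: file is larger than the budget"
fi
```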

Chat with the 7b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q2_k.gguf llama-chat.wasm

Generate text with the 7b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-q2_k.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '

Chat with the 13b chat model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q2_k.gguf llama-chat.wasm

Generate text with the 13b base model

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-q2_k.gguf llama-simple.wasm 'Robert Oppenheimer most important achievement is '
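
All of the commands above follow one naming pattern: `llama-2-<size>[-chat]-<quantization>.gguf`, paired with `llama-chat.wasm` for chat models and `llama-simple.wasm` for base models. The sketch below captures that pattern in two hypothetical helpers, `model_file` and `wasmedge_cmd`. It only prints the command line rather than running it, so it is safe to try without the model files present.

```shell
# Build a model file name from size (7b|13b), kind (chat|base), and quantization.
model_file() {
    case "$2" in
        chat) echo "llama-2-$1-chat-$3.gguf" ;;
        *)    echo "llama-2-$1-$3.gguf" ;;
    esac
}

# Print (do not run) the matching command line.
# The optional fourth argument is the prompt for base models.
wasmedge_cmd() {
    model=$(model_file "$1" "$2" "$3")
    case "$2" in
        chat) echo "wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model llama-chat.wasm" ;;
        *)    echo "wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model llama-simple.wasm '$4'" ;;
    esac
}

wasmedge_cmd 13b chat q2_k
wasmedge_cmd 7b base f16 'Robert Oppenheimer most important achievement is '
```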

GGUF model size: 6.74B params. Architecture: llama. Quantizations: 2-bit, 5-bit, 16-bit.
