Docker

Prerequisites

Docker must be installed and running on your system.
Create a folder to store large models and intermediate files (e.g. /llama/models).
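
As a minimal sketch, the folder can be created up front and its location kept in a variable for later commands. MODELS_DIR is a hypothetical variable name used only for illustration, and the default path is an assumption:

```shell
# Hypothetical variable and default location; any directory with enough free space works.
MODELS_DIR="${MODELS_DIR:-$HOME/llama/models}"

# Create the folder (and any missing parents); does nothing if it already exists.
mkdir -p "$MODELS_DIR"

# Show the free space on the filesystem that holds the folder.
df -h "$MODELS_DIR"
```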

Images

We have three Docker images available for this project:

ghcr.io/ggerganov/llama.cpp:full: This image includes both the main executable file and the tools to convert LLaMA models to ggml and quantize them to 4 bits. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/llama.cpp:light: This image only includes the main executable file. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/llama.cpp:server: This image only includes the server executable file. (platforms: linux/amd64, linux/arm64)

Additionally, there are the following images, similar to the above:

ghcr.io/ggerganov/llama.cpp:full-cuda: Same as full but compiled with CUDA support. (platforms: linux/amd64)
ghcr.io/ggerganov/llama.cpp:light-cuda: Same as light but compiled with CUDA support. (platforms: linux/amd64)
ghcr.io/ggerganov/llama.cpp:server-cuda: Same as server but compiled with CUDA support. (platforms: linux/amd64)
ghcr.io/ggerganov/llama.cpp:full-rocm: Same as full but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/llama.cpp:light-rocm: Same as light but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/llama.cpp:server-rocm: Same as server but compiled with ROCm support. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/llama.cpp:full-musa: Same as full but compiled with MUSA support. (platforms: linux/amd64)
ghcr.io/ggerganov/llama.cpp:light-musa: Same as light but compiled with MUSA support. (platforms: linux/amd64)
ghcr.io/ggerganov/llama.cpp:server-musa: Same as server but compiled with MUSA support. (platforms: linux/amd64)

The GPU-enabled images are not currently tested by CI beyond being built. They are built exactly as specified by the Dockerfiles defined in .devops/ and the GitHub Action defined in .github/workflows/docker.yml, with no variations. If you need different settings (for example, a different CUDA, ROCm or MUSA library), you'll need to build the images locally for now.
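
The image references listed above follow the pattern variant or variant-backend. As a small sketch, the helper below composes a reference so that it can be passed to docker pull. The name llama_image is hypothetical and exists only in this example:

```shell
# Hypothetical helper: print the image reference for a variant (full, light, server)
# and an optional backend (cuda, rocm, musa).
llama_image() {
    variant="$1"
    backend="${2:-}"
    if [ -n "$backend" ]; then
        printf 'ghcr.io/ggerganov/llama.cpp:%s-%s\n' "$variant" "$backend"
    else
        printf 'ghcr.io/ggerganov/llama.cpp:%s\n' "$variant"
    fi
}

# Example: pull the CUDA-enabled server image.
# docker pull "$(llama_image server cuda)"
```

The pull command is left commented out so that defining the helper has no side effects.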

Usage

The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command, which is included in the full Docker image.

Replace /path/to/models below with the actual path where you downloaded the models.

docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B

On completion, you are ready to play!

docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512

or with a light image:

docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512

or with a server image:

docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
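
Once the server container is running, it can be queried over HTTP on the published port. The sketch below assumes the /health and /completion endpoints of the llama.cpp server and the prompt and n_predict JSON fields, none of which are described in this document. LLAMA_URL, llama_health and llama_complete are hypothetical names, and curl must be installed on the host:

```shell
# Base URL of the server; matches the -p 8000:8000 mapping used above.
LLAMA_URL="${LLAMA_URL:-http://localhost:8000}"

# Check that the server is up (assumes a /health endpoint).
llama_health() {
    curl -s "$LLAMA_URL/health"
}

# Request a completion (assumes a /completion endpoint that accepts
# "prompt" and "n_predict"). $1 is the prompt, $2 the token limit (default 64).
# The prompt is inserted verbatim, so it must not contain unescaped double quotes.
llama_complete() {
    curl -s -X POST "$LLAMA_URL/completion" \
        -H "Content-Type: application/json" \
        -d "{\"prompt\": \"$1\", \"n_predict\": ${2:-64}}"
}
```

For example, llama_complete "Building a website can be done in 10 simple steps:" 128 would send a single request and print the JSON response.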

Docker With CUDA

Assuming the nvidia-container-toolkit is properly installed on Linux, or you are using a GPU-enabled cloud, cuBLAS should be accessible inside the container.

Building Docker locally

docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .

You may want to pass in different build ARGS, depending on the CUDA environment supported by your container host, as well as the GPU architecture.

The defaults are:

CUDA_VERSION set to 12.6.0
CUDA_DOCKER_ARCH set to the cmake build default, which includes all the supported architectures
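
These ARGS can be overridden with docker build --build-arg. The following is a sketch under stated assumptions: the function name build_server_cuda is hypothetical, the CUDA version shown is simply the documented default, and 86 is an illustrative architecture value whose expected format (a CMake-style CUDA architecture) is an assumption to check against the Dockerfile:

```shell
# Values to pass to the build; both can be overridden from the environment.
CUDA_VERSION="${CUDA_VERSION:-12.6.0}"       # the documented default
CUDA_DOCKER_ARCH="${CUDA_DOCKER_ARCH:-86}"   # assumption: a single target architecture

# Hypothetical wrapper: build the server image with explicit build arguments.
# Run it from the repository root so that .devops/ and the build context resolve.
build_server_cuda() {
    docker build -t local/llama.cpp:server-cuda \
        --build-arg CUDA_VERSION="$CUDA_VERSION" \
        --build-arg CUDA_DOCKER_ARCH="$CUDA_DOCKER_ARCH" \
        -f .devops/llama-server-cuda.Dockerfile .
}
```

Restricting the build to one architecture should shorten compilation compared with the default of all supported architectures, at the cost of an image that only targets that GPU generation.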

The resulting images are essentially the same as the non-CUDA images:

local/llama.cpp:full-cuda: This image includes both the main executable file and the tools to convert LLaMA models to ggml and quantize them to 4 bits.
local/llama.cpp:light-cuda: This image only includes the main executable file.
local/llama.cpp:server-cuda: This image only includes the server executable file.

Usage

After building locally, usage is similar to the non-CUDA examples, but you'll need to add the --gpus flag. You will also want to use the --n-gpu-layers flag.

docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models -p 8000:8000 local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
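
The commands above offload a single layer. Since --n-gpu-layers sets how many layers are kept on the GPU, a larger value offloads more of the model. The sketch below makes that value configurable. MODELS_DIR, N_GPU_LAYERS and run_light_cuda are hypothetical names, and 32 is an illustrative number rather than a recommendation:

```shell
# Host folder holding the models, and the number of layers to offload.
MODELS_DIR="${MODELS_DIR:-/path/to/models}"
N_GPU_LAYERS="${N_GPU_LAYERS:-32}"   # hypothetical value; tune to the available VRAM

# Hypothetical wrapper around the light-cuda example; $1 is the prompt.
# --rm removes the container once the run finishes.
run_light_cuda() {
    docker run --rm --gpus all -v "$MODELS_DIR:/models" local/llama.cpp:light-cuda \
        -m /models/7B/ggml-model-q4_0.gguf \
        -p "$1" -n 512 --n-gpu-layers "$N_GPU_LAYERS"
}
```

A call such as run_light_cuda "Building a website can be done in 10 simple steps:" then behaves like the light-cuda command above, with the larger layer count.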

Docker With MUSA

Assuming one has the mt-container-toolkit properly installed on Linux, muBLAS should be accessible inside the container.

Building Docker locally

docker build -t local/llama.cpp:full-musa -f .devops/full-musa.Dockerfile .
docker build -t local/llama.cpp:light-musa -f .devops/llama-cli-musa.Dockerfile .
docker build -t local/llama.cpp:server-musa -f .devops/llama-server-musa.Dockerfile .

You may want to pass in different build ARGS, depending on the MUSA environment supported by your container host, as well as the GPU architecture.

The defaults are:

MUSA_VERSION set to rc3.1.0

The resulting images are essentially the same as the non-MUSA images:

local/llama.cpp:full-musa: This image includes both the main executable file and the tools to convert LLaMA models to ggml and quantize them to 4 bits.
local/llama.cpp:light-musa: This image only includes the main executable file.
local/llama.cpp:server-musa: This image only includes the server executable file.

Usage

After building locally, usage is similar to the non-MUSA examples, but you'll need to set mthreads as the default Docker runtime. To do so, run (cd /usr/bin/musa && sudo ./docker setup $PWD) and then verify the change by running docker info | grep mthreads on the host machine. You will also want to use the --n-gpu-layers flag.

docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models -p 8000:8000 local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
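
Before running the commands above, the runtime check described earlier can be scripted. This is a minimal sketch that wraps the same docker info | grep mthreads test; the function name has_mthreads_runtime and the messages are hypothetical:

```shell
# Return success if the output of docker info mentions the mthreads runtime.
has_mthreads_runtime() {
    docker info 2>/dev/null | grep -q mthreads
}

if has_mthreads_runtime; then
    echo "mthreads runtime is registered with Docker"
else
    echo "mthreads runtime not found; run the setup step first" >&2
fi
```

The check only confirms that Docker lists mthreads; it does not verify that a GPU is actually usable inside a container.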