bingal/Qwen1.5-7B-Chat-llamafile

About llamafile

GitHub
llamafile documentation in Chinese
The llamafile model collection on modelscope.cn
Qwen1.5-7B-Chat-llamafile on modelscope.cn

Usage

Windows cannot run a single exe file larger than 4GB, so on Windows you need to download the llamafile executable and the GGUF model as separate files and pass the model to the executable at launch. Alternatively, you can run the all-in-one llamafile under the Windows Subsystem for Linux (WSL), which is not subject to the 4GB limit.
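As a rough illustration of the size limit described above, the following sketch checks whether a downloaded file is small enough to run directly as an exe on Windows. The helper names are hypothetical, and treating "4GB" as 4 GiB (2**32 bytes) is an assumption, not something this README states.

```python
import os

# Assumption: the "4GB" limit is interpreted as 4 GiB (2**32 bytes).
WINDOWS_EXE_LIMIT = 4 * 1024**3


def exceeds_windows_exe_limit(size_bytes: int) -> bool:
    """Return True if a file of this size is over the assumed exe size limit."""
    return size_bytes > WINDOWS_EXE_LIMIT


def needs_separate_download(path: str) -> bool:
    """Hypothetical helper: check a file that already exists on disk."""
    return exceeds_windows_exe_limit(os.path.getsize(path))
```

If the check returns True for an all-in-one llamafile, follow the Windows steps (separate executable and GGUF model) or run the file under WSL.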

Windows

Download llamafile.
Rename the file to llamafile-0.6.2.exe
Download GGUF model qwen1_5-7b-chat-q5_k_m.gguf
Open a terminal window and run: .\llamafile-0.6.2.exe -m .\qwen1_5-7b-chat-q5_k_m.gguf -ngl 9999 --host 0.0.0.0 --port 8080
Open http://127.0.0.1:8080 in a browser to start chatting

Linux / macOS

Download model: qwen1.5-7b-chat-q5_k_m.llamafile
Run the model

Add execution permissions: chmod +x ./qwen1.5-7b-chat-q5_k_m.llamafile
Run in terminal: ./qwen1.5-7b-chat-q5_k_m.llamafile
Open browser to http://127.0.0.1:8080 to start chatting
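Loading a 7B model can take a while, so the web page may not respond right after the process starts. The sketch below polls the port until it accepts TCP connections. The function name and the timeout values are illustrative assumptions. It only checks that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_server(host: str = "127.0.0.1", port: int = 8080,
                    timeout: float = 60.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(0.5)
    return False
```

A script could call wait_for_server() after launching the llamafile and only then open the browser or send API requests.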

OpenAI API usage

API URL: http://127.0.0.1:8080/v1
Python code:

#!/usr/bin/env python3
from openai import OpenAI

# Point the client at the local llamafile server instead of the OpenAI service.
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # or "http://<your API server IP>:<port>/v1"
    api_key="sk-no-key-required",  # placeholder value; no real key is needed
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "Write a story about a dragon"},
    ],
)
# Print the assistant's reply message.
print(completion.choices[0].message)
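If the openai package is not installed, the same endpoint can be called with the Python standard library. This is a minimal sketch, assuming the server exposes /chat/completions under the API URL above and returns the usual OpenAI-style response shape. The function names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="LLaMA_CPP"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, matching the client example above.
            "Authorization": "Bearer sk-no-key-required",
        },
        method="POST",
    )


def chat(base_url, messages):
    """Send the request and return the reply text (assumes an OpenAI-style body)."""
    with urllib.request.urlopen(build_chat_request(base_url, messages)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("http://127.0.0.1:8080/v1", [{"role": "user", "content": "Write a story about a dragon"}]) should return the generated text.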

Parameter Description

-ngl 9999 sets how many layers of the model are offloaded to the GPU; the remaining layers run on the CPU. If no GPU is available, use -ngl 0. The default is 9999, which runs every layer on the GPU (the GPU driver and the CUDA runtime must be installed).
--host 0.0.0.0 sets the address the web service listens on. For local-only access, use --host 127.0.0.1. With 0.0.0.0, other machines on the network can reach the service through this machine's IP address.
--port 8080 sets the port for the web service. The default is 8080.
-t 16 sets the number of threads. When running on the CPU, choose a value based on the number of CPU cores.
Other parameters can be viewed with --help.
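To show how the parameters above combine, here is a small sketch that assembles a launch command as an argument list. The function name and its defaults are illustrative assumptions. It uses the -m flag to pass a separate GGUF model, and omits it when the model is embedded in the llamafile itself.

```python
def build_llamafile_command(binary, model=None, gpu_layers=9999,
                            host="127.0.0.1", port=8080, threads=None):
    """Return the argument list for launching a llamafile server."""
    cmd = [binary]
    if model is not None:
        # Separate GGUF model file (the Windows setup described above).
        cmd += ["-m", model]
    cmd += ["-ngl", str(gpu_layers), "--host", host, "--port", str(port)]
    if threads is not None:
        # Only pass -t when a thread count was chosen explicitly.
        cmd += ["-t", str(threads)]
    return cmd
```

The resulting list can be passed to subprocess.run(). For example, a CPU-only run would use gpu_layers=0 together with a threads value matched to the CPU core count.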