Synatra-kiqu-10.7b-awq

Model creator: Jeonghwan Park
Original model: maywell/Synatra-kiqu-10.7B

Description

This repo contains AWQ model files for maywell/Synatra-kiqu-10.7B.

About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. It offers faster Transformers-based inference than GPTQ, with quality equivalent to or better than the most commonly used GPTQ settings.
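
As a rough illustration of what 4-bit quantization means for this model, the sketch below estimates weight storage at 16 and 4 bits per weight. It assumes the parameter count implied by the "10.7B" in the model name, and it ignores per-group scales, zero points, and any layers left unquantized, so actual file sizes and GPU memory use will be somewhat higher. The function name is hypothetical and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count taken from the "10.7B" in the model name.
NUM_PARAMS = 10.7e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized FP16 weights
awq4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"FP16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{awq4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the FP16 footprint.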

It is supported by:

Text Generation Webui - using Loader: AutoAWQ
vLLM - Llama and Mistral models only
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code

Using OpenAI Chat API with vLLM

Documentation on installing and using vLLM is available in the official vLLM documentation.

Please ensure you are using vLLM version 0.2 or later.
When using vLLM as a server, pass the --quantization awq parameter.

Start the OpenAI-Compatible Server:

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.

python3 -m vllm.entrypoints.openai.api_server --model Copycats/Synatra-kiqu-10.7B-awq --quantization awq --dtype half

--model: Hugging Face model path
--quantization: "awq"
--dtype: "half" for FP16. Recommended for AWQ quantization.
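
If you prefer to launch the server from a Python script, the flags above can be assembled into an argument list. This is a minimal sketch: `build_server_command` is a hypothetical helper, not part of vLLM, and it only builds and prints the command. It assumes vLLM is installed in the same Python environment.

```python
import shlex
import sys


def build_server_command(model: str, quantization: str = "awq", dtype: str = "half") -> list[str]:
    """Assemble the argument list for the vLLM OpenAI-compatible server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", quantization,
        "--dtype", dtype,
    ]


command = build_server_command("Copycats/Synatra-kiqu-10.7B-awq")
print(shlex.join(command))  # shell-safe rendering of the command
```

To actually start the server, pass `command` to `subprocess.Popen(command)` and wait until it is listening before sending requests.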

Querying the model using OpenAI Chat API:

You can use the create chat completion endpoint to communicate with the model in a chat-like interface:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Copycats/Synatra-kiqu-10.7B-awq",
        "messages": [
            {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
            {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"}
        ]
    }'
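
The same request can be sent without curl using only the Python standard library. The sketch below mirrors the curl call above. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the sending step assumes the server from the previous step is running on localhost:8000, so it is left commented out.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


messages = [
    # "You are an assistant who kindly answers the user's questions."
    {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
    # "What should I do when I feel sad for no reason and start crying?"
    {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"},
]
request = build_chat_request(
    "http://localhost:8000/v1", "Copycats/Synatra-kiqu-10.7B-awq", messages
)
# Requires a running vLLM server:
# print(send_chat_request(request))
```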

Python Client Example:

Using the openai python package, you can also communicate with the model in a chat-like manner:

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Copycats/Synatra-kiqu-10.7B-awq",
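    messages=[
        # "You are an assistant who kindly answers the user's questions."
        {"role": "system", "content": "당신은 사용자의 질문에 친절하게 답변하는 어시스턴트입니다."},
        # "What should I do when I feel sad for no reason and start crying?"
        {"role": "user", "content": "괜스레 슬퍼서 눈물이 나면 어떻게 하나요?"},
    ]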
)
print("Chat response:", chat_response)
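
The example above prints the whole response object. To get only the generated text, read the first choice's message content. The sketch below shows that access pattern. `extract_reply` is a hypothetical helper, and the stand-in object is built by hand only so the snippet runs without a server. With a live server you would pass `chat_response` instead.

```python
from types import SimpleNamespace


def extract_reply(chat_response) -> str:
    """Return the assistant's text from a chat completion response."""
    return chat_response.choices[0].message.content


# Stand-in with the same attribute layout as a chat completion response.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(message=SimpleNamespace(role="assistant", content="Hello!"))
    ]
)
print(extract_reply(fake_response))
```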
