# OpenAI-Compatible RESTful APIs
FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs.
The FastChat server is compatible with both [openai-python](https://github.com/openai/openai-python) library and cURL commands.
The following OpenAI APIs are supported:
- Chat Completions. (Reference: https://platform.openai.com/docs/api-reference/chat)
- Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
- Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)
The REST API can also be used from Google Colab, as demonstrated in the [FastChat_API_GoogleColab.ipynb](https://github.com/lm-sys/FastChat/blob/main/playground/FastChat_API_GoogleColab.ipynb) notebook in our repository, which walks through using the API within the Colab environment.
## RESTful API Server
First, launch the controller
```bash
python3 -m fastchat.serve.controller
```
Then, launch the model worker(s)
```bash
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
```
Finally, launch the RESTful API server
```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
Now, let us test the API server.
### OpenAI Official SDK
The goal of `openai_api_server.py` is to implement a fully OpenAI-compatible API server, so the models can be used directly with [openai-python](https://github.com/openai/openai-python) library.
First, install the OpenAI Python package (version >= 1.0):
```bash
pip install --upgrade openai
```
Then, interact with the Vicuna model:
```python
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
model = "vicuna-7b-v1.5"
prompt = "Once upon a time"
# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)
# create a chat completion
completion = openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)
```
Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py). If your API server is behind a proxy, you'll need to turn off buffering; in Nginx, you can do so by setting `proxy_buffering off;` in the location block for the proxy.
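For a quick illustration, here is a minimal streaming sketch that reuses the client settings from the example above (the prompt is arbitrary):
```python
# stream a chat completion token-by-token
stream = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental delta; content may be empty on some chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```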
### cURL
cURL is another good tool for inspecting the output of the API.
List Models:
```bash
curl http://localhost:8000/v1/models
```
Chat Completions:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
```
Text Completions:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}'
```
Embeddings:
```bash
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"input": "Hello world!"
}'
```
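The embeddings endpoint can also be called from the Python SDK. A minimal sketch, assuming the same server setup as above:
```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

# request an embedding for a single input string
response = openai.embeddings.create(model="vicuna-7b-v1.5", input="Hello world!")

# each item in response.data holds one embedding vector
print(len(response.data[0].embedding))
```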
### Running multiple models
If you want to run multiple models on the same machine and in the same process,
you can replace the `model_worker` step above with a multi-model variant:
```bash
python3 -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.5 \
--model-names vicuna-7b-v1.5 \
--model-path lmsys/longchat-7b-16k \
--model-names longchat-7b-16k
```
This loads both models onto the same accelerator in the same process. This
works best when using a Peft model that triggers the `PeftModelAdapter`.
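Once both workers are registered, either model can be addressed by name through the same API endpoint, for example:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "longchat-7b-16k",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
```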
TODO: Base model weight optimization will be fixed once [this Peft issue](https://github.com/huggingface/peft/issues/430) is resolved.
## LangChain Support
This OpenAI-compatible API server supports LangChain. See [LangChain Integration](langchain_integration.md) for details.
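As a minimal sketch (assuming the `langchain-openai` package is installed; see the linked guide for the full integration), pointing LangChain at the local server might look like:
```python
from langchain_openai import ChatOpenAI

# point the LangChain OpenAI client at the local FastChat server
llm = ChatOpenAI(
    model="vicuna-7b-v1.5",
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
print(llm.invoke("Hello! What is your name?").content)
```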
## Adjusting Environment Variables
### Timeout
By default, a timeout error will occur if a model worker does not respond within 100 seconds. If your model/hardware is slower, you can change this timeout through an environment variable:
```bash
export FASTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>
```
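For example, to allow each worker up to five minutes per request:
```bash
# example value; pick a limit that suits your model and hardware
export FASTCHAT_WORKER_API_TIMEOUT=300
```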
### Batch size
If you encounter an out-of-memory (OOM) error while creating embeddings, you can use a smaller batch size by setting
```bash
export FASTCHAT_WORKER_API_EMBEDDING_BATCH_SIZE=1
```
## Todos
Some features to be implemented:
- [ ] Support more parameters like `logprobs`, `logit_bias`, `user`, `presence_penalty` and `frequency_penalty`
- [ ] Model details (permissions, owner and create time)
- [ ] Edits API
- [ ] Rate Limitation Settings