zilongpa committed on
Commit
55be9e4
1 Parent(s): dc1b9b1

Upload folder using huggingface_hub

.env ADDED
@@ -0,0 +1,10 @@
+ MODEL_PATH = "/path-to/Llama-2-7b-chat-hf"
+ LOAD_IN_8BIT = True
+ LOAD_IN_4BIT = False
+ LLAMA_CPP = False
+
+ MAX_MAX_NEW_TOKENS = 2048
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
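
These keys are consumed at startup by `app.py` (added later in this commit) via `python-dotenv`. A minimal sketch of how the values above become Python config, mirroring the calls `app.py` makes; the fallback defaults shown are the same ones `app.py` hard-codes:

```python
# Sketch: how app.py reads the .env file above (see app.py later in this commit).
import os
from distutils.util import strtobool

from dotenv import load_dotenv

load_dotenv()  # reads key = value pairs from .env in the working directory

MODEL_PATH = os.getenv("MODEL_PATH")  # e.g. "/path-to/Llama-2-7b-chat-hf"
LOAD_IN_8BIT = bool(strtobool(os.getenv("LOAD_IN_8BIT", "True")))
LLAMA_CPP = bool(strtobool(os.getenv("LLAMA_CPP", "False")))
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4000"))
```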
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ 76437bc4e8bea417641aaa076508098a7158e664c1cecfabfa41df497a27f98c filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ models
+ dist
+
+ .DS_Store
+ .vscode
+
+ __pycache__
+ gradio_cached_examples
76437bc4e8bea417641aaa076508098a7158e664c1cecfabfa41df497a27f98c ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76437bc4e8bea417641aaa076508098a7158e664c1cecfabfa41df497a27f98c
+ size 3825517184
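
Note that the file above is not the model itself but a Git LFS pointer: the repository stores this three-line stub, and `git lfs` fetches the 3.8 GB blob whose SHA-256 digest matches the `oid`. A small illustrative parser (a hypothetical helper, not part of this repo):

```python
# Hypothetical helper (illustration only): parse a Git LFS pointer file.
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),  # 3825517184 bytes ~ 3.8 GB here
    }
```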
CONTRIBUTING.md ADDED
@@ -0,0 +1,90 @@
+ # Contributing to [llama2-webui](https://github.com/liltom-eth/llama2-webui)
+
+ We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
+
+ - Reporting a bug
+ - Proposing new features
+ - Discussing the current state of the code
+ - Updating README.md
+ - Submitting a PR
+
+ ## Using GitHub's [issues](https://github.com/liltom-eth/llama2-webui/issues)
+
+ We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/liltom-eth/llama2-webui/issues). It's that easy!
+
+ Thanks to **[jlb1504](https://github.com/jlb1504)** for reporting the [first issue](https://github.com/liltom-eth/llama2-webui/issues/1)!
+
+ **Great bug reports** tend to have:
+
+ - A quick summary and/or background
+ - Steps to reproduce
+   - Be specific!
+   - Give sample code if you can.
+ - What you expected to happen
+ - What actually happens
+ - Notes (possibly including why you think this might be happening, or things you tried that didn't work)
+
+ Proposals for new features are also welcome.
+
+ ## Pull Request
+
+ All pull requests are welcome. For example, you could update the `README.md` to help users better understand the usage.
+
+ ### Clone the repository
+
+ 1. Create a user account on GitHub if you do not already have one.
+
+ 2. Fork the project [repository](https://github.com/liltom-eth/llama2-webui): click on the *Fork* button near the top of the page. This creates a copy of the code under your account on GitHub.
+
+ 3. Clone this copy to your local disk:
+
+ ```
+ git clone git@github.com:liltom-eth/llama2-webui.git
+ cd llama2-webui
+ ```
+
+ ### Implement your changes
+
+ 1. Create a branch to hold your changes:
+
+ ```
+ git checkout -b my-feature
+ ```
+
+ and start making changes. Never work on the main branch!
+
+ 2. Start your work on this branch.
+
+ 3. When you’re done editing, do:
+
+ ```
+ git add <MODIFIED FILES>
+ git commit
+ ```
+
+ to record your changes in [git](https://git-scm.com/).
+
+ ### Submit your contribution
+
+ 1. If everything works fine, push your local branch to the remote server with:
+
+ ```
+ git push -u origin my-feature
+ ```
+
+ 2. Go to the web page of your fork and click "Create pull request" to send your changes for review.
+
+ ```{todo}
+ Find more detailed information in [creating a PR]. You might also want to open
+ the PR as a draft first and mark it as ready for review after feedback
+ from the continuous integration (CI) system or any required fixes.
+ ```
+
+ ## License
+
+ By contributing, you agree that your contributions will be licensed under the MIT License.
+
+ ## Questions?
+
+ Email us at [[email protected]](mailto:[email protected])
+
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Tom
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,217 @@
  ---
- title: Llama2 Webui
- emoji: 📈
- colorFrom: yellow
- colorTo: purple
+ title: llama2-webui
+ app_file: app_4bit_ggml.py
  sdk: gradio
- sdk_version: 3.39.0
- app_file: app.py
- pinned: false
+ sdk_version: 3.37.0
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # llama2-webui
+
+ Running Llama 2 with gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac).
+ - Supporting all Llama 2 models (7B, 13B, 70B, GPTQ, GGML) in 8-bit and 4-bit modes.
+ - Supporting GPU inference with at least 6 GB VRAM, and CPU inference.
+
+ ![screenshot](./static/screenshot.png)
+
+ ## Features
+
+ - Supporting models: [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), all [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), all [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) ...
+ - Supporting model backends
+   - Nvidia GPU: transformers, [bitsandbytes (8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ (4-bit inference)](https://github.com/PanQiWei/AutoGPTQ)
+     - GPU inference with at least 6 GB VRAM
+
+   - CPU, Mac/AMD GPU: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+     - CPU inference [Demo](https://twitter.com/liltom_eth/status/1682791729207070720?s=20) on a MacBook Air.
+
+ - Web UI interface: gradio
+
+ ## Contents
+
+ - [Install](#install)
+ - [Download Llama-2 Models](#download-llama-2-models)
+   - [Model List](#model-list)
+   - [Download Script](#download-script)
+ - [Usage](#usage)
+   - [Config Examples](#config-examples)
+   - [Start Web UI](#start-web-ui)
+   - [Run on Nvidia GPU](#run-on-nvidia-gpu)
+     - [Run on Low Memory GPU with 8 bit](#run-on-low-memory-gpu-with-8-bit)
+     - [Run on Low Memory GPU with 4 bit](#run-on-low-memory-gpu-with-4-bit)
+   - [Run on CPU](#run-on-cpu)
+     - [Mac GPU and AMD/Nvidia GPU Acceleration](#mac-gpu-and-amdnvidia-gpu-acceleration)
+   - [Benchmark](#benchmark)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+
+
+ ## Install
+ ### Method 1: From [PyPI](https://pypi.org/project/llama2-wrapper/)
+ ```
+ pip install llama2-wrapper
+ ```
+ ### Method 2: From Source
+ ```
+ git clone https://github.com/liltom-eth/llama2-webui.git
+ cd llama2-webui
+ pip install -r requirements.txt
+ ```
+ ### Install Issues
+ `bitsandbytes >= 0.39` may not work on older NVIDIA GPUs. In that case, to use `LOAD_IN_8BIT`, you may have to downgrade like this:
+
+ - `pip install bitsandbytes==0.38.1`
+
+ `bitsandbytes` also needs a special install for Windows:
+ ```
+ pip uninstall bitsandbytes
+ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.0-py3-none-win_amd64.whl
+ ```
+
+ ## Download Llama-2 Models
+
+ Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
+
+ Llama-2-7b-Chat-GPTQ contains the GPTQ model files for [Meta's Llama 2 7b Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). GPTQ 4-bit Llama-2 models require less GPU VRAM to run.
+
+ ### Model List
+
+ | Model Name                     | set MODEL_PATH in .env                   | Download URL                                                 |
+ | ------------------------------ | ---------------------------------------- | ------------------------------------------------------------ |
+ | meta-llama/Llama-2-7b-chat-hf  | /path-to/Llama-2-7b-chat-hf              | [Link](https://huggingface.co/llamaste/Llama-2-7b-chat-hf)   |
+ | meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf             | [Link](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)  |
+ | meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf             | [Link](https://huggingface.co/llamaste/Llama-2-70b-chat-hf)  |
+ | meta-llama/Llama-2-7b-hf       | /path-to/Llama-2-7b-hf                   | [Link](https://huggingface.co/meta-llama/Llama-2-7b-hf)      |
+ | meta-llama/Llama-2-13b-hf      | /path-to/Llama-2-13b-hf                  | [Link](https://huggingface.co/meta-llama/Llama-2-13b-hf)     |
+ | meta-llama/Llama-2-70b-hf      | /path-to/Llama-2-70b-hf                  | [Link](https://huggingface.co/meta-llama/Llama-2-70b-hf)     |
+ | TheBloke/Llama-2-7b-Chat-GPTQ  | /path-to/Llama-2-7b-Chat-GPTQ            | [Link](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ) |
+ | TheBloke/Llama-2-7B-Chat-GGML  | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | [Link](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) |
+ | ...                            | ...                                      | ...                                                          |
+
+ Running the 4-bit model `Llama-2-7b-Chat-GPTQ` needs a GPU with 6 GB VRAM.
+
+ Running the 4-bit model `llama-2-7b-chat.ggmlv3.q4_0.bin` needs a CPU with 6 GB RAM. There is also a list of other 2-, 3-, 4-, 5-, 6-, and 8-bit GGML models that can be used from [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML).
+
+ ### Download Script
+
+ These models can be downloaded from the command line like:
+
+ ```bash
+ # Make sure you have git-lfs installed (https://git-lfs.com)
+ git lfs install
+ git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf
+ ```
+
+ To download Llama 2 models, you need to request access from [https://ai.meta.com/llama/](https://ai.meta.com/llama/) and also enable access on repos like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main). Requests will be processed in hours.
+
+ For GPTQ models like [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), you can directly download without requesting access.
+
+ For GGML models like [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML), you can directly download without requesting access.
+
+ ## Usage
+
+ ### Config Examples
+
+ Set up your `MODEL_PATH` and model configs in the `.env` file.
+
+ There are some examples in the `./env_examples/` folder.
+
+ | Model Setup                       | Example .env                |
+ | --------------------------------- | --------------------------- |
+ | Llama-2-7b-chat-hf 8-bit on GPU   | .env.7b_8bit_example        |
+ | Llama-2-7b-Chat-GPTQ 4-bit on GPU | .env.7b_gptq_example        |
+ | Llama-2-7B-Chat-GGML 4-bit on CPU | .env.7b_ggmlv3_q4_0_example |
+ | Llama-2-13b-chat-hf on GPU        | .env.13b_example            |
+ | ...                               | ...                         |
+
+ ### Start Web UI
+
+ Run the chatbot with web UI:
+
+ ```
+ python app.py
+ ```
+
+ ### Run on Nvidia GPU
+
+ Running requires around 14 GB of GPU VRAM for Llama-2-7b and 28 GB of GPU VRAM for Llama-2-13b.
+
+ If you are running on multiple GPUs, the model will be loaded across them automatically, splitting the VRAM usage. That allows you to run Llama-2-7b (which requires 14 GB of GPU VRAM) on a setup like two GPUs with 11 GB VRAM each.
+
+ #### Run on Low Memory GPU with 8 bit
+
+ If you do not have enough memory, you can set `LOAD_IN_8BIT` to `True` in `.env`. This can reduce memory usage by around half, with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backends.
+
+ Llama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, like an Nvidia RTX 2080 Ti, RTX 4080, T4, or V100 (16 GB).
+
+ #### Run on Low Memory GPU with 4 bit
+
+ If you want to run a 4-bit Llama-2 model like `Llama-2-7b-Chat-GPTQ`, you can set `LOAD_IN_4BIT` to `True` in `.env`, as in the example `.env.7b_gptq_example`.
+
+ Make sure you have downloaded the 4-bit model from `Llama-2-7b-Chat-GPTQ` and set the `MODEL_PATH` and arguments in the `.env` file.
+
+ `Llama-2-7b-Chat-GPTQ` can run on a single GPU with 6 GB of VRAM.
+
+ ### Run on CPU
+
+ Running a Llama-2 model on CPU requires the [llama.cpp](https://github.com/ggerganov/llama.cpp) dependency and the [llama.cpp Python bindings](https://github.com/abetlen/llama-cpp-python), which are already installed.
+
+
+ Download GGML models like `llama-2-7b-chat.ggmlv3.q4_0.bin` following the [Download Llama-2 Models](#download-llama-2-models) section. The `llama-2-7b-chat.ggmlv3.q4_0.bin` model requires at least 6 GB RAM to run on CPU.
+
+ Set up configs like `.env.7b_ggmlv3_q4_0_example` from `env_examples` as `.env`.
+
+ Run the web UI with `python app.py`.
+
+
+
+ #### Mac GPU and AMD/Nvidia GPU Acceleration
+
+ If you would like to use a Mac GPU or an AMD/Nvidia GPU for acceleration, check these:
+
+ - [Installation with OpenBLAS / cuBLAS / CLBlast / Metal](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal)
+
+ - [MacOS Install with Metal GPU](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)
+
+ ### Benchmark
+
+ Run the benchmark script to compute performance on your device:
+
+ ```bash
+ python benchmark.py
+ ```
+
+ `benchmark.py` will load the same `.env` as `app.py`.
+
+ Some benchmark performance:
+
+ | Model                | Precision | Device             | GPU VRAM    | Speed (tokens/sec) | Load time (s) |
+ | -------------------- | --------- | ------------------ | ----------- | ------------------ | ------------- |
+ | Llama-2-7b-chat-hf   | 8 bit     | NVIDIA RTX 2080 Ti | 7.7 GB VRAM | 3.76               | 783.87        |
+ | Llama-2-7b-Chat-GPTQ | 4 bit     | NVIDIA RTX 2080 Ti | 5.8 GB VRAM | 12.08              | 192.91        |
+ | Llama-2-7B-Chat-GGML | 4 bit     | Intel i7-8700      | 5.1 GB RAM  | 4.16               | 105.75        |
+
+ Check / contribute the performance of your device in the full [performance doc](./docs/performance.md).
+
+ ## Contributing
+
+ Kindly read our [Contributing Guide](CONTRIBUTING.md) to learn about our development process.
+
+ ### All Contributors
+
+ <a href="https://github.com/liltom-eth/llama2-webui/graphs/contributors">
+   <img src="https://contrib.rocks/image?repo=liltom-eth/llama2-webui" />
+ </a>
+
+ ## License
+
+ MIT - see [MIT License](LICENSE)
+
+ This project enables users to adapt it freely for proprietary purposes without any restrictions.
+
+ ## Credits
+
+ - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
+ - https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat
+ - https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ
+ - [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
+ - [https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+ - [https://github.com/PanQiWei/AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)
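
The `llama2-wrapper` package the README installs can also be driven without the web UI. A minimal sketch following the exact calls `app.py` makes below; the GGML model path is a placeholder, and the config keys match `llama2_wrapper/model.py` later in this commit:

```python
from llama2_wrapper import LLAMA2_WRAPPER

config = {
    "model_name": "/path-to/llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder path
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llama_cpp": True,  # CPU inference through llama.cpp
    "MAX_INPUT_TOKEN_LENGTH": 4000,
}
llama2_wrapper = LLAMA2_WRAPPER(config)
llama2_wrapper.init_tokenizer()
llama2_wrapper.init_model()

# run() returns a generator that yields the growing answer string as tokens stream in.
for partial in llama2_wrapper.run(
    "Hello there! How are you doing?",
    [],  # empty chat history
    "You are a helpful assistant.",
    max_new_tokens=256,
):
    print(partial)
```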
app.py ADDED
@@ -0,0 +1,322 @@
+ import os
+ from typing import Iterator
+
+ import gradio as gr
+
+ from dotenv import load_dotenv
+ from distutils.util import strtobool
+
+ from llama2_wrapper import LLAMA2_WRAPPER
+
+ load_dotenv()
+
+ DEFAULT_SYSTEM_PROMPT = (
+     os.getenv("DEFAULT_SYSTEM_PROMPT")
+     if os.getenv("DEFAULT_SYSTEM_PROMPT") is not None
+     else ""
+ )
+ MAX_MAX_NEW_TOKENS = (
+     int(os.getenv("MAX_MAX_NEW_TOKENS"))
+     if os.getenv("MAX_MAX_NEW_TOKENS") is not None
+     else 2048
+ )
+ DEFAULT_MAX_NEW_TOKENS = (
+     int(os.getenv("DEFAULT_MAX_NEW_TOKENS"))
+     if os.getenv("DEFAULT_MAX_NEW_TOKENS") is not None
+     else 1024
+ )
+ MAX_INPUT_TOKEN_LENGTH = (
+     int(os.getenv("MAX_INPUT_TOKEN_LENGTH"))
+     if os.getenv("MAX_INPUT_TOKEN_LENGTH") is not None
+     else 4000
+ )
+
+ MODEL_PATH = os.getenv("MODEL_PATH")
+ assert MODEL_PATH is not None, f"MODEL_PATH is required, got: {MODEL_PATH}"
+
+ LOAD_IN_8BIT = bool(strtobool(os.getenv("LOAD_IN_8BIT", "True")))
+
+ LOAD_IN_4BIT = bool(strtobool(os.getenv("LOAD_IN_4BIT", "True")))
+
+ LLAMA_CPP = bool(strtobool(os.getenv("LLAMA_CPP", "True")))
+
+ if LLAMA_CPP:
+     print("Running on CPU with llama.cpp.")
+ else:
+     import torch
+
+     if torch.cuda.is_available():
+         print("Running on GPU with torch transformers.")
+     else:
+         print("CUDA not found.")
+
+ config = {
+     "model_name": MODEL_PATH,
+     "load_in_8bit": LOAD_IN_8BIT,
+     "load_in_4bit": LOAD_IN_4BIT,
+     "llama_cpp": LLAMA_CPP,
+     "MAX_INPUT_TOKEN_LENGTH": MAX_INPUT_TOKEN_LENGTH,
+ }
+ llama2_wrapper = LLAMA2_WRAPPER(config)
+ llama2_wrapper.init_tokenizer()
+ llama2_wrapper.init_model()
+
+ DESCRIPTION = """
+ # llama2-webui
+
+ This is a chatbot based on Llama-2.
+ - Supporting models: [Llama-2-7b](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), all [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), all [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) ...
+ - Supporting model backends
+   - Nvidia GPU (at least 6 GB VRAM): transformers, [bitsandbytes (8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ (4-bit inference)](https://github.com/PanQiWei/AutoGPTQ)
+   - CPU (at least 6 GB RAM), Mac/AMD GPU: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ """
+
+
+ def clear_and_save_textbox(message: str) -> tuple[str, str]:
+     return "", message
+
+
+ def display_input(
+     message: str, history: list[tuple[str, str]]
+ ) -> list[tuple[str, str]]:
+     history.append((message, ""))
+     return history
+
+
+ def delete_prev_fn(history: list[tuple[str, str]]) -> tuple[list[tuple[str, str]], str]:
+     try:
+         message, _ = history.pop()
+     except IndexError:
+         message = ""
+     return history, message or ""
+
+
+ def generate(
+     message: str,
+     history_with_input: list[tuple[str, str]],
+     system_prompt: str,
+     max_new_tokens: int,
+     temperature: float,
+     top_p: float,
+     top_k: int,
+ ) -> Iterator[list[tuple[str, str]]]:
+     if max_new_tokens > MAX_MAX_NEW_TOKENS:
+         raise ValueError
+
+     history = history_with_input[:-1]
+     generator = llama2_wrapper.run(
+         message, history, system_prompt, max_new_tokens, temperature, top_p, top_k
+     )
+     try:
+         first_response = next(generator)
+         yield history + [(message, first_response)]
+     except StopIteration:
+         yield history + [(message, "")]
+     for response in generator:
+         yield history + [(message, response)]
+
+
+ def process_example(message: str) -> tuple[str, list[tuple[str, str]]]:
+     generator = generate(message, [], DEFAULT_SYSTEM_PROMPT, 1024, 1, 0.95, 50)
+     for x in generator:
+         pass
+     return "", x
+
+
+ def check_input_token_length(
+     message: str, chat_history: list[tuple[str, str]], system_prompt: str
+ ) -> None:
+     input_token_length = llama2_wrapper.get_input_token_length(
+         message, chat_history, system_prompt
+     )
+     if input_token_length > MAX_INPUT_TOKEN_LENGTH:
+         raise gr.Error(
+             f"The accumulated input is too long ({input_token_length} > {MAX_INPUT_TOKEN_LENGTH}). Clear your chat history and try again."
+         )
+
+
+ with gr.Blocks(css="style.css") as demo:
+     gr.Markdown(DESCRIPTION)
+
+     with gr.Group():
+         chatbot = gr.Chatbot(label="Chatbot")
+         with gr.Row():
+             textbox = gr.Textbox(
+                 container=False,
+                 show_label=False,
+                 placeholder="Type a message...",
+                 scale=10,
+             )
+             submit_button = gr.Button("Submit", variant="primary", scale=1, min_width=0)
+     with gr.Row():
+         retry_button = gr.Button("🔄 Retry", variant="secondary")
+         undo_button = gr.Button("↩️ Undo", variant="secondary")
+         clear_button = gr.Button("🗑️ Clear", variant="secondary")
+
+     saved_input = gr.State()
+
+     with gr.Accordion(label="Advanced options", open=False):
+         system_prompt = gr.Textbox(
+             label="System prompt", value=DEFAULT_SYSTEM_PROMPT, lines=6
+         )
+         max_new_tokens = gr.Slider(
+             label="Max new tokens",
+             minimum=1,
+             maximum=MAX_MAX_NEW_TOKENS,
+             step=1,
+             value=DEFAULT_MAX_NEW_TOKENS,
+         )
+         temperature = gr.Slider(
+             label="Temperature",
+             minimum=0.1,
+             maximum=4.0,
+             step=0.1,
+             value=1.0,
+         )
+         top_p = gr.Slider(
+             label="Top-p (nucleus sampling)",
+             minimum=0.05,
+             maximum=1.0,
+             step=0.05,
+             value=0.95,
+         )
+         top_k = gr.Slider(
+             label="Top-k",
+             minimum=1,
+             maximum=1000,
+             step=1,
+             value=50,
+         )
+
+     gr.Examples(
+         examples=[
+             "Hello there! How are you doing?",
+             "Can you explain briefly to me what is the Python programming language?",
+             "Explain the plot of Cinderella in a sentence.",
+             "How many hours does it take a man to eat a Helicopter?",
+             "Write a 100-word article on 'Benefits of Open-Source in AI research'",
+         ],
+         inputs=textbox,
+         outputs=[textbox, chatbot],
+         fn=process_example,
+         cache_examples=True,
+     )
+
+     textbox.submit(
+         fn=clear_and_save_textbox,
+         inputs=textbox,
+         outputs=[textbox, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=display_input,
+         inputs=[saved_input, chatbot],
+         outputs=chatbot,
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=check_input_token_length,
+         inputs=[saved_input, chatbot, system_prompt],
+         api_name=False,
+         queue=False,
+     ).success(
+         fn=generate,
+         inputs=[
+             saved_input,
+             chatbot,
+             system_prompt,
+             max_new_tokens,
+             temperature,
+             top_p,
+             top_k,
+         ],
+         outputs=chatbot,
+         api_name=False,
+     )
+
+     button_event_preprocess = (
+         submit_button.click(
+             fn=clear_and_save_textbox,
+             inputs=textbox,
+             outputs=[textbox, saved_input],
+             api_name=False,
+             queue=False,
+         )
+         .then(
+             fn=display_input,
+             inputs=[saved_input, chatbot],
+             outputs=chatbot,
+             api_name=False,
+             queue=False,
+         )
+         .then(
+             fn=check_input_token_length,
+             inputs=[saved_input, chatbot, system_prompt],
+             api_name=False,
+             queue=False,
+         )
+         .success(
+             fn=generate,
+             inputs=[
+                 saved_input,
+                 chatbot,
+                 system_prompt,
+                 max_new_tokens,
+                 temperature,
+                 top_p,
+                 top_k,
+             ],
+             outputs=chatbot,
+             api_name=False,
+         )
+     )
+
+     retry_button.click(
+         fn=delete_prev_fn,
+         inputs=chatbot,
+         outputs=[chatbot, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=display_input,
+         inputs=[saved_input, chatbot],
+         outputs=chatbot,
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=generate,
+         inputs=[
+             saved_input,
+             chatbot,
+             system_prompt,
+             max_new_tokens,
+             temperature,
+             top_p,
+             top_k,
+         ],
+         outputs=chatbot,
+         api_name=False,
+     )
+
+     undo_button.click(
+         fn=delete_prev_fn,
+         inputs=chatbot,
+         outputs=[chatbot, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=lambda x: x,
+         inputs=[saved_input],
+         outputs=textbox,
+         api_name=False,
+         queue=False,
+     )
+
+     clear_button.click(
+         fn=lambda: ([], ""),
+         outputs=[chatbot, saved_input],
+         queue=False,
+         api_name=False,
+     )
+
+     demo.queue(max_size=20).launch()
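
The submit handler above is built as a Gradio event chain: `clear_and_save_textbox` stashes the message, `display_input` echoes it into the chat, `check_input_token_length` validates, and only on `.success(...)` does the streaming `generate` run. A stripped-down sketch of the same pattern, with toy functions standing in for the app's real handlers:

```python
import gradio as gr


def save(msg):            # stash the message and clear the textbox
    return "", msg

def echo(msg, history):   # show the user's message immediately
    return history + [(msg, "")]

def validate(msg):        # raising gr.Error aborts the .success() step
    if len(msg) > 4000:
        raise gr.Error("Input too long.")

def answer(msg, history): # toy "model": uppercase the message
    return history[:-1] + [(msg, msg.upper())]


with gr.Blocks() as demo:
    chat = gr.Chatbot()
    box = gr.Textbox()
    saved = gr.State()
    box.submit(save, box, [box, saved], queue=False).then(
        echo, [saved, chat], chat, queue=False
    ).then(validate, saved, queue=False).success(
        answer, [saved, chat], chat
    )

demo.launch()
```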
app_4bit_ggml.py ADDED
@@ -0,0 +1,320 @@
+ import argparse
+
+ import os
+ from typing import Iterator
+
+ import gradio as gr
+
+ # from dotenv import load_dotenv
+ from distutils.util import strtobool
+
+ from llama2_wrapper import LLAMA2_WRAPPER
+
+
+ parser = argparse.ArgumentParser()
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
+
+ parser.add_argument('--model_path', type=str, required=True, default='',
+                     help='Path to the model file.')
+
+ parser.add_argument('--system_prompt', type=str, required=False, default=DEFAULT_SYSTEM_PROMPT,
+                     help='System prompt for the chat. Defaults to the built-in prompt above.')
+
+ parser.add_argument('--max_max_new_tokens', type=int, default=2048, metavar='NUMBER',
+                     help='maximum new tokens (default: 2048)')
+
+ FLAGS = parser.parse_args()
+
+
+ DEFAULT_SYSTEM_PROMPT = FLAGS.system_prompt
+ MAX_MAX_NEW_TOKENS = FLAGS.max_max_new_tokens
+
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ MODEL_PATH = FLAGS.model_path
+ assert MODEL_PATH is not None, f"MODEL_PATH is required, got: {MODEL_PATH}"
+
+ LOAD_IN_8BIT = False
+
+ LOAD_IN_4BIT = True
+
+ LLAMA_CPP = True
+
+ if LLAMA_CPP:
+     print("Running on CPU with llama.cpp.")
+ else:
+     import torch
+
+     if torch.cuda.is_available():
+         print("Running on GPU with torch transformers.")
+     else:
+         print("CUDA not found.")
+
+ config = {
+     "model_name": MODEL_PATH,
+     "load_in_8bit": LOAD_IN_8BIT,
+     "load_in_4bit": LOAD_IN_4BIT,
+     "llama_cpp": LLAMA_CPP,
+     "MAX_INPUT_TOKEN_LENGTH": MAX_INPUT_TOKEN_LENGTH,
+ }
+ llama2_wrapper = LLAMA2_WRAPPER(config)
+ llama2_wrapper.init_tokenizer()
+ llama2_wrapper.init_model()
+
+ DESCRIPTION = """
+ # Llama2-Chinese-7b-webui
+
+ This is an inference UI for [Llama2-Chinese-7b](https://github.com/FlagAlpha/Llama2-Chinese).
+ - Supported models: [Llama-2-GGML](https://huggingface.co/FlagAlpha/Llama2-Chinese-7b-Chat-GGML)
+ - Supported backends
+   - CPU (at least 6 GB RAM), Mac/AMD
+ """
+
+
+ def clear_and_save_textbox(message: str) -> tuple[str, str]:
+     return "", message
+
+
+ def display_input(
+     message: str, history: list[tuple[str, str]]
+ ) -> list[tuple[str, str]]:
+     history.append((message, ""))
+     return history
+
+
+ def delete_prev_fn(history: list[tuple[str, str]]) -> tuple[list[tuple[str, str]], str]:
+     try:
+         message, _ = history.pop()
+     except IndexError:
+         message = ""
+     return history, message or ""
+
+
+ def generate(
+     message: str,
+     history_with_input: list[tuple[str, str]],
+     system_prompt: str,
+     max_new_tokens: int,
+     temperature: float,
+     top_p: float,
+     top_k: int,
+ ) -> Iterator[list[tuple[str, str]]]:
+     if max_new_tokens > MAX_MAX_NEW_TOKENS:
+         raise ValueError
+
+     history = history_with_input[:-1]
+     generator = llama2_wrapper.run(
+         message, history, system_prompt, max_new_tokens, temperature, top_p, top_k
+     )
+     try:
+         first_response = next(generator)
+         yield history + [(message, first_response)]
+     except StopIteration:
+         yield history + [(message, "")]
+     for response in generator:
+         yield history + [(message, response)]
+
+
+ def process_example(message: str) -> tuple[str, list[tuple[str, str]]]:
+     generator = generate(message, [], DEFAULT_SYSTEM_PROMPT, 1024, 1, 0.95, 50)
+     for x in generator:
+         pass
+     return "", x
+
+
+ def check_input_token_length(
+     message: str, chat_history: list[tuple[str, str]], system_prompt: str
+ ) -> None:
+     input_token_length = llama2_wrapper.get_input_token_length(
+         message, chat_history, system_prompt
+     )
+     if input_token_length > MAX_INPUT_TOKEN_LENGTH:
+         raise gr.Error(
+             f"The accumulated input is too long ({input_token_length} > {MAX_INPUT_TOKEN_LENGTH}). Clear your chat history and try again."
+         )
+
+
+ with gr.Blocks(css="style.css") as demo:
+     gr.Markdown(DESCRIPTION)
+
+     with gr.Group():
+         chatbot = gr.Chatbot(label="Chatbot")
+         with gr.Row():
+             textbox = gr.Textbox(
+                 container=False,
+                 show_label=False,
+                 placeholder="Type a message...",
+                 scale=10,
+             )
+             submit_button = gr.Button("Submit", variant="primary", scale=1, min_width=0)
+     with gr.Row():
+         retry_button = gr.Button("🔄 Retry", variant="secondary")
+         undo_button = gr.Button("↩️ Undo", variant="secondary")
+         clear_button = gr.Button("🗑️ Clear", variant="secondary")
+
+     saved_input = gr.State()
+
+     with gr.Accordion(label="Advanced options", open=False):
+         system_prompt = gr.Textbox(
+             label="System prompt", value=DEFAULT_SYSTEM_PROMPT, lines=6
+         )
+         max_new_tokens = gr.Slider(
+             label="Max new tokens",
+             minimum=1,
+             maximum=MAX_MAX_NEW_TOKENS,
+             step=1,
+             value=DEFAULT_MAX_NEW_TOKENS,
+         )
+         temperature = gr.Slider(
+             label="Temperature",
+             minimum=0.1,
+             maximum=4.0,
+             step=0.1,
+             value=1.0,
+         )
+         top_p = gr.Slider(
+             label="Top-p (nucleus sampling)",
+             minimum=0.05,
+             maximum=1.0,
+             step=0.05,
+             value=0.95,
+         )
+         top_k = gr.Slider(
+             label="Top-k",
+             minimum=1,
+             maximum=1000,
+             step=1,
+             value=50,
+         )
+
+     gr.Examples(
+         examples=[
+             "Hello there! How are you doing?",
+             "Can you explain briefly to me what is the Python programming language?",
+         ],
+         inputs=textbox,
+         outputs=[textbox, chatbot],
+         fn=process_example,
+         cache_examples=True,
+     )
+
+     textbox.submit(
+         fn=clear_and_save_textbox,
+         inputs=textbox,
+         outputs=[textbox, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=display_input,
+         inputs=[saved_input, chatbot],
+         outputs=chatbot,
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=check_input_token_length,
+         inputs=[saved_input, chatbot, system_prompt],
+         api_name=False,
+         queue=False,
+     ).success(
+         fn=generate,
+         inputs=[
+             saved_input,
+             chatbot,
+             system_prompt,
+             max_new_tokens,
+             temperature,
+             top_p,
+             top_k,
+         ],
+         outputs=chatbot,
+         api_name=False,
+     )
+
+     button_event_preprocess = (
+         submit_button.click(
+             fn=clear_and_save_textbox,
+             inputs=textbox,
+             outputs=[textbox, saved_input],
+             api_name=False,
+             queue=False,
+         )
+         .then(
+             fn=display_input,
+             inputs=[saved_input, chatbot],
+             outputs=chatbot,
+             api_name=False,
+             queue=False,
+         )
+         .then(
+             fn=check_input_token_length,
+             inputs=[saved_input, chatbot, system_prompt],
+             api_name=False,
+             queue=False,
+         )
+         .success(
+             fn=generate,
+             inputs=[
+                 saved_input,
+                 chatbot,
+                 system_prompt,
+                 max_new_tokens,
+                 temperature,
+                 top_p,
+                 top_k,
+             ],
+             outputs=chatbot,
+             api_name=False,
+         )
+     )
+
+     retry_button.click(
+         fn=delete_prev_fn,
+         inputs=chatbot,
+         outputs=[chatbot, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=display_input,
+         inputs=[saved_input, chatbot],
+         outputs=chatbot,
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=generate,
+         inputs=[
+             saved_input,
+             chatbot,
+             system_prompt,
+             max_new_tokens,
+             temperature,
+             top_p,
+             top_k,
+         ],
+         outputs=chatbot,
+         api_name=False,
+     )
+
+     undo_button.click(
+         fn=delete_prev_fn,
+         inputs=chatbot,
+         outputs=[chatbot, saved_input],
+         api_name=False,
+         queue=False,
+     ).then(
+         fn=lambda x: x,
+         inputs=[saved_input],
+         outputs=textbox,
+         api_name=False,
+         queue=False,
+     )
+
+     clear_button.click(
+         fn=lambda: ([], ""),
+         outputs=[chatbot, saved_input],
+         queue=False,
+         api_name=False,
+     )
+
+     demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=8090, share=True)
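
Unlike `app.py`, this entry point takes its configuration from the command line instead of `.env`, e.g. `python app_4bit_ggml.py --model_path /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin` (placeholder path). Its final `launch()` call binds to `0.0.0.0:8090` and opens a public Gradio share link, matching the Space's `app_file` setting in the README front matter.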
benchmark.py ADDED
@@ -0,0 +1,96 @@
+ import os
+ import time
+
+ from dotenv import load_dotenv
+ from distutils.util import strtobool
+
+ from llama2_wrapper import LLAMA2_WRAPPER
+
+
+ def main():
+     load_dotenv()
+
+     DEFAULT_SYSTEM_PROMPT = (
+         os.getenv("DEFAULT_SYSTEM_PROMPT")
+         if os.getenv("DEFAULT_SYSTEM_PROMPT") is not None
+         else ""
+     )
+     MAX_MAX_NEW_TOKENS = (
+         int(os.getenv("MAX_MAX_NEW_TOKENS"))
+         if os.getenv("MAX_MAX_NEW_TOKENS") is not None
+         else 2048
+     )
+     DEFAULT_MAX_NEW_TOKENS = (
+         int(os.getenv("DEFAULT_MAX_NEW_TOKENS"))
+         if os.getenv("DEFAULT_MAX_NEW_TOKENS") is not None
+         else 1024
+     )
+     MAX_INPUT_TOKEN_LENGTH = (
+         int(os.getenv("MAX_INPUT_TOKEN_LENGTH"))
+         if os.getenv("MAX_INPUT_TOKEN_LENGTH") is not None
+         else 4000
+     )
+
+     MODEL_PATH = os.getenv("MODEL_PATH")
+     assert MODEL_PATH is not None, f"MODEL_PATH is required, got: {MODEL_PATH}"
+
+     LOAD_IN_8BIT = bool(strtobool(os.getenv("LOAD_IN_8BIT", "True")))
+
+     LOAD_IN_4BIT = bool(strtobool(os.getenv("LOAD_IN_4BIT", "True")))
+
+     LLAMA_CPP = bool(strtobool(os.getenv("LLAMA_CPP", "True")))
+
+     if LLAMA_CPP:
+         print("Running on CPU with llama.cpp.")
+     else:
+         import torch
+
+         if torch.cuda.is_available():
+             print("Running on GPU with torch transformers.")
+         else:
+             print("CUDA not found.")
+
+     config = {
+         "model_name": MODEL_PATH,
+         "load_in_8bit": LOAD_IN_8BIT,
+         "load_in_4bit": LOAD_IN_4BIT,
+         "llama_cpp": LLAMA_CPP,
+         "MAX_INPUT_TOKEN_LENGTH": MAX_INPUT_TOKEN_LENGTH,
+     }
+
+     tic = time.perf_counter()
+     llama2_wrapper = LLAMA2_WRAPPER(config)
+     llama2_wrapper.init_tokenizer()
+     llama2_wrapper.init_model()
+     toc = time.perf_counter()
+     print(f"Initialized the model in {toc - tic:0.4f} seconds.")
+
+     example = "Can you explain briefly to me what is the Python programming language?"
+
+     generator = llama2_wrapper.run(
+         example, [], DEFAULT_SYSTEM_PROMPT, DEFAULT_MAX_NEW_TOKENS, 1, 0.95, 50
+     )
+     tic = time.perf_counter()
+     try:
+         first_response = next(generator)
+         # history += [(example, first_response)]
+         # print(first_response)
+     except StopIteration:
+         pass
+         # history += [(example, "")]
+     # print(history)
+     for response in generator:
+         # history += [(example, response)]
+         # print(response)
+         pass
+     print(response)
+
+     toc = time.perf_counter()
+     output_token_length = llama2_wrapper.get_token_length(response)
+
+     print(f"Generated the output in {toc - tic:0.4f} seconds.")
+     print(f"Speed: {output_token_length / (toc - tic):0.4f} tokens/sec.")
+
+
+ if __name__ == "__main__":
+     main()
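
The reported speed is plain arithmetic: output tokens divided by wall-clock generation time, timed with `time.perf_counter()`. A worked example with hypothetical numbers (these are illustrative, not measured):

```python
# Worked example of benchmark.py's speed metric (hypothetical numbers).
output_token_length = 376   # tokens counted by llama2_wrapper.get_token_length()
elapsed = 100.0             # seconds between the generation tic and toc
print(f"Speed: {output_token_length / elapsed:0.4f} tokens/sec.")  # 3.7600
```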
docs/performance.md ADDED
@@ -0,0 +1,19 @@
+ # Benchmark Performance
+
+ ## Performance on Nvidia GPU
+
+ | Model                | Precision | Device             | GPU VRAM    | Speed (tokens/sec) | Load time (s) |
+ | -------------------- | --------- | ------------------ | ----------- | ------------------ | ------------- |
+ | Llama-2-7b-chat-hf   | 16 bit    |                    |             |                    |               |
+ | Llama-2-7b-chat-hf   | 8 bit     | NVIDIA RTX 2080 Ti | 7.7 GB VRAM | 3.76               | 783.87        |
+ | Llama-2-7b-Chat-GPTQ | 4 bit     | NVIDIA RTX 2080 Ti | 5.8 GB VRAM | 12.08              | 192.91        |
+ | Llama-2-13b-chat-hf  | 16 bit    |                    |             |                    |               |
+ |                      |           |                    |             |                    |               |
+
+ ## Performance on CPU / OpenBLAS / cuBLAS / CLBlast / Metal
+
+ | Model                | Precision | Device        | RAM / GPU VRAM | Speed (tokens/sec) | Load time (s) |
+ | -------------------- | --------- | ------------- | -------------- | ------------------ | ------------- |
+ | Llama-2-7B-Chat-GGML | 4 bit     | Intel i7-8700 | 5.1 GB RAM     | 4.16               | 105.75        |
+ | Llama-2-7B-Chat-GGML | 4 bit     | Apple M1 CPU  |                |                    |               |
+
env_examples/.env.13b_example ADDED
@@ -0,0 +1,10 @@
+ MODEL_PATH = "/path-to/Llama-2-13b-chat-hf"
+ LOAD_IN_8BIT = False
+ LOAD_IN_4BIT = False
+ LLAMA_CPP = False
+
+ MAX_MAX_NEW_TOKENS = 2048
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
env_examples/.env.7b_8bit_example ADDED
@@ -0,0 +1,10 @@
+ MODEL_PATH = "/path-to/Llama-2-7b-chat-hf"
+ LOAD_IN_8BIT = True
+ LOAD_IN_4BIT = False
+ LLAMA_CPP = False
+
+ MAX_MAX_NEW_TOKENS = 2048
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
env_examples/.env.7b_ggmlv3_q4_0_example ADDED
@@ -0,0 +1,10 @@
+ MODEL_PATH = "/path-to/llama-2-7b-chat.ggmlv3.q4_0.bin"
+ LOAD_IN_8BIT = False
+ LOAD_IN_4BIT = True
+ LLAMA_CPP = True
+
+ MAX_MAX_NEW_TOKENS = 2048
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
env_examples/.env.7b_gptq_example ADDED
@@ -0,0 +1,10 @@
+ MODEL_PATH = "/path-to/Llama-2-7b-Chat-GPTQ"
+ LOAD_IN_8BIT = False
+ LOAD_IN_4BIT = True
+ LLAMA_CPP = False
+
+ MAX_MAX_NEW_TOKENS = 2048
+ DEFAULT_MAX_NEW_TOKENS = 1024
+ MAX_INPUT_TOKEN_LENGTH = 4000
+
+ DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
gradio_cached_examples/19/Chatbot/tmpihfsul2n.json ADDED
@@ -0,0 +1 @@
+ [["Hello there! How are you doing?", "I am an artificial intelligence language model assistant and do not have emotions or feelings as humans do. I am here to help answer any questions in a safe, socially unbiased and positive nature, respectfully. What is on your mind that I can assist you with today?</s></s>"]]
gradio_cached_examples/19/Chatbot/tmpj22ucqjj.json ADDED
@@ -0,0 +1 @@
+ [["Can you explain briefly to me what is the Python programming language?", "The Python programming language is a general-purpose, high-level, and versatile programming language that was developed in the late 1980s by Guido van Rossum. It's widely used for various tasks, including scientific computing, web development, data analysis, artificial intelligence, and more.\n\nPython has built-in support for object-oriented programming, which makes it easy to write elegant and modular code. Its syntax is simple and intuitive, making it an excellent choice for beginners as well as experienced programmers. Furthermore, Python has an extensive standard library that provides functionality for various tasks like reading and writing files, handling exceptions, working with regular expressions, connecting to databases, and more. \n\nOverall, the Python language is widely recognized as a \"megabyte of cheese\" - it's easy on your stomach!</s>"]]
gradio_cached_examples/19/log.csv ADDED
@@ -0,0 +1,3 @@
+ component 0,Chatbot,flag,username,timestamp
+ ,/root/llama2-webui/gradio_cached_examples/19/Chatbot/tmpihfsul2n.json,,,2023-07-29 15:34:58.289613
+ ,/root/llama2-webui/gradio_cached_examples/19/Chatbot/tmpj22ucqjj.json,,,2023-07-29 15:38:58.697165
llama2_wrapper/__init__.py ADDED
@@ -0,0 +1 @@
+ from .model import LLAMA2_WRAPPER, get_prompt
llama2_wrapper/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (203 Bytes).
llama2_wrapper/__pycache__/model.cpython-310.pyc ADDED
Binary file (5.18 kB).
llama2_wrapper/model.py ADDED
@@ -0,0 +1,197 @@
+ # coding:utf-8
+ from threading import Thread
+ from typing import Any, Iterator
+
+
+ class LLAMA2_WRAPPER:
+     def __init__(self, config: dict = {}):
+         self.config = config
+         self.model = None
+         self.tokenizer = None
+
+     def init_model(self):
+         if self.model is None:
+             self.model = LLAMA2_WRAPPER.create_llama2_model(
+                 self.config,
+             )
+         if not self.config.get("llama_cpp"):
+             self.model.eval()
+
+     def init_tokenizer(self):
+         if self.tokenizer is None and not self.config.get("llama_cpp"):
+             self.tokenizer = LLAMA2_WRAPPER.create_llama2_tokenizer(self.config)
+
+     @classmethod
+     def create_llama2_model(cls, config):
+         model_name = config.get("model_name")
+         load_in_8bit = config.get("load_in_8bit", True)
+         load_in_4bit = config.get("load_in_4bit", False)
+         llama_cpp = config.get("llama_cpp", False)
+         if llama_cpp:
+             from llama_cpp import Llama
+
+             model = Llama(
+                 model_path=model_name,
+                 n_ctx=config.get("MAX_INPUT_TOKEN_LENGTH"),
+                 n_batch=config.get("MAX_INPUT_TOKEN_LENGTH"),
+             )
+         elif load_in_4bit:
+             from auto_gptq import AutoGPTQForCausalLM
+
+             model = AutoGPTQForCausalLM.from_quantized(
+                 model_name,
+                 use_safetensors=True,
+                 trust_remote_code=True,
+                 device="cuda:0",
+                 use_triton=False,
+                 quantize_config=None,
+             )
+         else:
+             import torch
+             from transformers import AutoModelForCausalLM
+
+             model = AutoModelForCausalLM.from_pretrained(
+                 model_name,
+                 device_map="auto",
+                 torch_dtype=torch.float16,
+                 load_in_8bit=load_in_8bit,
+             )
+         return model
+
+     @classmethod
+     def create_llama2_tokenizer(cls, config):
+         model_name = config.get("model_name")
+         from transformers import AutoTokenizer
+
+         tokenizer = AutoTokenizer.from_pretrained(model_name)
+         return tokenizer
+
+     def get_token_length(
+         self,
+         prompt: str,
+     ) -> int:
+         if self.config.get("llama_cpp"):
+             input_ids = self.model.tokenize(bytes(prompt, "utf-8"))
+             return len(input_ids)
+         else:
+             input_ids = self.tokenizer([prompt], return_tensors="np")["input_ids"]
+             return input_ids.shape[-1]
+
+     def get_input_token_length(
+         self, message: str, chat_history: list[tuple[str, str]], system_prompt: str
+     ) -> int:
+         prompt = get_prompt(message, chat_history, system_prompt)
+
+         return self.get_token_length(prompt)
+
+     def generate(
+         self,
+         prompt: str,
+         max_new_tokens: int = 1024,
+         temperature: float = 0.8,
+         top_p: float = 0.95,
+         top_k: int = 50,
+     ) -> Iterator[str]:
+         if self.config.get("llama_cpp"):
+             inputs = self.model.tokenize(bytes(prompt, "utf-8"))
+             generate_kwargs = dict(
+                 top_p=top_p,
+                 top_k=top_k,
+                 temp=temperature,
+             )
+
+             generator = self.model.generate(inputs, **generate_kwargs)
+             outputs = []
+             answer_message = ""
+             new_tokens = []
+             for token in generator:
+                 if token != "</s>":
+                     try:
+                         new_tokens.append(token)
+                         b_text = self.model.detokenize(new_tokens)
+                         # b_text = self.model.decode(new_tokens)
+                         answer_message += str(b_text, encoding="utf-8")
+                         new_tokens = []
+                     except Exception:
+                         # likely a partial multi-byte UTF-8 sequence; keep buffering tokens
+                         pass
+                 else:
+                     yield answer_message
+                     break
+
+                 if "Human:" in answer_message:
+                     answer_message = answer_message.split("Human:")[0]
+                     yield answer_message
+                     break
+
+                 if token == self.model.token_eos():
+                     yield answer_message
+                     break
+
+                 yield answer_message
+         else:
+             from transformers import TextIteratorStreamer
+
+             inputs = self.tokenizer([prompt], return_tensors="pt").to("cuda")
+
+             streamer = TextIteratorStreamer(
+                 self.tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True
+             )
+             generate_kwargs = dict(
+                 inputs,
+                 streamer=streamer,
+                 max_new_tokens=max_new_tokens,
+                 do_sample=True,
+                 top_p=top_p,
+                 top_k=top_k,
+                 temperature=temperature,
+                 num_beams=1,
+             )
+             t = Thread(target=self.model.generate, kwargs=generate_kwargs)
+             t.start()
+
+             outputs = []
+             for text in streamer:
+                 outputs.append(text)
+                 yield "".join(outputs)
+
+     def run(
+         self,
+         message: str,
+         chat_history: list[tuple[str, str]],
+         system_prompt: str,
+         max_new_tokens: int = 1024,
+         temperature: float = 0.3,
+         top_p: float = 0.95,
+         top_k: int = 50,
+     ) -> Iterator[str]:
+         prompt = get_prompt(message, chat_history, system_prompt)
+         return self.generate(prompt, max_new_tokens, temperature, top_p, top_k)
+
+     def __call__(
+         self,
+         prompt: str,
+         **kwargs: Any,
+     ) -> str:
+         if self.config.get("llama_cpp"):
+             return self.model.__call__(prompt, **kwargs)["choices"][0]["text"]
+         else:
+             inputs = self.tokenizer([prompt], return_tensors="pt").input_ids.to("cuda")
+             output = self.model.generate(inputs=inputs, **kwargs)
+             return self.tokenizer.decode(output[0])
+
+
+ def get_prompt(
+     message: str, chat_history: list[tuple[str, str]], system_prompt: str
+ ) -> str:
+     prompt = ""
+     for user_input, response in chat_history:
+         prompt += "<s>Human: " + user_input.strip() + "\n</s><s>Assistant: " + response.strip() + "\n</s>"
+
+     prompt += "<s>Human: " + message.strip() + "\n</s><s>Assistant: "
+     prompt = prompt[-2048:]
+
+     if len(system_prompt) > 0:
+         prompt = "<s>System: " + system_prompt.strip() + "\n</s>" + prompt
+     return prompt
+
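
Note that `get_prompt` builds the `<s>Human:` / `<s>Assistant:` turn format used by the Llama2-Chinese models this Space serves, rather than Meta's `[INST]` chat template. A quick check of what it produces:

```python
from llama2_wrapper import get_prompt

# Illustrative inputs; output derived by tracing get_prompt above.
print(get_prompt(
    "How are you?",
    [("Hi!", "Hello, how can I help?")],
    "You are a helpful assistant.",
))
# <s>System: You are a helpful assistant.
# </s><s>Human: Hi!
# </s><s>Assistant: Hello, how can I help?
# </s><s>Human: How are you?
# </s><s>Assistant:
```

Also note the `prompt = prompt[-2048:]` line: history is truncated to the last 2048 characters before the system prompt is prepended.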
poetry.lock ADDED
The diff for this file is too large to render.
pyproject.toml ADDED
@@ -0,0 +1,33 @@
+ [tool.poetry]
+ name = "llama2-wrapper"
+ version = "0.1.3"
+ description = "Running Llama 2 on GPU or CPU from anywhere (Linux/Windows/Mac)."
+ authors = ["liltom-eth <[email protected]>"]
+ license = "MIT"
+ homepage = "https://github.com/liltom-eth/llama2-webui"
+ repository = "https://github.com/liltom-eth/llama2-webui"
+
+ packages = [{include = "llama2_wrapper"}]
+
+ [tool.poetry.dependencies]
+ python = ">=3.10,<3.13"
+ accelerate = "^0.21.0"
+ auto-gptq = "0.3.0"
+ gradio = "3.37.0"
+ protobuf = "3.20.3"
+ scipy = "1.11.1"
+ sentencepiece = "0.1.99"
+ torch = "2.0.1"
+ transformers = "4.31.0"
+ tqdm = "4.65.0"
+ python-dotenv = "1.0.0"
+ llama-cpp-python = "^0.1.77"
+ bitsandbytes = [
+     {platform = 'linux', version = "0.40.2"},
+     {platform = 'darwin', version = "0.40.2"},
+ ]
+
+
+ [build-system]
+ requires = ["poetry-core"]
+ build-backend = "poetry.core.masonry.api"
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ accelerate==0.21.0
+ auto-gptq==0.3.0
+ bitsandbytes==0.40.2
+ gradio==3.37.0
+ protobuf==3.20.3
+ scipy==1.11.1
+ sentencepiece==0.1.99
+ torch==2.0.1
+ transformers==4.31.0
+ tqdm==4.65.0
+ python-dotenv==1.0.0
+ llama-cpp-python==0.1.77
static/screenshot.png ADDED
tests/__init__.py ADDED
File without changes