	---
	license: cc-by-nc-4.0
	base_model: google/gemma-2b
	model-index:
	- name: Octopus-V2-2B
	  results: []
	tags:
	- function calling
	- on-device language model
	- android
	inference: false
	space: false
	spaces: false
	language:
	- en
	---
	# Quantized Octopus V2: On-device language model for super agent

	This repo contains two types of quantized models, GGUF and AWQ, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2).

	<p align="center" width="100%">
	<a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
	</p>


	# GGUF Quantization

	To run the models, first download them to your local machine using either `git clone` or the [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/en/guides/download) client:
	```bash
	git clone https://huggingface.co/NexaAIDev/Octopus-v2-gguf-awq
	```
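If you only need one quantized file rather than the whole repository, the Hugging Face Hub route mentioned above can be sketched in Python as follows. This is a minimal sketch, not the authors' script: it assumes the `huggingface_hub` package is installed, and the default file name is taken from the commands later in this README, so check the repository file list for the exact name. The `models` target directory is a hypothetical choice.

```python
from pathlib import Path

# Repository named in the `git clone` command above.
REPO_ID = "NexaAIDev/Octopus-v2-gguf-awq"


def download_gguf(filename: str = "octopus-v2-Q4_K_M.gguf",
                  local_dir: str = "models") -> Path:
    """Download a single file from the repo and return its local path.

    The default file name is an assumption based on the example commands
    in this README; `local_dir` is a hypothetical target directory.
    """
    # Imported here so the module can be loaded without the package;
    # install it with `pip install huggingface_hub`.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)
    return Path(path)
```

Calling `download_gguf()` fetches the file over the network and returns the path to pass to llama.cpp or Ollama.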

	## Run with [llama.cpp](https://github.com/ggerganov/llama.cpp) (Recommended)

	1. Clone and compile:

	```bash
	git clone https://github.com/ggerganov/llama.cpp
	cd llama.cpp
	# Compile the source code:
	make
	```

	2. Run the model:

	Run the following command in the terminal:

	```bash
	./main -m ./path/to/octopus-v2-Q4_K_M.gguf -n 256 -p "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
	```
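The same invocation can be driven from Python, which is convenient when you want to send several queries. The sketch below is illustrative rather than part of the original workflow: `build_prompt`, `llama_cpp_command`, and `run_llama_cpp` are hypothetical helper names, and `./main` is the binary name used in this README (your llama.cpp build may name it differently, so it is a parameter). The prompt template and the `-m`, `-n`, `-p` flags are copied from the command above.

```python
import subprocess
from typing import List

# Prompt template copied from the llama.cpp command in this README.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query}\n\nResponse:"
)


def build_prompt(query: str) -> str:
    """Wrap a user query in the function-calling prompt."""
    return PROMPT_TEMPLATE.format(query=query)


def llama_cpp_command(model_path: str, query: str,
                      n_predict: int = 256, binary: str = "./main") -> List[str]:
    """Build the argument list for the llama.cpp binary (hypothetical helper)."""
    return [binary, "-m", model_path, "-n", str(n_predict), "-p", build_prompt(query)]


def run_llama_cpp(model_path: str, query: str) -> str:
    """Run llama.cpp once and return everything it wrote to stdout."""
    result = subprocess.run(
        llama_cpp_command(model_path, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Because the arguments are passed as a list without a shell, the prompt reaches llama.cpp with real newline characters instead of the two-character `\n` sequence.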

	## Run with [Ollama](https://github.com/ollama/ollama)

	Since our models have not been uploaded to the Ollama server, please download the models and manually import them into Ollama by following these steps:

	1. Install Ollama on your local machine. You can also follow the import guide in the [Ollama GitHub repository](https://github.com/ollama/ollama/blob/main/docs/import.md).

	```bash
	git clone https://github.com/ollama/ollama.git ollama
	```

	2. Change into the local Ollama directory:
	```bash
	cd ollama
	```

	3. Create a `Modelfile` in that directory:
	```bash
	touch Modelfile
	```

	4. In the `Modelfile`, add a `FROM` statement with the path to your local model, followed by any default parameters you want to set:

	```bash
	FROM ./path/to/octopus-v2-Q4_K_M.gguf
	```

	5. Use the following command to add the model to Ollama:
	```bash
	ollama create octopus-v2-Q4_K_M -f Modelfile
	```

	6. Verify that the model has been successfully imported:
	```bash
	ollama ls
	```

	7. Run the model:
	```bash
	ollama run octopus-v2-Q4_K_M "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
	```
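Steps 3, 4, and 7 can also be scripted. The sketch below writes a `Modelfile` and then queries the imported model through Ollama's local REST endpoint (`/api/generate` on port 11434) using only the standard library. It assumes an Ollama server is running locally and that the model was created under the name used above; `render_modelfile`, `write_modelfile`, `build_request`, and `generate` are hypothetical helper names, and any `PARAMETER` values you pass (for example `num_predict` or `temperature`) are illustrative choices, not defaults published for this model.

```python
import json
import urllib.request
from pathlib import Path
from typing import Dict, Optional

# Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def render_modelfile(model_path: str,
                     parameters: Optional[Dict[str, object]] = None) -> str:
    """Return Modelfile text: a FROM line plus optional PARAMETER lines."""
    lines = [f"FROM {model_path}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


def write_modelfile(model_path: str, directory: str = ".",
                    parameters: Optional[Dict[str, object]] = None) -> Path:
    """Write the Modelfile into `directory` and return its path."""
    target = Path(directory) / "Modelfile"
    target.write_text(render_modelfile(model_path, parameters), encoding="utf-8")
    return target


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

For example, `write_modelfile("./path/to/octopus-v2-Q4_K_M.gguf")` reproduces the `FROM` line from step 4; after `ollama create`, `generate("octopus-v2-Q4_K_M", prompt)` returns the model's reply as a string.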

	# AWQ Quantization
	Python example:

	```python
	from transformers import AutoTokenizer
	from awq import AutoAWQForCausalLM
	import torch
	import time
	import numpy as np


	def inference(input_text):
	    """Generate a response and report latency (s) and throughput (tokens/s)."""
	    start_time = time.time()
	    input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
	    input_length = input_ids["input_ids"].shape[1]
	    generation_output = model.generate(
	        input_ids["input_ids"],
	        do_sample=False,  # greedy decoding
	        max_length=1024,
	    )
	    end_time = time.time()
	    # Decode only the generated part (skip the prompt tokens)
	    generated_sequence = generation_output[:, input_length:].tolist()
	    res = tokenizer.decode(generated_sequence[0])
	    latency = end_time - start_time
	    num_output_tokens = len(generated_sequence[0])
	    throughput = num_output_tokens / latency
	    return {"output": res, "latency": latency, "throughput": throughput}


	# Initialize tokenizer and model
	model_id = "/path/to/Octopus-v2-AWQ-NexaAIDev"
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
	model = AutoAWQForCausalLM.from_quantized(
	    model_id, fuse_layers=True, trust_remote_code=False, safetensors=True
	)

	prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

	avg_throughput = []
	for prompt in prompts:
	    out = inference(prompt)
	    avg_throughput.append(out["throughput"])
	    print("nexa model result:\n", out["output"])

	# Mean throughput across all prompts
	print("avg throughput:", np.mean(avg_throughput))
	```

	# Quantized GGUF & AWQ Models Benchmark

	| Name                   | Quant method | Bits | Size     | Response (tokens/s) | Use Cases                           |
	| ---------------------- | ------------ | ---- | -------- | ------------------- | ----------------------------------- |
	| Octopus-v2-AWQ         | AWQ          | 4    | 3.00 GB  | 63.83               | fast, high quality, recommended     |
	| Octopus-v2-Q2_K.gguf   | Q2_K         | 2    | 1.16 GB  | 57.81               | fast but high loss, not recommended |
	| Octopus-v2-Q3_K.gguf   | Q3_K         | 3    | 1.38 GB  | 57.81               | extremely not recommended           |
	| Octopus-v2-Q3_K_S.gguf | Q3_K_S       | 3    | 1.19 GB  | 52.13               | extremely not recommended           |
	| Octopus-v2-Q3_K_M.gguf | Q3_K_M       | 3    | 1.38 GB  | 58.67               | moderate loss, not very recommended |
	| Octopus-v2-Q3_K_L.gguf | Q3_K_L       | 3    | 1.47 GB  | 56.92               | not very recommended                |
	| Octopus-v2-Q4_0.gguf   | Q4_0         | 4    | 1.55 GB  | 68.80               | moderate speed, recommended         |
	| Octopus-v2-Q4_1.gguf   | Q4_1         | 4    | 1.68 GB  | 68.09               | moderate speed, recommended         |
	| Octopus-v2-Q4_K.gguf   | Q4_K         | 4    | 1.63 GB  | 64.70               | moderate speed, recommended         |
	| Octopus-v2-Q4_K_S.gguf | Q4_K_S       | 4    | 1.56 GB  | 62.16               | fast and accurate, very recommended |
	| Octopus-v2-Q4_K_M.gguf | Q4_K_M       | 4    | 1.63 GB  | 64.74               | fast, recommended                   |
	| Octopus-v2-Q5_0.gguf   | Q5_0         | 5    | 1.80 GB  | 64.80               | fast, recommended                   |
	| Octopus-v2-Q5_1.gguf   | Q5_1         | 5    | 1.92 GB  | 63.42               | very big, prefer Q4                 |
	| Octopus-v2-Q5_K.gguf   | Q5_K         | 5    | 1.84 GB  | 61.28               | big, recommended                    |
	| Octopus-v2-Q5_K_S.gguf | Q5_K_S       | 5    | 1.80 GB  | 62.16               | big, recommended                    |
	| Octopus-v2-Q5_K_M.gguf | Q5_K_M       | 5    | 1.71 GB  | 61.54               | big, recommended                    |
	| Octopus-v2-Q6_K.gguf   | Q6_K         | 6    | 2.06 GB  | 55.94               | very big, not very recommended      |
	| Octopus-v2-Q8_0.gguf   | Q8_0         | 8    | 2.67 GB  | 56.35               | very big, not very recommended      |
	| Octopus-v2-f16.gguf    | f16          | 16   | 5.02 GB  | 36.27               | extremely big                       |
	| Octopus-v2.gguf        |              |      | 10.00 GB |                     |                                     |

	_Quantized with llama.cpp_
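To pick a file for a device with a fixed storage budget, the benchmark figures can be filtered programmatically. The sketch below copies the size and throughput columns from the table into a small list and returns the fastest entry that fits a given size. `Quant`, `BENCHMARK`, and `fastest_within` are hypothetical names, and the ranking considers speed only, so cross-check the result against the "Use Cases" column, which reflects quality loss.

```python
from typing import List, NamedTuple, Optional


class Quant(NamedTuple):
    name: str
    bits: int
    size_gb: float
    tokens_per_s: float


# Rows copied from the benchmark table (the unmeasured Octopus-v2.gguf row is omitted).
BENCHMARK: List[Quant] = [
    Quant("Octopus-v2-AWQ", 4, 3.00, 63.83),
    Quant("Octopus-v2-Q2_K.gguf", 2, 1.16, 57.81),
    Quant("Octopus-v2-Q3_K.gguf", 3, 1.38, 57.81),
    Quant("Octopus-v2-Q3_K_S.gguf", 3, 1.19, 52.13),
    Quant("Octopus-v2-Q3_K_M.gguf", 3, 1.38, 58.67),
    Quant("Octopus-v2-Q3_K_L.gguf", 3, 1.47, 56.92),
    Quant("Octopus-v2-Q4_0.gguf", 4, 1.55, 68.80),
    Quant("Octopus-v2-Q4_1.gguf", 4, 1.68, 68.09),
    Quant("Octopus-v2-Q4_K.gguf", 4, 1.63, 64.70),
    Quant("Octopus-v2-Q4_K_S.gguf", 4, 1.56, 62.16),
    Quant("Octopus-v2-Q4_K_M.gguf", 4, 1.63, 64.74),
    Quant("Octopus-v2-Q5_0.gguf", 5, 1.80, 64.80),
    Quant("Octopus-v2-Q5_1.gguf", 5, 1.92, 63.42),
    Quant("Octopus-v2-Q5_K.gguf", 5, 1.84, 61.28),
    Quant("Octopus-v2-Q5_K_S.gguf", 5, 1.80, 62.16),
    Quant("Octopus-v2-Q5_K_M.gguf", 5, 1.71, 61.54),
    Quant("Octopus-v2-Q6_K.gguf", 6, 2.06, 55.94),
    Quant("Octopus-v2-Q8_0.gguf", 8, 2.67, 56.35),
    Quant("Octopus-v2-f16.gguf", 16, 5.02, 36.27),
]


def fastest_within(budget_gb: float,
                   rows: List[Quant] = BENCHMARK) -> Optional[Quant]:
    """Return the highest-throughput entry whose size fits the budget, or None."""
    candidates = [row for row in rows if row.size_gb <= budget_gb]
    return max(candidates, key=lambda row: row.tokens_per_s, default=None)
```

With a 1.6 GB budget, `fastest_within(1.6)` returns the `Octopus-v2-Q4_0.gguf` row; a budget below 1.16 GB returns `None` because no listed file fits.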


	Acknowledgement:
	We sincerely thank our community members [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), and [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.