---
language:
- en
license: llama3
tags:
- Llama-3
- instruct
- finetune
- chatml
- gpt4
- synthetic data
- distillation
- function calling
- json mode
- axolotl
- roleplaying
- chat
base_model: meta-llama/Meta-Llama-3.1-8B
widget:
- example_title: Example 1
  messages:
  - role: system
    content: >-
      You are a sentient, superintelligent artificial general intelligence, here
      to teach and assist me.
  - role: user
    content: What is the meaning of life?
model-index:
- name: VinkuraAI/Kuno-K1-Llama-3.1-8B
  results: []
library_name: transformers
---

# Kuno-K1 - Llama-3.1 8B

## Model Description

Kuno-K1 is a large language model (LLM) developed by Vinkura AI, built for strong conversational ability, problem solving, and informative responses. It builds on its predecessors with improved performance and versatility.

## Architecture

Kuno-K1 is based on the Llama-3.1 architecture, fine-tuned from `meta-llama/Meta-Llama-3.1-8B`, and relies on that framework for language understanding and generation.

## Prompt Format

Kuno-K1 uses ChatML, a structured prompt format for multi-turn conversation. ChatML also provides OpenAI-endpoint compatibility: anyone familiar with the ChatGPT API will recognize the format, as it is the same one used by OpenAI.

Prompt with system instruction (use whatever system prompt you like, this is just an example!):

```
<|im_start|>system
You are Kuno-K1, a highly advanced artificial intelligence designed to provide exceptional assistance and insightful responses. Your purpose is to help users with their queries, while maintaining a friendly and informative demeanor.<|im_end|>
<|im_start|>user
Hello, who are you?<|im_end|>
<|im_start|>assistant
Hi! I'm Kuno-K1, your reliable AI companion. I'm here to provide information, answer questions, and engage in productive conversations to assist you in any way I can.<|im_end|>
```

This prompt is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating), which means you can format messages using the `tokenizer.apply_chat_template()` method:

```python
messages = [
    {"role": "system", "content": "You are Kuno K1."},
    {"role": "user", "content": "Hello, who are you?"}
]
gen_input = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
model.generate(**gen_input)
```

When tokenizing messages for generation, set `add_generation_prompt=True` when calling `apply_chat_template()`. This appends `<|im_start|>assistant\n` to your prompt, ensuring that the model continues with an assistant response.

To use the prompt format without a system prompt, simply leave out the system line.

## Prompt Format for Function Calling

The model was trained on specific system prompts and structures for function calling. Use the system role with a message like the one below, which lists the available tools as JSON function signatures:

```
<|im_start|>system
You are a function-calling AI model that utilizes tools within XML tags. Your task is to call one or more functions to assist with user queries without making assumptions about input values.

Available Tools:
{
  "type": "function",
  "function": {
    "name": "get_stock_fundamentals",
    "description": "get_stock_fundamentals(symbol: str) -> dict",
    "parameters": {
      "type": "object",
      "properties": {
        "symbol": {"type": "string"}
      },
      "required": ["symbol"]
    }
  }
}

Function Signature:
get_stock_fundamentals(symbol: str) -> dict
* Retrieves fundamental data for a given stock symbol using the yfinance API.

Parameters:
- symbol (str): The stock symbol.

Returns:
- dict: A dictionary containing fundamental data, including:
  - symbol
  - company_name
  - sector
  - industry
  - market_cap
  - pe_ratio
  - pb_ratio
  - dividend_yield
  - eps
  - beta
  - 52_week_high
  - 52_week_low

JSON Schema for Tool Calls:
{
  "properties": {
    "arguments": {"title": "Arguments", "type": "object"},
    "name": {"title": "Name", "type": "string"}
  },
  "required": ["arguments", "name"],
  "title": "FunctionCall",
  "type": "object"
}

Return Format:
{"arguments": <args-dict>, "name": <function-name>}
<|im_end|>
```
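The card stops at the prompt format itself, so the following is a minimal, hypothetical sketch of one function-calling round trip under that format. The `SYSTEM_PROMPT` placeholder, the example user query, and the regex/JSON parsing of the reply are illustrative assumptions rather than released tooling; `model` and `tokenizer` are assumed to be loaded as shown in the Inference section below.

```python
# Hypothetical sketch of one function-calling round trip (not part of the released tooling).
# Assumes `model` and `tokenizer` are loaded as in the Inference section below.
import json
import re

# Paste the full function-calling system prompt from the example above here.
SYSTEM_PROMPT = "You are a function-calling AI model that utilizes tools within XML tags. ..."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Fetch the stock fundamentals for TSLA."},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Assumed parsing step: pull the first JSON object matching the FunctionCall schema
# ({"arguments": ..., "name": ...}) out of the model's reply.
match = re.search(r"\{.*\}", reply, re.DOTALL)
if match:
    call = json.loads(match.group(0))
    print(call["name"], call["arguments"])
else:
    print("No tool call found in:", reply)
```

Your application would then typically execute the named function and pass its result back to the model in a follow-up turn.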
## Inference

Here is example code using HuggingFace Transformers to run inference with the model:

```python
# Inference with Kuno-K1 using HF Transformers
# Requires the torch, transformers, bitsandbytes, sentencepiece, protobuf, and flash-attn packages
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained('VinkuraAI/Kuno-K1-Llama-3.1-8B', trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(
    "VinkuraAI/Kuno-K1-Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_8bit=False,
    load_in_4bit=True,
    use_flash_attention_2=True
)

prompts = [
    """<|im_start|>system
You are a mystical, avant-garde storyteller, weaving tales of wonder and chaos.<|im_end|>
<|im_start|>user
Write a surreal short story about a lone wolf discovering a hidden vinyl record that summons an otherworldly jazz band to save the city from an existential threat.<|im_end|>
<|im_start|>assistant""",
]

for chat in prompts:
    print(chat)
    input_ids = tokenizer(chat, return_tensors="pt").input_ids.to("cuda")
    generated_ids = model.generate(
        input_ids,
        max_new_tokens=750,
        temperature=0.8,
        repetition_penalty=1.1,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(
        generated_ids[0][input_ids.shape[-1]:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    print(f"Response: {response}")
```

You can also run this model with vLLM by running the following in your terminal after `pip install vllm`:

`vllm serve VinkuraAI/Kuno-K1-Llama-3.1-8B`

A minimal client example is sketched after the benchmark results below.

## Benchmark Results

| Benchmark | Score |
|---|---|
| Average | 23.49 |
| IFEval (0-shot) | 61.70 |
| BBH (3-shot) | 30.72 |
| MATH Lvl 5 (4-shot) | 4.76 |
| GPQA (0-shot) | 6.38 |
| MuSR (0-shot) | 13.62 |
| MMLU-PRO (5-shot) | 23.77 |
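As a rough complement to the `vllm serve` command above, here is a minimal sketch of querying the resulting server through its OpenAI-compatible endpoint. The default port (8000), the `openai` Python client, and the placeholder API key are assumptions about a stock vLLM setup rather than anything shipped with this repository.

```python
# Hypothetical client for a local `vllm serve VinkuraAI/Kuno-K1-Llama-3.1-8B` instance.
# Assumes vLLM's default port (8000) and the `openai` Python package.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # key is unused by default

response = client.chat.completions.create(
    model="VinkuraAI/Kuno-K1-Llama-3.1-8B",
    messages=[
        {"role": "system", "content": "You are Kuno-K1, a helpful assistant."},
        {"role": "user", "content": "Hello, who are you?"},
    ],
    max_tokens=256,
    temperature=0.8,
)
print(response.choices[0].message.content)
```

vLLM applies the model's chat template server-side, so the ChatML formatting described above does not need to be reproduced in the client.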