---
base_model: unsloth/meta-llama-3.1-8b-bnb-4bit
language:
- en
license: apache-2.0
datasets:
- Salesforce/xlam-function-calling-60k
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
---

# Meta-Llama-3.1-8B-Unsloth-2x-faster-finetuning-GGUF-by-skk

**Model Description**

This model is a fine-tuned version of Meta-Llama-3.1-8B, optimized for faster inference and efficient model adaptation. Fine-tuning was performed using Unsloth, Low-Rank Adaptation (LoRA), and 4-bit quantization. It was fine-tuned on the Salesforce/xlam-function-calling-60k dataset for function-calling prompts: given a list of tool definitions and a user query, it generates the answers that complete the request.

- **Developed by:** Shailesh Kumar Khanchandani
- **Shared by:** Shailesh Kumar Khanchandani
- **Model type:** Causal Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** Meta-Llama-3.1-8B

This repository contains the Meta-Llama-3.1-8B-Unsloth-2x-faster-finetuning-GGUF model, optimized for faster inference.

## Getting Started

Use the following Python code to get started with the model:

```python
# Run in a Colab/Jupyter notebook: lines starting with "!" are shell commands.
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

# Define the dtype you want to use
dtype = torch.float16  # Example: using float16 for lower memory usage

# Set load_in_4bit to True or False depending on your requirements
load_in_4bit = True  # Or False if you don't want to load in 4-bit

# Verify the model name is correct and exists on the Hugging Face Model Hub
model_name = "skkjodhpur/Meta-Llama-3.1-8B-Unsloth-2x-faster-finetuning-GGUF-by-skk"

# Optional check that the model exists (requires curl and jq);
# adjust model_name if this fails
!curl -s https://huggingface.co/{model_name}/resolve/main/config.json | jq .

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# The prompt must match the template used for fine-tuning, so copy it
# verbatim (including its wording).
prompt = """Below is an tools that describes a task, paired with an query that provides further context. Write a answers that appropriately completes the request.

### tools:
{}

### query:
{}

### answers:
{}"""

inputs = tokenizer(
    [
        prompt.format(
            '[{"name": "live_giveaways_by_type", "description": "Retrieve live giveaways from the GamerPower API based on the specified type.", "parameters": {"type": {"description": "The type of giveaways to retrieve (e.g., game, loot, beta).", "type": "str", "default": "game"}}}]',  # tools
            "Where can I find live giveaways for beta access and games?",  # query
            "",  # answers - leave this blank for generation!
        )
    ],
    return_tensors = "pt",
).to("cuda")

# Stream the generated tokens to stdout as they are produced
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
```

**Usage**

To use the model, run the code above in a notebook: it installs the required packages, loads the model and tokenizer, and streams a generated answer for the example query. For any issues or questions, please open an issue in the repository.

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
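
The `tools` and `query` slots of the prompt template take plain strings, so a tool list held as Python objects has to be serialized before formatting, and the generated text has to be split off from the prompt afterwards. The following is a minimal sketch of both steps, not part of the model's own tooling. `build_prompt` and `parse_answers` are hypothetical helper names; the line breaks in the template and the assumption that the model emits a JSON list of calls (as in the xlam-function-calling-60k dataset) are not confirmed by this card.

```python
import json

# Template repeated from Getting Started so the sketch runs on its own.
# Assumption: blank-line layout follows the Alpaca-style template.
PROMPT = """Below is an tools that describes a task, paired with an query that provides further context. Write a answers that appropriately completes the request.

### tools:
{}

### query:
{}

### answers:
{}"""


def build_prompt(tools, query):
    """Serialize the tool list to JSON and fill the template.

    The answers slot is left empty so the model generates it.
    """
    return PROMPT.format(json.dumps(tools), query, "")


def parse_answers(generated_text):
    """Return the JSON value after the last '### answers:' marker, or None.

    Assumption: the model answers with JSON. Trailing text such as an
    end-of-text token is ignored by decoding only the first JSON value.
    """
    tail = generated_text.rsplit("### answers:", 1)[-1].strip()
    try:
        value, _end = json.JSONDecoder().raw_decode(tail)
    except json.JSONDecodeError:
        return None
    return value
```

The string returned by `build_prompt` can be passed to `tokenizer([...], return_tensors = "pt")` in place of the hand-written `prompt.format(...)` call, and `parse_answers` can be applied to the decoded output of `model.generate`.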