--- base_model: - meta-llama/Llama-3.2-1B-Instruct language: - en library_name: transformers license: llama3.2 tags: - llama-3 - llama - meta - facebook - transformers - llama-cpp --- # SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF ### Install the package Run one of the following commands, according to your system: ```shell # Base ctransformers with no GPU acceleration pip install llama-cpp-python # With NVidia CUDA acceleration CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python # Or with OpenBLAS acceleration CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python # Or with CLBLast acceleration CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python # Or with AMD ROCm GPU acceleration (Linux only) CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python # Or with Metal GPU acceleration for macOS systems only CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python # In windows, to set the variables CMAKE_ARGS in PowerShell, follow this format; eg for NVidia CUDA: $env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on" pip install llama-cpp-python ``` ## For Inference ## Download ``` from huggingface_hub import hf_hub_download REPO_ID = "SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF" FILENAME = "llama-3.2-1b-instruct-q4_k_m.gguf" hf_hub_download(repo_id=REPO_ID, filename=FILENAME,local_dir="./") ``` ### Code: ``` from llama_cpp import Llama # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system. llm = Llama( model_path="./llama-3.2-1b-instruct-q4_k_m.gguf", # Download the model file first n_ctx=4096, n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available ) prompt = """ <|start_header_id|>system<|end_header_id|> You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose. If none of the functions can be used, point it out. If the given question lacks the parameters required by the function,also point it out. You should only return the function call in tools call sections. If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)] You SHOULD NOT include any other text in the response. Here is a list of functions in JSON format that you can invoke.[ { "name": "get_user_info", "description": "Retrieve details for a specific user by their unique identifier. Note that the provided function is in Python 3 syntax.", "parameters": { "type": "dict", "required": [ "user_id" ], "properties": { "user_id": { "type": "integer", "description": "The unique identifier of the user. It is used to fetch the specific user details from the database." }, "special": { "type": "string", "description": "Any special information or parameters that need to be considered while fetching user details.", "default": "none" } } } } ] <|eot_id|><|start_header_id|>user<|end_header_id|> Can you retrieve the details for the user with the ID 7890, who has black as their special request?<|eot_id|><|start_header_id|>assistant<|end_header_id|> """ output = llm(prompt, max_tokens=100, temperature=0.001 ) ``` ## Use with llama.cpp Install llama.cpp through brew (works on Mac and Linux) ```bash brew install llama.cpp ``` Invoke the llama.cpp server or the CLI. ### CLI: ```bash llama-cli --hf-repo SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file llama-3.2-1b-instruct-q4_k_m.gguf -p "The meaning to life and the universe is" ``` ### Server: ```bash llama-server --hf-repo SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file llama-3.2-1b-instruct-q4_k_m.gguf -c 2048 ``` Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo as well. Step 1: Clone llama.cpp from GitHub. ``` git clone https://github.com/ggerganov/llama.cpp ``` Step 2: Move into the llama.cpp folder and build it with `LLAMA_CURL=1` flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux). ``` cd llama.cpp && LLAMA_CURL=1 make ``` Step 3: Run inference through the main binary. ``` ./llama-cli --hf-repo SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file llama-3.2-1b-instruct-q4_k_m.gguf -p "The meaning to life and the universe is" ``` or ``` ./llama-server --hf-repo SalmanFaroz/Llama-3.2-1B-Instruct-Q4_K_M-GGUF --hf-file llama-3.2-1b-instruct-q4_k_m.gguf -c 2048 ```