Meta-Llama-2-7b-chat-hf-Quantized: a collection of quantized versions of Meta's Llama-2-7b-chat-hf model.
This repo contains a 5-bit (5.0 bpw) quantized version of Meta's meta-llama/Llama-2-7b-chat-hf, produced with ExLlamaV2.
Use the code below to get started with the model.
# Install ExLlamaV2
!git clone https://github.com/turboderp/exllamav2
!pip install -e exllamav2
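If you do not need a source checkout, exllamav2 is also published on PyPI; installing the released package is an alternative, assuming a wheel compatible with your Torch/CUDA setup is available:
# Alternative: install the released package instead of building from source
!pip install exllamav2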
from huggingface_hub import login
# Define the model ID for the desired model
model_id = "alokabhishek/Llama-2-7b-chat-hf-5.0-bpw-exl2"
BPW = 5.0
# define variables
model_name = model_id.split("/")[-1]
!git lfs install
# Download the model to a local directory (set your Hugging Face username and access token first)
username = "<your_hf_username>"      # placeholder: your Hugging Face username
HF_TOKEN = "<your_hf_access_token>"  # placeholder: an access token with read permission
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id} {model_name}
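As an alternative to git-lfs, the same files can be pulled with huggingface_hub; a minimal sketch, assuming the repo is public (call login(token=HF_TOKEN) first if it is gated or private):
from huggingface_hub import snapshot_download
# Download all files from the repo into a local folder named after the model
snapshot_download(repo_id=model_id, local_dir=model_name)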
# Run model
!python exllamav2/test_inference.py -m {model_name}/ -p "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
# exllamav2 was installed with pip above, so it can be imported directly (no sys.path changes needed)
from exllamav2 import (
ExLlamaV2,
ExLlamaV2Config,
ExLlamaV2Cache,
ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
import time
# Initialize model and cache (point model_directory at the folder downloaded above)
model_directory = "/model_path/Llama-2-7b-chat-hf-5.0-bpw-exl2/"
print("Loading model: " + model_directory)
config = ExLlamaV2Config(model_directory)
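# Optional tweak (an assumption, not part of the original card): cap the context length
# before loading to reduce VRAM use, e.g.
# config.max_seq_len = 2048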
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
# Generate some text
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])  # prevent early stopping at the EOS token
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
max_new_tokens = 512
generator.warmup()
time_begin = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)
time_end = time.time()
time_total = time_end - time_begin
print(output)
print(f"Response generated in {time_total:.2f} seconds")
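Since this is the chat-tuned variant, prompts generally work better when wrapped in the Llama-2 chat template; a minimal sketch (the system message here is a placeholder, not something from the original card):
system_prompt = "You are a helpful assistant."  # placeholder system message
user_prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
chat_prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"
output = generator.generate_simple(chat_prompt, settings, max_new_tokens, seed=1234)
print(output)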