A simple Llama 2 chat model fine-tuned on the mlabonne/guanaco-llama2-1k dataset. It loads easily on Colab, requiring about 10-13 GB of GPU RAM for inference.
Below is the code for inference:
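The snippet below references model_name, bnb_config, and device_map without defining them, so they need to be set up first. Here is a minimal setup sketch, assuming the transformers, accelerate, and bitsandbytes packages are installed and that the model is loaded with the 4-bit (QLoRA-style) quantization commonly used for this kind of fine-tune; the value assigned to model_name is a placeholder and should be replaced with this model's actual Hub repository ID.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# Placeholder: replace with this model's repository ID on the Hugging Face Hub
model_name = "<your-username>/llama-2-chat-guanaco"

# Assumed 4-bit quantization config (an assumption, not stated in the original README)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Map the whole model onto GPU 0
device_map = {"": 0}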
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False   # carried over from the fine-tuning script; can be set back to True for faster generation
model.config.pretraining_tp = 1  # 1 = use the standard linear-layer computation
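# Optional sanity check (not in the original script): print how much memory the
# loaded weights occupy, to compare against the 10-13 GB figure mentioned above
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")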
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
# Run text generation pipeline with our fine-tuned model
prompt = "How is beer manufactured?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
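If you prefer not to use the pipeline helper, the same prompt format works with the tokenizer and model.generate directly. This is a minimal sketch assuming the model and tokenizer loaded above:

# Tokenize the prompt, generate, and decode the output
inputs = tokenizer(f"<s>[INST] {prompt} [/INST]", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))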