- The model is fine-tuned from Pygmalion-6B using LoRA and 8-bit quantization on a single RTX 2080 Ti
- LoRA parameters: r=8, alpha=16, dropout=0.05, bias='none' (a minimal LoraConfig sketch is shown below)
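The exact training script is not included here; the following is only a minimal sketch of what this LoRA setup could look like with the peft library, using the parameters listed above. The target_modules and task_type values are assumptions and are not stated in this card.
import torch
from transformers import GPTJForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Base model loaded in 8-bit, matching the setup described above
base_model = GPTJForCausalLM.from_pretrained(
    'pygmalion-6b', load_in_8bit=True, device_map='auto', torch_dtype=torch.float16
)
base_model = prepare_model_for_int8_training(base_model)  # prepare_model_for_kbit_training in newer peft versions

lora_config = LoraConfig(
    r=8,                    # rank of the LoRA update matrices
    lora_alpha=16,          # LoRA scaling factor
    lora_dropout=0.05,
    bias='none',
    target_modules=['q_proj', 'v_proj'],  # assumed GPT-J attention projections; not stated in this card
    task_type='CAUSAL_LM',
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable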
- All prompts are formatted as conversation pairs of [User] and [Yaya] (a small formatting helper is sketched after the example). For example:
[User]: Compose a long Gulf War story set in Istanbul, where a content chair encounters Elizabeth Barrett Browning, inspired by the works of Henrik Ibsen.
[Yaya]: As an old and content chair sat in the corner of a bustling coffee house in Istanbul, he witnessed the city's frenzied energy. It was the early 1990s, and the Gulf War had just begun.
There was great political strife in the air, and tensions were high. People rushed about, frantically discussing the war and its potential impact on their lives. [...]
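A minimal sketch of assembling a prompt in this format; the helper name format_pair is illustrative and not part of the released code.
def format_pair(user_text, yaya_text=''):
    # Assemble one prompt in the [User]/[Yaya] format (illustrative helper)
    prompt = f'[User]: {user_text}\n[Yaya]:'
    if yaya_text:
        prompt += f' {yaya_text}'
    return prompt

print(format_pair("What's the best food in Hanoi?"))
# [User]: What's the best food in Hanoi?
# [Yaya]: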
- Load the LoRA weights with a PEFT model:
import torch
from transformers import GPTJForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModel

pretrain_name = 'pygmalion-6b'  # the original pretrained checkpoint
lora_weights = 'kietbs/pygmalion_6B_yaya'  # please download the weights and change this path accordingly
load_in_8bit = True

# Load the base model in 8-bit and attach the LoRA weights
model = GPTJForCausalLM.from_pretrained(pretrain_name, load_in_8bit=load_in_8bit, device_map='auto', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16, device_map={'': 0})
model = torch.compile(model)  # optional; requires PyTorch 2.0+
tokenizer = AutoTokenizer.from_pretrained(pretrain_name)  # the original pretrained tokenizer
gen_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)
text = "[User]: What's the best food in Hanoi?"
input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to('cuda')
with torch.no_grad():
    output = model.generate(input_ids=input_ids, generation_config=gen_config, return_dict_in_generate=True, output_scores=True, max_new_tokens=256)
s = output.sequences[0]
output = tokenizer.decode(s)
print('Raw:', output)
Output: [User]: What's the best food in Hanoi? [Yaya]: The best food in Hanoi can vary depending on what you're looking for. Some of the most popular dishes include pho, banh mi, banh xeo, and bún chả.<|endoftext|>
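Note that the raw decode echoes the prompt and ends with the <|endoftext|> marker. A minimal post-processing sketch to keep only Yaya's reply; the helper name extract_reply is illustrative.
def extract_reply(raw):
    # Keep only Yaya's reply from the raw decoded text (illustrative helper)
    reply = raw.split('[Yaya]:', 1)[-1]          # drop the echoed prompt
    reply = reply.split('<|endoftext|>', 1)[0]   # drop the end-of-text marker
    return reply.strip()

print('Reply:', extract_reply(output))
# Reply: The best food in Hanoi can vary depending on what you're looking for. [...]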