NorGLM committed on
Commit
25ca469
1 Parent(s): c840ef8

Update README.md

  1. README.md +90 -0
README.md CHANGED
@@ -1,3 +1,93 @@
  ---
  license: cc-by-nc-sa-4.0
+ datasets:
+ - NorGLM/NO-ConvAI2
+ language:
+ - 'no'
+ pipeline_tag: text-generation
  ---
+
+ # Model Card
+
+ NorGPT-369M-conversation-peft is trained on top of the [NorGPT-369M](https://huggingface.co/NorGLM/NorGPT-369M) model on the [NO-ConvAI2](https://huggingface.co/datasets/NorGLM/NO-ConvAI2) dataset.
+
+ Prompt format:
+ ```
+ Human: {prompt} Robot: |||\n {answer}
+ ```
+
+ Inference prompt:
+ ```
+ Human: {prompt} Robot: |||\n
+ ```
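
As a minimal sketch, the two templates above could be filled in from Python as follows. The helper names `build_inference_prompt` and `build_training_example` are hypothetical and not part of the NorGLM code, and the `\n` in the templates is assumed to stand for a real newline character. Prompts taken directly from the NO-ConvAI2 files may already carry the `Human:` prefix, in which case no wrapping is needed.

```python
# Hypothetical helpers; the template strings are copied from the model card,
# and "\n" is assumed to be a real newline character.
INFERENCE_TEMPLATE = "Human: {prompt} Robot: |||\n"
TRAINING_TEMPLATE = INFERENCE_TEMPLATE + " {answer}"


def build_inference_prompt(prompt: str) -> str:
    """Wrap a single user message in the inference prompt format."""
    return INFERENCE_TEMPLATE.format(prompt=prompt.strip())


def build_training_example(prompt: str, answer: str) -> str:
    """Wrap a user message and its reference answer in the training format."""
    return TRAINING_TEMPLATE.format(prompt=prompt.strip(), answer=answer.strip())


# Show the exact string, including the trailing newline, that would be tokenized.
print(repr(build_inference_prompt("Hei, hvordan går det?")))
```

Keeping both templates in one place makes it less likely that the prompt sent at inference time drifts from the format used during fine-tuning.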
23
+
24
+ ## Run the Model
25
+ ```python
26
+ from peft import PeftModel, PeftConfig
27
+ from transformers import AutoModelForCausalLM, AutoTokenizer
28
+ import torch
29
+ from tqdm.auto import tqdm
30
+
31
+ source_model_id = "NorGLM/NorGPT-369M"
32
+ peft_model_id = "NorGLM/NorGPT-369M-conversation-peft"
33
+
34
+ config = PeftConfig.from_pretrained(peft_model_id)
35
+ model = AutoModelForCausalLM.from_pretrained(source_model_id, device_map='balanced')
36
+
37
+ tokenizer_max_len = 2048
38
+ tokenizer_config = {'pretrained_model_name_or_path': source_model_id,
39
+ 'max_len': tokenizer_max_len}
40
+ tokenizer = tokenizer = AutoTokenizer.from_pretrained(**tokenizer_config)
41
+ tokenizer.pad_token = tokenizer.eos_token
42
+
43
+ model = PeftModel.from_pretrained(model, peft_model_id)
44
+ ```
+
+ ## Inference Example
+ Load the model to evaluate on the test set of the NO-ConvAI2 dataset:
+ ```python
+ import pandas as pd
+ from datasets import load_dataset
+
+ def load_and_prepare_data_last_prompt(df):
+     """Separate the last human turn from the full prompt."""
+     # Columns: id, turn_id, prompt, answer
+     df['last_prompt'] = ["Human: " + p.split("Human:")[-1] for p in df['prompt']]
+     return df
+
+ def generate_text(text, max_length=200):
+     # Generate with greedy search; inputs longer than max_length tokens are truncated
+     model_inputs = tokenizer(text, return_attention_mask=True, return_tensors="pt",
+                              padding=True, truncation=True, max_length=max_length)
+     model_inputs = model_inputs.to(model.device)
+
+     with torch.no_grad():
+         output_tokens = model.generate(
+             **model_inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
+
+     # Each decoded string contains the prompt followed by the generated continuation
+     text_outputs = [tokenizer.decode(
+         x, skip_special_tokens=True) for x in output_tokens]
+
+     return text_outputs
+
+ print("--LOADING EVAL DATA---")
+ # With data_files, load_dataset puts the file under the default 'train' split
+ eval_data = load_dataset("NorGLM/NO-ConvAI2", data_files="test_PersonaChat_prompt.json")
+ prompts = eval_data['train']['prompt']
+ positive_samples = eval_data['train']['answer']
+
+ print("--MAKING PREDICTIONS---")
+ model.eval()
+
+ output_file = "<output file name>"  # replace with the path of the CSV file to write
+ generated_text = []
+
+ for prompt in tqdm(prompts):
+     # generate_text returns a one-element list for a single prompt
+     generated_text.extend(generate_text(prompt, max_length=tokenizer_max_len))
+
+ df = pd.DataFrame({'prompts': prompts, 'generated_text': generated_text, 'positive_sample': positive_samples})
+
+ print("Save results to csv file...")
+ df.to_csv(output_file)
+ ```
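
For a decoder-only model, `generate` returns the prompt tokens followed by the continuation, so each decoded string begins with the prompt itself. The sketch below shows one way the reply could be separated before scoring. It assumes the `|||` marker from the inference prompt survives decoding, and the helper `extract_answer` is hypothetical, not part of the NorGLM code.

```python
def extract_answer(generated: str, separator: str = "|||") -> str:
    """Return the text after the last separator, or the whole string if absent."""
    # rpartition yields ("", "", generated) when the separator is missing,
    # so the final element is the right fallback in both cases.
    _, _, answer = generated.rpartition(separator)
    return answer.strip()


# Example with a made-up decoded string in the model card's prompt format.
print(extract_answer("Human: Hei, hvordan går det? Robot: |||\n Det går bra, takk!"))
```

If the model keeps generating further `Human:` turns after its reply, the extracted text may need additional trimming; that case is not handled here.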
+
+ ## Note
+ More training details will be released soon!