Model Info (Internal):
- Size: 7B
- Dataset: The Pile v2
  - `contaminated(P3) + lower_code(5%) + wiki(fixed) + books3(fixed & broken)`
- Batch size (in tokens): 8M
- Checkpoint Step: 69,000 (552B tokens = 69,000 steps × 8M tokens/step)
- Checkpoint path (AWS East): `/fsx/ckpts/7b_tok=neox_data=pilev2-recontam_lower-code_bs=8m_tp=4_pp=1_init=wang-small-init/global_step69000_hf`

Notes:
- Trained for 36k steps with an incorrectly tokenized Books3 dataset (GPT-2 tokenizer instead of the NeoX tokenizer); see the tokenizer-mismatch sketch after these notes
- Tensor parallelism was tp=2, not tp=4 as the checkpoint path suggests
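
Why the Books3 tokenizer mix-up matters: token ids are vocabulary-specific, so GPT-2 ids read back under the NeoX vocabulary decode to different, garbled text. The snippet below is a minimal sketch of that mismatch (not part of the training code); both tokenizer names are public Hugging Face checkpoints.

```python
from transformers import AutoTokenizer

# Sketch: encode the same text with both tokenizers and compare.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "The dog sat on a man's lap and barked 3 times."
gpt2_ids = gpt2_tok.encode(text)
neox_ids = neox_tok.encode(text)

print(gpt2_ids)  # ids under the GPT-2 vocabulary
print(neox_ids)  # ids under the NeoX vocabulary (generally different)

# Interpreting GPT-2 ids with the NeoX vocabulary yields corrupted text,
# which is roughly what the mis-tokenized Books3 shard amounted to in training.
print(neox_tok.decode(gpt2_ids))
```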

W&B Report: https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-7B-alpha---Vmlldzo2MjA

Usage:

```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("CarperAI/7b-alpha")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so batched decoder-only generation continues from the end of each prompt

prompts = [
    "User1: The dog sat on a man's lap and barked 3 times.\nUser2: How many times did the dog bark?"
    "Curious Person Question: A group of genetically identical individuals is called what?\nSmart Person Answer: a clone\n\nCurious Person Question: Who proposed the theory of evolution by natural selection?\nSmart Person Answer:"
]
batch_encoding = tokenizer(prompts, return_tensors="pt", padding=True)

print(f"Generating {len(prompts)} prompts...")
samples = model.generate(
    **batch_encoding,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
    pad_token_id=tokenizer.pad_token_id,
)
samples = tokenizer.batch_decode(samples, skip_special_tokens=True)
for prompt, sample in zip(prompts, samples):
    print(f"Prompt: {prompt}\nSample: {sample}\n")
```