---
language:
- ja
tags:
- japanese-stablelm
- causal-lm
pipeline_tag: text-generation
datasets:
- wikipedia
- mc4
- cc100
- oscar-corpus/OSCAR-2301
- oscar-corpus/OSCAR-2201
- togethercomputer/RedPajama-Data-1T
license:
- apache-2.0
---

# Japanese-StableLM-Base-Alpha-7B

![japanese-stablelm-icon](./japanese-stablelm-parrot.jpg)

> "A parrot able to speak Japanese, ukiyoe, edo period" - [Stable Diffusion XL](https://clipdrop.co/stable-diffusion)

## Model Description

`japanese-stablelm-base-alpha-7b` is a 7B-parameter decoder-only language model pre-trained on a diverse collection of Japanese and English datasets, with a focus on maximizing Japanese language modeling performance and Japanese downstream task performance.

For an instruction-following model, see [Japanese-StableLM-Instruct-Alpha-7B](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b); access requires accepting its terms and conditions.

## Usage

First install additional dependencies in [requirements.txt](./requirements.txt):

```sh
pip install sentencepiece einops
```

Then start generating text with `japanese-stablelm-base-alpha-7b` by using the following code snippet:

```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1")

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/japanese-stablelm-base-alpha-7b",
    trust_remote_code=True,
)
model.half()

if torch.cuda.is_available():
    model = model.to("cuda")

# English gloss of the prompt: "To accelerate scientific research with AI,"
prompt = """
AI で科学研究を加速するには、
""".strip()

input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)

# Fix the seed for reproducibility;
# change it to get a different result.
seed = 23
torch.manual_seed(seed)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=1,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=False)
print(out)
"""
 AI で科学研究を加速するには、データ駆動型文化が必要であることも明らかになってきています。研究のあらゆる側面で、データがより重要になっているのです。
20 世紀の科学は、研究者が直接研究を行うことで、研究データを活用してきました。その後、多くの科学分野ではデータは手動で分析されるようになったものの、これらの方法には多大なコストと労力がかかることが分かりました。 そこで、多くの研究者や研究者グループは、より効率的な手法を開発し、研究の規模を拡大してきました。21 世紀になると、研究者が手動で実施する必要のある研究は、その大部分を研究者が自動化できるようになりました。
"""
```
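
Because `generate` on a decoder-only model returns the prompt tokens followed by the newly sampled ones, the decoded text above begins with the prompt. The following is a minimal sketch of dropping the prompt before decoding. `continuation_ids` is a hypothetical helper (not part of `transformers`), and it works on plain lists so it can be tried without loading the model:

```python
def continuation_ids(prompt_ids, output_ids):
    """Return only the tokens that follow the prompt.

    Assumes output_ids starts with prompt_ids, which is how
    decoder-only models return sequences from generate().
    """
    n = len(prompt_ids)
    if list(output_ids[:n]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt")
    return list(output_ids[n:])


# Toy token IDs standing in for input_ids[0].tolist() and tokens[0].tolist()
prompt_ids = [101, 7, 42]
output_ids = [101, 7, 42, 900, 13, 5]
print(continuation_ids(prompt_ids, output_ids))  # [900, 13, 5]
```

With the snippet above, this would be used as `tokenizer.decode(continuation_ids(input_ids[0].tolist(), tokens[0].tolist()))`.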

We suggest experimenting with different generation settings (`top_p`, `repetition_penalty`, etc.) to find the best setup for your tasks. For example, use a higher temperature for roleplay tasks and a lower temperature for reasoning.
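
One way to organize such experiments is to keep per-task presets and merge them into the keyword arguments passed to `model.generate`. This is only a sketch: the preset names and every number below are illustrative assumptions, not tuned or recommended values. The keys themselves (`temperature`, `top_p`, `repetition_penalty`, `do_sample`, `max_new_tokens`) are standard `generate` arguments.

```python
# Defaults mirroring the usage snippet above.
BASE = {"max_new_tokens": 128, "do_sample": True, "top_p": 0.95, "temperature": 1.0}

# Hypothetical presets; the values are illustrative starting points only.
PRESETS = {
    "roleplay": {"temperature": 1.2, "repetition_penalty": 1.1},
    "reasoning": {"temperature": 0.3, "top_p": 0.9},
}


def generation_kwargs(task=None, **overrides):
    """Merge the defaults, an optional task preset, and explicit overrides."""
    kwargs = dict(BASE)                   # copy so BASE is never mutated
    kwargs.update(PRESETS.get(task, {}))  # unknown or missing task keeps defaults
    kwargs.update(overrides)              # explicit arguments win
    return kwargs


print(generation_kwargs("reasoning"))
```

The result can be passed straight through, for example `model.generate(input_ids.to(device=model.device), **generation_kwargs("roleplay"))`.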

## Model Details

* **Model type**: `japanese-stablelm-base-alpha-7b` is an auto-regressive language model based on the NeoX transformer architecture.
* **Language(s)**: Japanese
* **Library**: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
* **License**: This model is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).


## Training

| Parameters | Hidden Size | Layers | Heads | Sequence Length |
|------------|-------------|--------|-------|-----------------|
| 7B         | 4096        | 32     | 32    | 2048            |
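
As a rough sanity check on the table, the per-head dimension and an approximate weight count for the layer stack can be derived from these numbers. The sketch below assumes a textbook transformer block (about 4·d² attention weights plus 8·d² MLP weights, so 12·d² per layer). The actual block layout of this model may differ, and embeddings, biases, and normalization parameters are not counted.

```python
hidden_size, layers, heads, seq_len = 4096, 32, 32, 2048

# Each attention head works on an equal slice of the hidden state.
head_dim = hidden_size // heads

# Rule of thumb under the stated assumption: 12 * d^2 weights per layer.
block_params = 12 * hidden_size ** 2
stack_params = layers * block_params

print(head_dim)      # 128
print(stack_params)  # 6442450944
```

Under that assumption the layer stack holds roughly 6.4B weights; the rest of the nominal 7B would come from embedding matrices and any differences in block layout.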

### Training Dataset

`japanese-stablelm-base-alpha-7b` is pre-trained on around 750B tokens from a mixture of the following corpora:

- [Japanese/English Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- [Japanese mc4](https://huggingface.co/datasets/mc4)
- [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz)
- [Japanese OSCAR](https://oscar-project.github.io/documentation/)
- [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)

## Use and Limitations

### Intended Use

The model is intended to be used by anyone as a foundation model for application-specific fine-tuning, without strict limitations on commercial use.

### Limitations and bias

The pre-training dataset may have contained offensive or inappropriate content even after data cleansing filters were applied, and this can be reflected in the model-generated text. We recommend that users exercise reasonable caution when using these models in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.

## Authors
- [Meng Lee](https://huggingface.co/leemeng)
- [Fujiki Nakamura](https://huggingface.co/fujiki)
- [Makoto Shing](https://huggingface.co/mkshing)
- [Paul McCann](https://huggingface.co/polm-stability)
- [Takuya Akiba](https://huggingface.co/iwiwi)
- [Naoki Orii](https://huggingface.co/mrorii)

## Acknowledgements

We are utilizing the v1 version of the [novelai-tokenizer](https://github.com/NovelAI/novelai-tokenizer), introduced by [NovelAI](https://novelai.net/), because it processes both Japanese and English text effectively and efficiently. We extend our gratitude to NovelAI for allowing us to use their remarkable work. For more details about the tokenizer, please refer to their [blog post](https://blog.novelai.net/novelais-new-llm-tokenizer-5bc140e17642).

We are grateful for the contributions of the EleutherAI Polyglot-JA team in helping us to collect a large amount of pre-training data in Japanese. Polyglot-JA members include Hyunwoong Ko (Project Lead), Fujiki Nakamura (who originally started this project when he committed to the Polyglot team), Yunho Mo, Minji Jung and Su-Kyeong Jang.

We are also appreciative of [AI Novelist/Sta (Bit192, Inc.)](https://ai-novel.com/index.php) and the numerous contributors from [Stable Community Japan](https://discord.gg/VPrcE475HB) for assisting us in gathering a large amount of high-quality Japanese textual data for model training.

## How to cite
```
@misc{JapaneseStableLMBaseAlpha7B, 
      url={https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b}, 
      title={Japanese StableLM Base Alpha 7B}, 
      author={Lee, Meng and Nakamura, Fujiki and Shing, Makoto and McCann, Paul and Akiba, Takuya and Orii, Naoki}
}
```

## Citations

```bibtex
@software{gpt-neox-library,
  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
  url = {https://www.github.com/eleutherai/gpt-neox},
  doi = {10.5281/zenodo.5879544},
  month = {8},
  year = {2021},
  version = {0.0.1},
}
```