File size: 5,437 Bytes
a694196 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 |
---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
datasets:
- mc4
- wikipedia
- EleutherAI/pile
- oscar-corpus/colossal-oscar-1.0
- cc100
language:
- ja
- en
tags:
- qwen
inference: false
---
# `rinna/nekomata-7b`
![rinna-icon](./rinna.png)
# Overview
We conduct continual pre-training of [qwen-7b](https://huggingface.co/Qwen/Qwen-7B) on **30B** tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. It also enjoys the following great features provided by the original Qwen model.
* The inclusive Qwen vocabulary (vocab size > 150k) enables the model to processs Japanese texts much more efficiently than the previously released [youri series](https://huggingface.co/collections/rinna/youri-7b-654053610cb8e9d8e6289efc).
* The model supports a maximum sequence length of 32768.
The name `nekomata` comes from the Japanese word [`猫又/ねこまた/Nekomata`](https://ja.wikipedia.org/wiki/%E7%8C%AB%E5%8F%88), which is a kind of Japanese mythical creature ([`妖怪/ようかい/Youkai`](https://ja.wikipedia.org/wiki/%E5%A6%96%E6%80%AA)).
* **Library**
The model was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox).
* **Model architecture**
A 32-layer, 4096-hidden-size transformer-based language model. Please refer to the [Qwen paper](https://arxiv.org/abs/2309.16609) for architecture details.
* **Continual pre-training**
The model was initialized with the [qwen-7b](https://huggingface.co/Qwen/Qwen-7B) model and continually trained on around **30B** tokens from a mixture of the following corpora
- [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz)
- [Japanese C4](https://huggingface.co/datasets/mc4)
- [Japanese OSCAR](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0)
- [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
- [Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- rinna curated Japanese dataset
* **Authors**
- [Tianyu Zhao](https://huggingface.co/tianyuz)
- [Akio Kaga](https://huggingface.co/rakaga)
- [Kei Sawada](https://huggingface.co/keisawada)
---
# Benchmarking
Please refer to [rinna's LM benchmark page](https://rinnakk.github.io/research/benchmarks/lm/index.html).
---
# How to use the model
~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-7b", trust_remote_code=True)
# Use GPU with bf16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, bf16=True)
# Use GPU with fp16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, fp16=True)
# Use CPU
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="cpu", trust_remote_code=True)
# Automatically select device and precision
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True)
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
with torch.no_grad():
output_ids = model.generate(
token_ids.to(model.device),
max_new_tokens=200,
min_new_tokens=200,
do_sample=True,
temperature=1.0,
top_p=0.95,
pad_token_id=tokenizer.pad_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id
)
output = tokenizer.decode(output_ids.tolist()[0])
print(output)
~~~~
---
# Tokenization
The model uses the original Qwen tokenizer. It augments the [`cl100k` tiktoken tokenizer](https://github.com/openai/tiktoken) and has a vocabulary size of 151,936. The inclusive vocabulary helps the model to reach a better tokenization efficiency, especially for Japanese texts.
We compared the `Qwen` tokenizer (as used in `nekomata`) and the `llama-2` tokenizer (as used in `youri`) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e. the average number of tokens produced from 1 byte of text) as following. A lower byte2token rate indicates a better tokenization efficiency.
| Tokenizer | Japanese | English | Multilingual |
| --- | --- | --- | --- |
| Qwen | 0.24 | 0.27 | 0.27 |
| llama-2 | 0.40 | 0.29 | 0.36 |
---
# How to cite
~~~
@misc{RinnaNekomata7b,
url={https://huggingface.co/rinna/nekomata-7b},
title={rinna/nekomata-7b},
author={Zhao, Tianyu and Kaga, Akio and Wakatsuki, Toshiaki and Sawada, Kei}
}
~~~
---
# Citations
~~~
@software{gpt-neox-library,
title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
url = {https://www.github.com/eleutherai/gpt-neox},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
}
~~~
---
# License
[Tongyi Qianwen LICENSE AGREEMENT](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) |