metadata

license: apache-2.0
datasets:
  - wiki40b
language:
  - ja
tags:
  - ja
  - japanese
  - text-generation
  - lm
  - jax
  - flax
  - lm1b

transformer-lm-japanese-0.1b

Model Description

This is a JAX/Flax-based transformer language model trained on a Japanese dataset. It is based on the official Flax example code (lm1b).

Model Sources

We modified Flax's 'lm1b' example to train on a Japanese dataset. The code is available on GitHub.

transformer-lm-japanese

Model Details

Model	Params	Layers	Dim	Heads	PPL	Dataset	Training time
lm1b-default	0.05B	6	512	8	22.67	lm1b	0.5 days
transformer-lm-japanese-0.1b	0.1B	12	768	12	35.22	wiki40b/ja	1.5 days
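
For readers unfamiliar with the PPL column, the sketch below shows the usual definition of perplexity: the exponential of the mean negative log-likelihood per token. It assumes the table follows that convention, with tokens being SentencePiece subwords. The function name and the example probabilities are hypothetical and do not come from the training code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: every token is predicted with probability 0.25,
# so the model is as uncertain as a uniform choice among 4 tokens.
example_log_probs = [math.log(0.25)] * 4
print(round(perplexity(example_log_probs), 3))

# Inverting the definition gives the mean per-token loss implied by a PPL value.
for name, ppl in [("lm1b-default", 22.67), ("transformer-lm-japanese-0.1b", 35.22)]:
    print(name, round(math.log(ppl), 2), "nats per token")
```

Under this definition, a PPL of 35.22 corresponds to a mean loss of roughly 3.56 nats per token, and 22.67 to roughly 3.12. The two rows use different datasets and tokenizers, so their PPL values are not directly comparable.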

Usage

The steps below generate text from the pretrained weights on a CPU. We used the following Google Compute Engine (GCE) instance with Python 3.8.

Machine Type: c2-standard-4 (4 CPUs, 16GB Memory)
Disk: 100GB (Standard Persistent Disk)
OS: Ubuntu 20.04 LTS x86/64

Install Python 3.8 and pip.

sudo apt-get update
sudo apt-get install python3.8 python3-pip build-essential

Install the huggingface_hub library.

pip install --upgrade huggingface_hub

Run the Python interpreter and download the model files.

cd $HOME
python3

>>> from huggingface_hub import hf_hub_download
>>> hf_hub_download(repo_id="fukugawa/transformer-lm-japanese-0.1b", filename="sentencepiece_model", revision="v1", local_dir="./logs/japanese_0.1b_v1", local_dir_use_symlinks=False)
>>> hf_hub_download(repo_id="fukugawa/transformer-lm-japanese-0.1b", filename="checkpoint_499999", revision="v1", local_dir="./logs/japanese_0.1b_v1", local_dir_use_symlinks=False)
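
Before moving on, it can help to confirm that both files landed in the directory later passed as `--workdir`. The following is a minimal sketch using only the standard library. The helper name `missing_model_files` is hypothetical. The file names and directory are taken from the download calls above. Save it as a script and run it after exiting the interpreter.

```python
from pathlib import Path

# File names taken from the hf_hub_download calls above.
EXPECTED_FILES = ("sentencepiece_model", "checkpoint_499999")


def missing_model_files(workdir):
    """Return the expected file names that are not present in workdir."""
    workdir = Path(workdir)
    return [name for name in EXPECTED_FILES if not (workdir / name).is_file()]


if __name__ == "__main__":
    # Matches local_dir="./logs/japanese_0.1b_v1" when run from $HOME.
    workdir = Path.home() / "logs" / "japanese_0.1b_v1"
    missing = missing_model_files(workdir)
    if missing:
        print(f"Missing in {workdir}: {', '.join(missing)}")
    else:
        print(f"All model files found in {workdir}")
```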

Clone the source code and install the necessary Python packages.

git clone -b 1.0.0.RC2 https://github.com/FookieMonster/transformer-lm-japanese
cd ./transformer-lm-japanese/transformer_lm
pip install -r requirements.txt

Install the necessary Python packages to run on the CPU.

pip install "jax[cpu]==0.3.2"
pip install chex==0.1.5
pip install protobuf==3.20.3
pip install typing-extensions==3.10.0.2

Generate text using the downloaded model files.

python3 generate_text.py --workdir=$HOME/logs/japanese_0.1b_v1 \
    --config=configs/japanese_0.1b_v1.py \
    --config.sampling_temperature=0.6 \
    --config.sampling_top_k=20 \
    --config.seed=0 \
    --config.prompts="夏目漱石は、" \
    --num_generated_texts=10

Generating text.
Sample: 夏目漱石は、自分の作品を「文学の本」として出版することを構想していた。
Generating text.
Sample: 夏目漱石は、明治の文学運動を「文学の原点に立ち帰る」と位置づけ、漱石が「文学の本質をあらわすのが文学である」との認識を、当時の知識人たちが持っていたことを指摘している。
Generating text.
Sample: 夏目漱石は、小説『坊っちゃん』で、この「坊っちゃん」を「坊っちゃん」に置き換えた。「坊っちゃん」は、坊っちゃんの「坊」の字を、「坊」は「坊」の字をもじってつけられた。
Generating text.
Sample: 夏目漱石は、漱石の『坊っちゃん』を読んで、「漱石は、私に『坊っちゃん』をおもしろおかしく書かせた。これは、私に『坊っちゃん』を書かせるのを、私に教えてくれたからだ」と述懐している。
Generating text.
Sample: 夏目漱石は、自身の著作『漱石全集』の中で「漱石が生涯のほとんどを漱石の文学に捧げた」と評価している。
Generating text.
Sample: 夏目漱石は、漱石が「『吾輩は猫』を観るのが嫌だ」と言ったのを、漱石が「あんなに怖いとは思わなかった」と返している。
Generating text.
Sample: 夏目漱石は、自身の日記の中で「文学の本質と現実との間には、対立関係があり、また対立関係があっても、それが文学の本質と現実との間には関係がある」と書いている。
Generating text.
Sample: 夏目漱石は、夏目が漱石の『吾輩は猫である』を読んでいた時に、漱石の『吾輩は猫である』を読んだという。漱石は「猫は猫である」と書いていたが、漱石は「猫である」と書いた。
Generating text.
Sample: 夏目漱石は、小説『坊っちゃん』の中で、主人公が「おばあさん」と「おばあさん」の2人で暮らしていると、その家から「おばあさん」と「おばあさん」が飛び出してくるという話を紹介している。
Generating text.
Sample: 夏目漱石は、漱石の「吾輩は猫である」という言葉を、漱石が「猫を飼っている人は猫である」という誤解から誤解したのだろうと、著書『猫の散歩道』で述べている。
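
The `sampling_temperature` and `sampling_top_k` options in the command above control how each next token is drawn. The sketch below illustrates the general technique (temperature scaling followed by top-k filtering) in plain Python. It is not the repository's JAX implementation, which may differ in detail. The function name and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.6, top_k=20, rng=random):
    """Sample one token id using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Keep only the ids of the top_k largest logits.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by the temperature: values below 1 sharpen the distribution.
    scaled = [logits[i] / temperature for i in top_ids]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy logits over a 5-token vocabulary (hypothetical values).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [sample_top_k(toy_logits, temperature=0.6, top_k=2, rng=rng) for _ in range(1000)]
print({token_id: samples.count(token_id) for token_id in sorted(set(samples))})
```

With `top_k=2`, only the two highest-scoring tokens can ever be drawn. Lowering the temperature shifts more of the probability mass onto the highest-scoring token, which generally makes output less varied.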

Dataset

wiki40b/ja

Tokenization

sentencepiece

Author

Ryoichi Fukugawa