	---
	license: apache-2.0
	datasets:
	- wiki40b
	language:
	- ja
	tags:
	- ja
	- japanese
	- text-generation
	- lm
	- jax
	- flax
	- lm1b
	---
	# transformer-lm-japanese-0.1b

	## Model Description

	This is a JAX/Flax-based transformer language model trained on a Japanese dataset. It is based on the official Flax example code ([lm1b](https://github.com/google/flax/tree/main/examples/lm1b)).

	## Model Sources

	We modified Flax's `lm1b` example to train on a Japanese dataset. The code is available on GitHub.

	* [transformer-lm-japanese](https://github.com/FookieMonster/transformer-lm-japanese)

	## Model Details

	| Model | Params | Layers | Dim | Heads | PPL | Dataset | Training time |
	|-|-|-|-|-|-|-|-|
	| lm1b-default | 0.05B | 6 | 512 | 8 | 22.67 | lm1b | 0.5 days |
	| transformer-lm-japanese-0.1b | 0.1B | 12 | 768 | 12 | 35.22 | wiki40b/ja | 1.5 days |
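
To help read the PPL column, here is a minimal sketch of how perplexity relates to an evaluation loss. It assumes the usual definition, the exponential of the mean per-token cross-entropy in nats. This card does not state how the reported values were computed, so treat the helper names `perplexity` and `mean_nll` as illustrative only.

```python
import math


def perplexity(mean_nll_nats):
    """Convert a mean per-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll_nats)


def mean_nll(perplexity_value):
    """Inverse of perplexity(): recover the mean per-token loss in nats."""
    return math.log(perplexity_value)


# PPL values copied from the table above; the loss values are derived here,
# under the assumed definition, and are not reported by the model card.
for name, ppl in [("lm1b-default", 22.67), ("transformer-lm-japanese-0.1b", 35.22)]:
    print(f"{name}: PPL {ppl} <-> mean loss {mean_nll(ppl):.3f} nats/token")
```

The two rows use different datasets and tokenizers, so their perplexities are probably not directly comparable.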

	## Usage

	This section explains how to generate text from the pretrained weights on a CPU. We used the following GCE instance with a Python 3.8 environment.

	* Machine Type: c2-standard-4 (4 CPUs, 16GB Memory)
	* Disk: 100GB (Standard Persistent Disk)
	* OS: Ubuntu 20.04 LTS x86/64

	Install Python 3.8 and pip.

	```
	sudo apt-get update
	sudo apt-get install python3.8 python3-pip build-essential
	```

	Install the huggingface_hub library.

	```
	pip install --upgrade huggingface_hub
	```

	Run the Python interpreter and download the model files.

	```
	cd $HOME
	python3
	```

	```python
	>>> from huggingface_hub import hf_hub_download
	>>> hf_hub_download(repo_id="fukugawa/transformer-lm-japanese-0.1b", filename="sentencepiece_model", revision="v1", local_dir="./logs/japanese_0.1b_v1", local_dir_use_symlinks=False)
	>>> hf_hub_download(repo_id="fukugawa/transformer-lm-japanese-0.1b", filename="checkpoint_499999", revision="v1", local_dir="./logs/japanese_0.1b_v1", local_dir_use_symlinks=False)
	```

	Clone the source code and install the necessary Python packages.

	```
	git clone -b 1.0.0.RC2 https://github.com/FookieMonster/transformer-lm-japanese
	cd ./transformer-lm-japanese/transformer_lm
	pip install -r requirements.txt
	```

	Install the necessary Python packages to run on the CPU.

	```
	pip install "jax[cpu]==0.3.2"
	pip install chex==0.1.5
	pip install protobuf==3.20.3
	pip install typing-extensions==3.10.0.2
	```

	Generate text using the downloaded model files.

	```
	python3 generate_text.py --workdir=$HOME/logs/japanese_0.1b_v1 \
	--config=configs/japanese_0.1b_v1.py \
	--config.sampling_temperature=0.6 \
	--config.sampling_top_k=20 \
	--config.seed=0 \
	--config.prompts="夏目漱石は、" \
	--num_generated_texts=10
	```

	```
	Generating text.
	Sample: 夏目漱石は、自分の作品を「文学の本」として出版することを構想していた。
	Generating text.
	Sample: 夏目漱石は、明治の文学運動を「文学の原点に立ち帰る」と位置づけ、漱石が「文学の本質をあらわすのが文学である」との認識を、当時の知識人たちが持っていたことを指摘している。
	Generating text.
	Sample: 夏目漱石は、小説『坊っちゃん』で、この「坊っちゃん」を「坊っちゃん」に置き換えた。「坊っちゃん」は、坊っちゃんの「坊」の字を、「坊」は「坊」の字をもじってつけられた。
	Generating text.
	Sample: 夏目漱石は、漱石の『坊っちゃん』を読んで、「漱石は、私に『坊っちゃん』をおもしろおかしく書かせた。これは、私に『坊っちゃん』を書かせるのを、私に教えてくれたからだ」と述懐している。
	Generating text.
	Sample: 夏目漱石は、自身の著作『漱石全集』の中で「漱石が生涯のほとんどを漱石の文学に捧げた」と評価している。
	Generating text.
	Sample: 夏目漱石は、漱石が「『吾輩は猫』を観るのが嫌だ」と言ったのを、漱石が「あんなに怖いとは思わなかった」と返している。
	Generating text.
	Sample: 夏目漱石は、自身の日記の中で「文学の本質と現実との間には、対立関係があり、また対立関係があっても、それが文学の本質と現実との間には関係がある」と書いている。
	Generating text.
	Sample: 夏目漱石は、夏目が漱石の『吾輩は猫である』を読んでいた時に、漱石の『吾輩は猫である』を読んだという。漱石は「猫は猫である」と書いていたが、漱石は「猫である」と書いた。
	Generating text.
	Sample: 夏目漱石は、小説『坊っちゃん』の中で、主人公が「おばあさん」と「おばあさん」の2人で暮らしていると、その家から「おばあさん」と「おばあさん」が飛び出してくるという話を紹介している。
	Generating text.
	Sample: 夏目漱石は、漱石の「吾輩は猫である」という言葉を、漱石が「猫を飼っている人は猫である」という誤解から誤解したのだろうと、著書『猫の散歩道』で述べている。
	```
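
The `sampling_temperature` and `sampling_top_k` options in the command above control how each next token is drawn. The sketch below shows the general technique of temperature scaling combined with top-k filtering in plain Python. It is not the code in `generate_text.py`; the function `sample_top_k` and the toy logits are hypothetical and exist only for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=0.6, top_k=20, rng=None):
    """Sample one token id from `logits` with temperature and top-k filtering.

    `temperature` must be greater than 0.
    """
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k largest logits.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by the temperature: values below 1.0 sharpen the distribution.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving candidates.
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Hypothetical scores for a 5-token vocabulary (not taken from the model).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)  # fixed seed, analogous in spirit to --config.seed=0
samples = [sample_top_k(toy_logits, temperature=0.6, top_k=2, rng=rng) for _ in range(1000)]
print({token_id: samples.count(token_id) for token_id in sorted(set(samples))})
```

With `top_k=2` only the two highest-scoring tokens can ever be drawn, and a temperature below 1.0 shifts more probability toward the best one. Lower temperatures and smaller `top_k` values generally give more repetitive but more conservative text.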

	## Dataset

	* wiki40b/ja

	## Tokenization

	* [sentencepiece](https://github.com/google/sentencepiece)

	## Author

	[Ryoichi Fukugawa](https://huggingface.co/fukugawa)