	---
	extra_gated_prompt: >-
	### 「LLM-jp-3 172B beta1」利用規約

	この利用規約（以下「本規約」といいます）は、大学共同利用機関法人情報・システム研究機構国立情報学研究所（以下「提供者」といいます）による開発の成果物として公開する大規模言語モデル「LLM-jp-3 172B beta1」（以下「本プログラム」といいます）の利用に関する条件を定めるものです。本プログラムの利用者（以下「利用者」といいます）は、本規約に同意した上で本プログラムを利用するものとします。

	- 第１条（利用許諾）
	1. 本プログラムの利用者は、本規約とは別に定める方法により本プログラムの利用を申請し、提供者から個別の許諾を得るものとします。
	2. 利用者は、本規約に従い、本プログラムを商用または非商用目的を問わず利用することができます。利用者は、本プログラムの改変、複製を行うことができますが、本プログラムおよび本プログラムを改変し作成したプログラム（以下「改変物」といいます）の再配布を行うことはできません。利用者は、本プログラムもしくは改変物を用いてサービスを提供することはできますが、サービスの利用者が本プログラムまたは改変物を直接取得することができる形での提供はできません。
	3. 本規約に違反した利用者は、本プログラムを利用することはできません。

	- 第２条（責任）
	1. 利用者は、本プログラムは現状有姿で提供され、提供者は、明示または黙示を問わず、本プログラムに関し、その正確性、完全性、最新性、および品質など、いかなる保証も行わず、利用者が本プログラムを利用したこと、利用できなかったことにより生じた一切の損害について責任を負わないことを、予め承諾するものとします。
	2. 利用者は、利用者による本プログラムの利用により、または、利用者が本利用規約に違反したことにより提供者が損害を被った場合、当該損害を賠償するものとします。
	3. 利用者は、自己の責任と判断において利用するものとし、本プログラムの利用に関して、第三者との間で生じた紛争について、自らの責任と負担で対応し、提供者に一切の迷惑を掛けないものとします。利用者は本プログラムの利用によって生じた損害について自己の責任で対処するものとします。

	- 第３条（禁止行為）
	利用者は本プログラムを利用して以下の行為を行わないものとします。
	(1) 提供者もしくは第三者の知的財産権を侵害する行為、または侵害するおそれのある行為
	(2) 提供者もしくは第三者の財産、プライバシーもしくは肖像権を侵害する行為、または侵害するおそれのある行為
	(3) 提供者もしくは第三者を差別もしくは誹謗中傷・侮辱し、他者への差別を助長し、または名誉もしくは信用を毀損する行為
	(4) 提供者もしくは第三者への迷惑行為、または迷惑になる恐れのある行為
	(5) 許可されていない法律業務に従事したり、有資格の専門家以外からの法律アドバイスを提供したりする行為
	(6) 有資格の専門家以外からの財務アドバイスを提供する行為
	(7) 健康への助言や治療方法の提示などを含む医療行為
	(8) その他法令に基づく許可等が必要な行為

	- 第４条（制約事項）
	1. 利用者は、本プログラムを用いた処理の結果物（以下「処理結果」という）には、虚偽や偏り、他人の権利を侵害する内容、または利用者の想定する有効性や有用性を満たさない内容が含まれている場合があることを承諾し、不正確・不適切な処理結果により、自ら又は第三者の損害や権利侵害の発生、倫理的懸念が起こり得るという前提に立ち本プログラムを利用するものとします。利用者は、処理結果の正誤や適法性、倫理的妥当性を自ら確認の上、利用するものとします。利用者が処理結果を含め本プログラムを用いたことにより、利用者自身又は第三者の権利侵害を発生させた場合、提供者はその損害に対して一切の責任を負わないものとし、利用者は提供者に対し一切の迷惑を掛けないものとします。
	2. 利用者は処理結果について、それぞれの国や地域において法令などの規制を順守した上で利用するものとします。
	3. 利用者は、処理結果を第３条（禁止事項）に記載の行為に利用しないものとします。

	- 第５条（権利帰属等）
	1. 利用者は、本利用規約で明示で定めるものを除き本プログラムに関する一切の権利を取得することはありません。
	2. 利用者は、本プログラム改変物の作成によって新たに発生した権利を取得しますが、改変物の利用に当たっては本利用規約に従って利用するものとします。
	3. 提供者は処理結果について、権利主張を行わないものとします。

	- 第６条（輸出取引）
	利用者は、本プログラムおよび処理結果の利用に関連して外国為替及び外国貿易法（これに関連する政省令を含む）または米国輸出管理法令で規定する許可が必要な輸出を行うときは、利用者自らが所定の許可を取得するものとします。

	- 第７条（管轄裁判所）
	本利用規約に関し生じた紛争については、東京地方裁判所をもって第一審の専属的合意管轄裁判所とします。

	- 第８条（準拠法）
	本利用規約は日本法に準拠します。

	- 第９条（その他の規定）
	本規約は、本プログラムの利用者と提供者との間の利用に関する全ての事項を定めるものであり、本規約に定めのない事項については、関係法令に従うものとします。

	- 第１０条（言語）
	本規約は日本語を正本とします。本規約の英訳版は、参考のために作成されたものであり、何らの法的拘束力もないものとします。

	以上

	### LLM-jp-3 172B beta1 Terms of Use

	This Terms of Use (hereinafter referred to as "TOU") sets forth the conditions for the use of the large-scale language model LLM-jp-3 172B beta1 (hereinafter referred to as "the Program") that is made public as a result of the development by the Research and Development Center for Large Language Models at the National Institute of Informatics (hereinafter referred to as "the Provider"). Users of the Program (hereinafter referred to as "Users") shall use the Program upon agreeing to the TOU.

	- Article 1 (License to Use)
	1. Users of the Program must apply for the use of the Program by a method separately specified in addition to the TOU and obtain individual permission from the Provider.
	2. Users may use the Program for commercial or non-commercial purposes in accordance with the TOU. Users are allowed to modify and duplicate the Program, but redistribution of the Program and/or the large-scale language model created by modifying the Program (hereinafter referred to as "Modified Works") is prohibited. Users may provide services using the Program or Modified Works, but such services must not allow third parties to access, download, or obtain the Program or Modified Works directly.
	3. Users who violate the TOU are not allowed to use the Program.

	- Article 2 (Responsibility)
	1. Users agree in advance that the Program is provided “AS IS”, and the Provider makes no warranties, express or implied, regarding the Program, including, but not limited to, its accuracy, completeness, up-to-dateness, and quality, and that the Provider shall not be liable for any damages arising from the use or inability to use the Program.
	2. Users shall compensate for any and all damages suffered by the Provider as a result of the use of the Program and/or the Users' violation of the TOU.
	3. Users shall use the Program at their own responsibility and discretion, and shall handle any disputes arising with third parties in relation to the use of the Program at their own responsibility and expense, and shall indemnify, defend and hold harmless the Provider against all damages and losses without causing any inconvenience to the Provider. Users shall deal with any damages caused by the use of the Program at their own responsibility.

	- Article 3 (Prohibited Actions)

	Users shall not engage in the following actions when using the Program.
	(1) Actions that will or may infringe on the intellectual property rights of the Provider or third parties;
	(2) Actions that will or may infringe on the property, privacy, or portrait rights of the Provider or third parties;
	(3) Actions that discriminate against, defame, insult, or slander the Provider or third parties, promote discrimination against others, or damage the reputation or credibility of others;
	(4) Actions that will or may cause inconvenience or harm to the Provider or third parties;
	(5) Actions that engage in unauthorized legal services and/or provide legal advice from anyone other than a qualified professional;
	(6) Actions that provide financial advice from anyone other than a qualified professional;
	(7) Medical actions, including providing health advice or suggesting treatment methods; and
	(8) Other actions that require permissions or other forms of authorization under laws and regulations.

	- Article 4 (Restrictions)
	1. Users acknowledge that the results of processing using the Program (hereinafter referred to as "Processing Results") may contain falsehoods, biases, content that infringes on the rights of others, or content that does not meet the effectiveness or usefulness expected by Users, and agree to use the Program on the premise that inaccurate or inappropriate Processing Results may cause damage or infringement of rights to Users or third parties and/or ethical concerns. Users shall use the Processing Results after confirming their accuracy, legality, and ethical validity themselves. If the use of the Program, including the Processing Results, by Users cause infringement of the rights of the Users themselves or third parties, the Provider shall not be responsible for any damages, and the Users shall indemnify, defend and hold harmless the Provider against all damages and losses without causing any inconvenience to the Provider.
	2. Users shall use the Processing Results in compliance with the regulations such as laws and regulations in each country and region.
	3. Users shall not use the Processing Results for the actions listed in Article 3 (Prohibited Actions).

	- Article 5 (Ownership of Rights)
	1. Except as expressly provided in the TOU, Users shall not acquire any rights in relation to the Program.
	2. Users will acquire rights newly arising from the creation of Modified Works of the Program, but Users shall use Modified Works in accordance with the TOU.
	3. The Provider shall not assert any rights to the Processing Results.

	- Article 6 (Export Transaction)
	Users shall obtain the necessary permissions themselves when exporting the Program and the Processing Results in relation to their use, where such export requires permissions under the Foreign Exchange and Foreign Trade Act (including related cabinet order and ministerial order) or U.S. export control laws and regulations.

	- Article 7 (Jurisdiction)
	The Tokyo District Court shall have exclusive jurisdiction in the court of the first instance over any disputes arising out of or in connection with the TOU.

	- Article 8 (Governing Law)
	The TOU is governed by and construed in accordance with the laws of Japan.

	- Article 9 (Other Provisions)
	The TOU sets forth the entire agreement as to all matters concerning the use of the Program between the Users and the Provider, and matters not provided for in the TOU shall be governed by the relevant laws and regulations.

	- Article 10 (Governing Language)
	The governing language of the TOU shall be Japanese. The English translation hereof is made for reference purpose only and shall have no effect.


	extra_gated_fields:
	  Name: text
	  Affiliation: text
	  I want to use this model for: text

	license: other
	license_name: llm-jp-3-172b-beta1-tou
	license_link: LICENSE
	language:
	- en
	- ja
	programming_language:
	- C
	- C++
	- C#
	- Go
	- Java
	- JavaScript
	- Lua
	- PHP
	- Python
	- Ruby
	- Rust
	- Scala
	- TypeScript
	library_name: transformers
	pipeline_tag: text-generation
	inference: false
	---
	# llm-jp-3-172b-beta1-instruct

	This repository provides large language models developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/).

	The development was partially supported by [GENIAC](https://www.meti.go.jp/policy/mono_info_service/geniac/index.html).

	| Model Variant |
	| :--- |
	| [llm-jp-3-172b-beta1](https://huggingface.co/llm-jp/llm-jp-3-172b-beta1) |
	| [llm-jp-3-172b-beta1-instruct](https://huggingface.co/llm-jp/llm-jp-3-172b-beta1-instruct) |


	Checkpoint format: Hugging Face Transformers


	## Required Libraries and Their Versions

	- torch>=2.3.0
	- transformers>=4.40.1
	- tokenizers>=0.19.1
	- accelerate>=0.29.3
	- flash-attn>=2.5.8
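
The minimum versions above can be checked before loading the model. The following is a minimal sketch, not part of the original instructions: the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are hypothetical, and the version comparison is deliberately simplified (it looks only at the numeric release part and ignores pre-release and local suffixes). For strict comparisons, a dedicated parser such as `packaging.version` is the safer choice.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}


def release_tuple(version: str) -> tuple:
    """Turn a version string such as '2.3.0+cu121' into (2, 3, 0).

    Simplification (assumption): only the leading digits of each dotted
    component are used, and the result is padded to three components.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment() -> dict:
    """Map each required package to 'ok', 'missing', or 'too old (<version>)'."""
    report = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Running the script prints one status line per package, so a missing or outdated dependency shows up before the checkpoint download starts.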

	## Usage

	```python
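	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load the tokenizer and the model; device_map="auto" spreads the weights
	# across the available devices, and bfloat16 halves memory versus float32.
	tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-172b-beta1-instruct")
	model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-3-172b-beta1-instruct", device_map="auto", torch_dtype=torch.bfloat16)

	chat = [
	    # System prompt: "Below is an instruction that describes a task.
	    # Write a response that appropriately satisfies the request."
	    {"role": "system", "content": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"},
	    # User prompt: "What is natural language processing?"
	    {"role": "user", "content": "自然言語処理とは何か"},
	]

	# Render the chat with the model's chat template and tokenize it.
	tokenized_input = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_tensors="pt").to(model.device)

	# Sample a continuation without tracking gradients.
	with torch.no_grad():
	    output = model.generate(
	        tokenized_input,
	        max_new_tokens=100,
	        do_sample=True,
	        top_p=0.95,
	        temperature=0.7,
	        repetition_penalty=1.05,
	    )[0]

	# The decoded text contains the prompt followed by the generated tokens.
	print(tokenizer.decode(output))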
	```
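
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the snippet above prints the prompt as well. The sketch below shows one way to keep only the new tokens. It uses plain Python lists and made-up token ids so that it runs without downloading the model; the helper name `split_prompt_and_completion` is hypothetical.

```python
def split_prompt_and_completion(output_ids, prompt_length):
    """Split a generated sequence into (prompt_ids, new_ids).

    output_ids: the full sequence returned by generation (prompt + new tokens).
    prompt_length: the number of tokens that were fed in as the prompt.
    """
    if not 0 <= prompt_length <= len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[:prompt_length], output_ids[prompt_length:]


# Toy ids for illustration only; they are not real vocabulary ids.
prompt_ids = [1, 20, 21, 22]
output_ids = prompt_ids + [300, 301, 2]

_, new_ids = split_prompt_and_completion(output_ids, len(prompt_ids))
print(new_ids)  # [300, 301, 2]
```

With the tensors from the usage example, the same slicing would read `tokenizer.decode(output[tokenized_input.shape[-1]:], skip_special_tokens=True)`.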


	## Model Details

	- Model type: Transformer-based Language Model
	- Total seen tokens: 700B

	|Params|Layers|Hidden size|Heads|Context length|
	|:---:|:---:|:---:|:---:|:---:|
	|172b|96|12288|96|4096|
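
As a rough sanity check on the table above, the sketch below derives the per-head dimension, a back-of-the-envelope parameter count, and the memory needed just to hold the weights in bfloat16. The `12 * layers * hidden**2` rule is a generic approximation for a standard Transformer block (attention plus a 4x-wide feed-forward layer, embeddings ignored). It is an assumption, not the documented architecture of this model, so the result is only expected to land near the nominal 172B.

```python
# Values taken from the Model Details table.
LAYERS = 96
HIDDEN_SIZE = 12288
HEADS = 96
NOMINAL_PARAMS = 172e9  # "172b" as stated in the table


def approx_transformer_params(layers: int, hidden: int) -> int:
    """Rule-of-thumb parameter count: 4*d^2 for attention plus 8*d^2 for a
    4x-wide feed-forward layer, per block. Embeddings, biases, and norms are
    ignored (assumption: generic Transformer, not this model's exact layout).
    """
    return 12 * layers * hidden ** 2


def weight_memory_gib(params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (2 bytes per parameter for bfloat16)."""
    return params * bytes_per_param / 2 ** 30


head_dim = HIDDEN_SIZE // HEADS
estimate = approx_transformer_params(LAYERS, HIDDEN_SIZE)

print(f"head dimension: {head_dim}")                          # 128
print(f"estimated parameters: {estimate / 1e9:.1f}B")         # 173.9B
print(f"bf16 weights: {weight_memory_gib(NOMINAL_PARAMS):.1f} GiB")  # 320.4 GiB
```

The weight figure excludes activations and the key-value cache, so actual memory use during generation is higher.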

	## Tokenizer

	The tokenizer of this model is based on the Unigram byte-fallback model of [huggingface/tokenizers](https://github.com/huggingface/tokenizers).
	The vocabulary entries were converted from [`llm-jp-tokenizer v3.0`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v3.0b2).
	Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).
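
To illustrate what "byte-fallback" means, here is a toy sketch. It is not the actual tokenizer: it skips Unigram segmentation entirely and works character by character with a tiny hypothetical vocabulary. It only shows the fallback idea, where a character missing from the vocabulary is emitted as its UTF-8 bytes, written here in the SentencePiece-style `<0xNN>` form, so no input ever maps to an unknown token.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy encoder: known characters become tokens, unknown characters are
    replaced by one <0xNN> token per UTF-8 byte."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{byte:02X}>" for byte in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Inverse of the toy encoder: byte tokens are merged back into UTF-8."""
    buffer = bytearray()
    for token in tokens:
        if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
            buffer.append(int(token[3:5], 16))
        else:
            buffer.extend(token.encode("utf-8"))
    return buffer.decode("utf-8")


toy_vocab = {"N", "L", "P"}  # hypothetical vocabulary for illustration
tokens = encode_with_byte_fallback("NLPあ", toy_vocab)
print(tokens)  # ['N', 'L', 'P', '<0xE3>', '<0x81>', '<0x82>']
print(decode_with_byte_fallback(tokens))  # NLPあ
```

The round trip is lossless because every character has a UTF-8 encoding, which is the property byte-fallback relies on.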

	## Datasets

	### Pre-training

	The models have been pre-trained using a blend of the following datasets.

	| Language | Dataset | Tokens |
	|:---|:---|---:|
	|Japanese|[Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|2.6B|
	||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|762.8B|
	||[WARP/PDF](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|237.3B|
	||[WARP/HTML](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|2.7B|
	||[Kaken](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|1.8B|
	|English|[Wikipedia](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|4.7B|
	||[Dolma/CC-head](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|608.5B|
	||[Dolma/C4](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|181.6B|
	||[Dolma/Reddit](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|83.1B|
	||[Dolma/PeS2o](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|62.9B|
	||[Dolma/Gutenberg](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|5.5B|
	||[Dolma/Wiki](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3)|3.9B|
	|Code|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|114.1B|
	|Chinese|[Wikipedia](https://huggingface.co/datasets/bigcode/the-stack)|0.8B|
	|Korean|[Wikipedia](https://huggingface.co/datasets/bigcode/the-stack)|0.3B|
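
The per-language totals implied by the table can be computed with a short script. This is only arithmetic over the figures listed above (in billions of tokens); the names `PRETRAINING_TOKENS_B`, `totals_by_language`, and `language_shares` are made up for the example, and the shares describe the listed corpus sizes, not necessarily the sampling ratio used during training.

```python
# Token counts in billions, copied from the pre-training table.
PRETRAINING_TOKENS_B = {
    "Japanese": {
        "Wikipedia": 2.6,
        "Common Crawl": 762.8,
        "WARP/PDF": 237.3,
        "WARP/HTML": 2.7,
        "Kaken": 1.8,
    },
    "English": {
        "Wikipedia": 4.7,
        "Dolma/CC-head": 608.5,
        "Dolma/C4": 181.6,
        "Dolma/Reddit": 83.1,
        "Dolma/PeS2o": 62.9,
        "Dolma/Gutenberg": 5.5,
        "Dolma/Wiki": 3.9,
    },
    "Code": {"The Stack": 114.1},
    "Chinese": {"Wikipedia": 0.8},
    "Korean": {"Wikipedia": 0.3},
}


def totals_by_language(table):
    """Sum the token counts per language, rounded to 0.1B."""
    return {lang: round(sum(sets.values()), 1) for lang, sets in table.items()}


def language_shares(totals):
    """Percentage of the listed tokens contributed by each language."""
    grand_total = sum(totals.values())
    return {lang: round(100 * value / grand_total, 1) for lang, value in totals.items()}


totals = totals_by_language(PRETRAINING_TOKENS_B)
grand_total = round(sum(totals.values()), 1)

print(totals["Japanese"])  # 1007.2
print(totals["English"])   # 950.2
print(grand_total)         # 2072.6
print(language_shares(totals))
```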

	### Instruction tuning

	The models have been fine-tuned on the following datasets.

	| Language | Dataset | Description |
	|:---|:---|:---|
	|Japanese|[ichikara-instruction-004-002](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
	| |[answer-carefully-001](https://liat-aip.sakura.ne.jp/wp/answercarefully-dataset/)| A manually constructed Japanese instruction dataset focusing on LLM safety |
	| |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL |
	| |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
	| |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
	| |aya-dataset-ja| A Japanese subset of [aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) |
	| |ichikara-instruction-format| A small instruction dataset derived from ichikara-instruction, with some constraints on the output format |
	|English |[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | - |
	| |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
	| |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
	| |[Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater)| - |
	| |[FLAN](https://huggingface.co/datasets/Open-Orca/FLAN) | A sampled subset was used |

	## Risks and Limitations

	The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.


	## Send Questions to

	llm-jp(at)nii.ac.jp


	## License

	See the [LICENSE](LICENSE) file.


	## Model Card Authors

	The names are listed in alphabetical order.

	Hirokazu Kiyomaru and Takashi Kodama.