Yuta Hayashibe committed
Commit: 3ce8a0e
1 parent: b7e22bb
Update README.md

README.md CHANGED
@@ -17,15 +17,21 @@ datasets:
 [megagonlabs/t5-base-japanese-web](https://huggingface.co/megagonlabs/t5-base-japanese-web) is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
 Training codes are [available on GitHub](https://github.com/megagonlabs/t5-japanese).
 
-###
+### Corpora
+
+We used following corpora for pre-training.
 
 - Japanese in [mC4/3.0.1](https://huggingface.co/datasets/mc4) (We used [Tensorflow native format](https://github.com/allenai/allennlp/discussions/5056))
+  - 87,425,304 pages
+  - 782 GB in TFRecord format
 - [Japanese](https://www.tensorflow.org/datasets/catalog/wiki40b#wiki40bja) in [wiki40b/1.3.0](https://www.tensorflow.org/datasets/catalog/wiki40b)
+  - 828,236 articles (2,073,584 examples)
+  - 2 GB in TFRecord format
 
 
 ### Tokenizer
 
-
+We used Japanese Wikipedia to train [SentencePiece](https://github.com/google/sentencepiece).
 
 - Vocabulary size: 32,000
 - [Byte-fallback](https://github.com/google/sentencepiece/releases/tag/v0.1.9): Enabled
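
The tokenizer settings listed in the README (a SentencePiece model trained on Japanese Wikipedia, vocabulary size 32,000, byte-fallback enabled) could be reproduced roughly as in the sketch below. This is not the authors' training script; that lives in the linked GitHub repository. The helper names `build_trainer_kwargs` and `train_tokenizer`, the corpus file name, and the `spm_ja_sketch` model prefix are hypothetical. Options the README does not mention, such as the model type and character coverage, are left at SentencePiece's defaults, which is an assumption.

```python
from pathlib import Path


def build_trainer_kwargs(corpus_path, model_prefix="spm_ja_sketch"):
    """Collect the SentencePiece options stated in the README.

    Only vocab_size and byte_fallback come from the README; the input path
    and model_prefix are illustrative placeholders.
    """
    return {
        "input": str(corpus_path),     # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # output files: <prefix>.model / <prefix>.vocab
        "vocab_size": 32000,           # "Vocabulary size: 32,000"
        "byte_fallback": True,         # "Byte-fallback: Enabled"
    }


def train_tokenizer(corpus_path, model_prefix="spm_ja_sketch"):
    """Train a SentencePiece model with the options above.

    sentencepiece is imported here rather than at module level so that the
    option-building helper can be used without the package installed.
    """
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))
    return Path(f"{model_prefix}.model")
```

Calling `train_tokenizer("ja_wiki.txt")` needs the `sentencepiece` package (v0.1.9 or later for byte-fallback) and a text file extracted from Japanese Wikipedia. How the authors extracted and preprocessed that text is not described in this commit.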