
## openwebtext dataset

after running `prepare.py` (preprocess) we get:

- `train.bin` is ~17GB, `val.bin` is ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)

this came from 8,013,769 documents in total.
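
as a sanity check, the file sizes above can be compared against the token counts. The sketch below assumes each token is stored as a 2-byte unsigned integer (uint16), which is not stated in this readme but is a common choice when the vocabulary has fewer than 65,536 entries. The constant and function names are illustrative, not taken from `prepare.py`.

```python
# Rough consistency check between the token counts and the .bin file sizes.
# ASSUMPTION: tokens are stored as uint16 (2 bytes each); the readme does not
# state the on-disk dtype, so adjust BYTES_PER_TOKEN if yours differs.
BYTES_PER_TOKEN = 2

# Counts copied from the list above.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
TOTAL_DOCS = 8_013_769


def expected_bytes(num_tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Size in bytes of a flat token file with no header."""
    return num_tokens * bytes_per_token


train_gib = expected_bytes(TRAIN_TOKENS) / 2**30  # binary gigabytes
val_mib = expected_bytes(VAL_TOKENS) / 2**20      # binary megabytes
avg_tokens_per_doc = (TRAIN_TOKENS + VAL_TOKENS) / TOTAL_DOCS

print(f"train.bin: {train_gib:.2f} GiB")
print(f"val.bin:   {val_mib:.2f} MiB")
print(f"average tokens per document: {avg_tokens_per_doc:.0f}")
```

under that assumption the numbers line up with the sizes listed above when "GB" and "MB" are read as binary units (GiB and MiB), and each document contributes a bit over a thousand tokens on average.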

references:

- OpenAI's WebText dataset is discussed in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset