openwebtext dataset

after running prepare.py (the preprocessing step) we get:

train.bin is ~17GB, val.bin is ~8.5MB
train has ~9B tokens (9,035,582,198)
val has ~4M tokens (4,434,897)
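
as a sanity check, the file sizes and token counts above line up if each token is stored as a 2-byte integer. that storage width is an assumption (uint16 is enough for a GPT-2 style vocabulary of about 50k entries), it is not stated in this readme. a minimal sketch of the arithmetic:

```python
# Assumption: each token id is stored as a 2-byte unsigned integer (uint16).
# This readme does not state the dtype; 2 bytes is what makes the numbers agree.
BYTES_PER_TOKEN = 2

# Token counts quoted in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897


def expected_size_bytes(num_tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Size of a flat binary file holding num_tokens fixed-width token ids."""
    return num_tokens * bytes_per_token


train_bytes = expected_size_bytes(TRAIN_TOKENS)
val_bytes = expected_size_bytes(VAL_TOKENS)

# Report in binary units (GiB / MiB), which is what most file listings show.
print(f"train.bin: {train_bytes:,} bytes = {train_bytes / 2**30:.2f} GiB")
print(f"val.bin:   {val_bytes:,} bytes = {val_bytes / 2**20:.2f} MiB")
```

under that assumption train.bin comes out at about 16.83 GiB and val.bin at about 8.46 MiB, which matches the ~17GB and ~8.5MB quoted above when those are read as binary units.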

this came from 8,013,769 documents in total.
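
the same numbers give a rough feel for document length and split ratio. this sketch assumes the document count covers both the train and val splits, and that the token counts include any separator tokens the preprocessing may have added per document; neither detail is spelled out here.

```python
# Figures quoted in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
TOTAL_DOCUMENTS = 8_013_769  # assumed to cover train + val together

total_tokens = TRAIN_TOKENS + VAL_TOKENS

# Average tokens per document across the whole corpus.
avg_tokens_per_doc = total_tokens / TOTAL_DOCUMENTS

# Share of all tokens that ended up in the validation split.
val_fraction = VAL_TOKENS / total_tokens

print(f"total tokens:       {total_tokens:,}")
print(f"avg tokens per doc: {avg_tokens_per_doc:.2f}")
print(f"val share:          {val_fraction:.4%}")
```

this works out to roughly 1,128 tokens per document on average, with the validation split holding about 0.05% of all tokens.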

references:

OpenAI's WebText dataset is discussed in the GPT-2 paper
OpenWebText dataset