---
datasets:
- Locutusque/TM-DATA-V2
- LLM360/TxT360
- mlfoundations/dclm-baseline-1.0
- Skylion007/openwebtext
- JeanKaddour/minipile
language:
- en
license: apache-2.0
---

Still in training. Trained on roughly 17 billion tokens so far.