Post
Massive data release on the HF Hub for 75 languages!
https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2
In December of last year, HPLT (https://hplt-project.org/) released version 1.2 of their dataset. It covers web-crawled data in 75 languages, available in raw form as well as in deduplicated and cleaned versions. In total, we're talking about over 40TB of data! This data was already accessible via their website, but I figured its accessibility could be improved by an integration with Hugging Face tooling. 🤗 So I added the dataset to the Hugging Face Hub, enabling direct use in your conventional training pipelines for LLMs or other language technologies. The data will automatically be downloaded and optimised with just one line of code:
from datasets import load_dataset

ds = load_dataset("BramVanroy/hplt_monolingual_v1_2", "nl_cleaned")
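Given the dataset's size, you may not want to download a whole language up front. Here is a minimal sketch of lazy iteration using the `datasets` library's streaming mode; the `config_name` helper and the "<lang>_<variant>" naming pattern are assumptions extrapolated from the "nl_cleaned" example above, not part of the official dataset card.

```python
# Sketch: stream a few samples instead of downloading everything.
from itertools import islice


def config_name(lang: str, variant: str = "cleaned") -> str:
    """Build a config name such as 'nl_cleaned' (pattern assumed from the post)."""
    return f"{lang}_{variant}"


if __name__ == "__main__":
    from datasets import load_dataset

    ds = load_dataset(
        "BramVanroy/hplt_monolingual_v1_2",
        config_name("nl"),
        split="train",
        streaming=True,  # iterate lazily; nothing is downloaded up front
    )
    # Peek at the first three documents without materialising the dataset.
    for sample in islice(ds, 3):
        print(sample)
```

With `streaming=True`, `load_dataset` returns an iterable dataset that fetches shards on the fly, which is handy for inspecting or subsampling very large corpora.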
Let's use this big blob of data to build something awesome in our languages! 🥳