Releasing the largest multilingual open pretraining dataset
from datasets import load_dataset

# Stream both document datasets from the Hugging Face Hub instead of downloading them in full
pdfa_dataset = load_dataset('pixparse/pdfa-eng-wds', streaming=True)
IDL_dataset = load_dataset('pixparse/idl-wds', streaming=True)
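As a quick sanity check, one can peek at the first streamed record. This is a minimal sketch that assumes a 'train' split; the exact field names depend on how the WebDataset shards are laid out.
# Minimal sketch: inspect the first streamed sample (assumes a 'train' split;
# the actual keys depend on the shard layout)
first_sample = next(iter(pdfa_dataset['train']))
print(first_sample.keys())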
import chug

# Task configuration: read every page of each document
task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
# Data configuration: stream the dataset through the Hugging Face datasets backend ('hfids'),
# unbatched, in the main process
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
# Pull one preprocessed sample from the loader
sample = next(iter(data_loader))
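What exactly the loader yields (page images, OCR text, annotations) is determined by the task configuration, so a quick, hedged way to find out is to inspect the sample directly:
# Hedged inspection: the field names are defined by DataTaskDocReadCfg, so list them
if isinstance(sample, dict):
    print(sample.keys())
else:
    print(type(sample))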
Congratulations! With all the US/EU big players being more secretive than ever, you're not just releasing good models, you're making an incredible contribution to open research.
And I slightly disagree on one point: Qwen-500m is SOTA. I never thought it would be possible to get results like this from such a small multilingual model on RAG tasks in French.
from PIL import Image
from transformers import pipeline

# SlimSAM automatic mask generation on a local image
generator = pipeline(model="nielsr/slimsam-50-uniform", task="mask-generation")
image = Image.open("example.png").convert("RGB")  # any RGB image
outputs = generator(image)
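The pipeline returns the predicted segmentation masks together with their scores; as a small, hedged follow-up (field names as exposed by the transformers mask-generation pipeline), the masks can be counted like this:
# Each entry in outputs["masks"] is one binary segmentation mask for the image
print(f"Generated {len(outputs['masks'])} masks")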