datasets - a zzfive Collection

zzfive 's Collections

3d

image

LLMs

video

agent

cv

audio

robot

datasets

updated about 21 hours ago

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13 • 17
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24 • 13
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20 • 13
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17 • 30
DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published Jun 17 • 49
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Paper • 2406.11833 • Published Jun 17 • 61
From Pixels to Prose: A Large Dataset of Dense Image Captions

Paper • 2406.10328 • Published Jun 14 • 17
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Paper • 2406.11271 • Published Jun 17 • 20
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Paper • 2406.13735 • Published Jun 19 • 5
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Paper • 2406.14599 • Published Jun 20 • 16
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28 • 95
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published Jun 25 • 7
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Paper • 2407.02371 • Published Jul 2 • 49
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Paper • 2406.19380 • Published Jun 27 • 47
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Paper • 2407.03958 • Published Jul 4 • 18
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Paper • 2407.06358 • Published Jul 8 • 18
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Paper • 2407.10957 • Published Jul 15 • 23
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Paper • 2407.11144 • Published Jul 15 • 7
Visual Text Generation in the Wild

Paper • 2407.14138 • Published Jul 19 • 8
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

Paper • 2407.19795 • Published Jul 29 • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

Paper • 2408.00205 • Published Aug 1 • 4
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Paper • 2408.02629 • Published Aug 5 • 13
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Paper • 2408.02900 • Published Aug 6 • 25
Diffusion Models as Data Mining Tools

Paper • 2408.02752 • Published Jul 20 • 13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Paper • 2408.03900 • Published Aug 7 • 9
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Paper • 2408.04594 • Published Aug 8 • 14
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads

Paper • 2407.18245 • Published Jul 25 • 8
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays

Paper • 2408.06281 • Published Aug 12 • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning

Paper • 2408.07089 • Published Aug 9 • 13
DeepSpeak Dataset v1.0

Paper • 2408.05366 • Published Aug 9 • 11
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Paper • 2408.08441 • Published Aug 15 • 7
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

Paper • 2408.16176 • Published Aug 28 • 7
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution

Paper • 2408.15993 • Published Aug 28 • 7
Kvasir-VQA: A Text-Image Pair GI Tract Dataset

Paper • 2409.01437 • Published Sep 2 • 70
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

Paper • 2409.00447 • Published Aug 31 • 2
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Paper • 2407.17438 • Published Jul 24 • 23
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published 16 days ago • 25
Improving the detection of technical debt in Java source code with an enriched dataset

Paper • 2411.05457 • Published 14 days ago • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

Paper • 2411.05830 • Published 16 days ago • 20
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Paper • 2411.07461 • Published 10 days ago • 21
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Paper • 2411.08380 • Published 9 days ago • 24
RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published 3 days ago • 36