Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published 3 days ago • 46
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling Paper • 2409.19291 • Published 8 days ago • 15
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published 11 days ago • 92
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction Paper • 2409.17422 • Published 10 days ago • 22
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions Paper • 2409.18042 • Published 10 days ago • 34
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models Paper • 2409.17481 • Published 10 days ago • 43
MonoFormer: One Transformer for Both Diffusion and Autoregression Paper • 2409.16280 • Published 12 days ago • 17
OmniBench: Towards The Future of Universal Omni-Language Models Paper • 2409.15272 • Published 13 days ago • 24
A Case Study of Web App Coding with OpenAI Reasoning Models Paper • 2409.13773 • Published 17 days ago • 4
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections Paper • 2409.14677 • Published 13 days ago • 14
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs Paper • 2409.14988 • Published 13 days ago • 21
Phantom of Latent for Large Language and Vision Models Paper • 2409.14713 • Published 13 days ago • 27
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions Paper • 2409.15278 • Published 13 days ago • 22
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization Paper • 2409.12903 • Published 17 days ago • 20
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published 17 days ago • 46
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published 17 days ago • 128
Ovis: Structural Embedding Alignment for Multimodal Large Language Model Paper • 2405.20797 • Published May 31 • 24
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published 18 days ago • 69
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval Paper • 2409.10516 • Published 20 days ago • 33
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection Paper • 2409.08513 • Published 23 days ago • 10
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers Paper • 2409.04109 • Published about 1 month ago • 41
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published 24 days ago • 42
Can Large Language Models Unlock Novel Scientific Research Ideas? Paper • 2409.06185 • Published 26 days ago • 10
LLMs Will Always Hallucinate, and We Need to Live With This Paper • 2409.05746 • Published 27 days ago • 2
LLaMA-Omni: Seamless Speech Interaction with Large Language Models Paper • 2409.06666 • Published 26 days ago • 54
1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit Paper • 2408.14267 • Published Aug 26 • 1
OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs Paper • 2409.05152 • Published 28 days ago • 29
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation Paper • 2409.04410 • Published 30 days ago • 23
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency Paper • 2409.02634 • Published Sep 4 • 85
Attention Heads of Large Language Models: A Survey Paper • 2409.03752 • Published about 1 month ago • 86
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture Paper • 2409.02889 • Published Sep 4 • 54
Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining Paper • 2409.02326 • Published Sep 3 • 16
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark Paper • 2409.02813 • Published Sep 4 • 27
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming Paper • 2408.16725 • Published Aug 29 • 51
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models Paper • 2404.02258 • Published Apr 2 • 104
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models Paper • 2409.00509 • Published Aug 31 • 38
PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning Paper • 2407.21571 • Published Jul 31 • 1
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models Paper • 2408.15518 • Published Aug 28 • 41
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29 • 56
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28 • 83
SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training Paper • 2407.06654 • Published Jul 9 • 1