LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance Paper • 2408.08189 • Published Aug 15 • 14
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation Paper • 2407.15060 • Published Jul 21 • 9
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information Paper • 2402.13616 • Published Feb 21 • 45
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21 • 111
MusicRL: Aligning Music Generation to Human Preferences Paper • 2402.04229 • Published Feb 6 • 16
EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision Paper • 2311.02077 • Published Nov 3, 2023 • 14
Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects Paper • 2211.02247 • Published Nov 4, 2022 • 2
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models Paper • 2310.08491 • Published Oct 12, 2023 • 53
How FaR Are Large Language Models From Agents with Theory-of-Mind? Paper • 2310.03051 • Published Oct 4, 2023 • 34
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation Paper • 2309.16429 • Published Sep 28, 2023 • 10
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition Paper • 2309.15223 • Published Sep 26, 2023 • 19
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models Paper • 2309.15103 • Published Sep 26, 2023 • 42
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer Paper • 2308.06873 • Published Aug 14, 2023 • 25
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory Paper • 2308.08089 • Published Aug 16, 2023 • 21