multimedia

xb-chang's Collections

updated Jul 22

Video-to-Audio Generation with Hidden Alignment

Paper • 2407.07464 • Published Jul 10 • 16

Note Task: generating audio content that is semantically and temporally aligned with the video input. Method: focuses on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Proposes VTA-LDM (video-to-audio latent diffusion model?), which is simple yet surprisingly effective.

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Paper • 2407.04842 • Published Jul 5 • 52

Note MJ-Bench is the first benchmark that incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias.

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Paper • 2407.04051 • Published Jul 4 • 35

Note A model family designed to enhance natural voice interactions between humans and large language models (LLMs).