Video-to-Audio Generation with Hidden Alignment
Paper • 2407.07464
Note Task: generating audio content that is semantically and temporally aligned with the video input. Method: the paper focuses on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. It proposes VTA-LDM (presumably "video-to-audio latent diffusion model"), described as simple yet surprisingly effective.
Note MJ-Bench is the first benchmark that incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias.
Note A model family designed to enhance natural voice interactions between humans and large language models (LLMs).