arXiv:2407.03958

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Published on Jul 4, 2024

Abstract

Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works are limited in two respects: (1) they model image-sharing behavior only within single sessions, which restricts long-term social interaction, and (2) they lack personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas, time intervals, and images in a multi-modal format. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, which generates long-term multi-modal dialogue distilled from ChatGPT together with our proposed Plan-and-Execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset through human evaluation. We make our source code and dataset publicly available.
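Since the dataset is released publicly and several Hugging Face datasets reference this paper, a minimal loading sketch with the `datasets` library is shown below. The repository id and field names are hypothetical placeholders, not confirmed by the paper; substitute the actual dataset id listed under "Datasets citing this paper".

```python
# Minimal sketch: loading a Stark-style long-term multi-modal dialogue dataset
# from the Hugging Face Hub with the `datasets` library.
# NOTE: "example-org/stark" is a hypothetical placeholder repository id;
# replace it with the actual dataset id linked to this paper.
from datasets import load_dataset

dataset = load_dataset("example-org/stark", split="train")

# Each example is assumed to hold one long-term conversation episode:
# a social persona, multiple sessions separated by time intervals, and
# image-sharing turns. Field names are illustrative, not confirmed.
episode = dataset[0]
print(episode.keys())
```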

Models citing this paper: 4
Datasets citing this paper: 6
Spaces citing this paper: 0
Collections including this paper: 4