Abstract
Segmenting an object in a video is challenging: every pixel must be labelled accurately, and those labels must remain consistent across frames. The task becomes harder still when the segmentation has arbitrary granularity, meaning the number of segments can vary arbitrarily and the masks are defined by only one or a few sample images. In this paper, we address this problem with a pre-trained text-to-image diffusion model supplemented by an additional tracking mechanism. We demonstrate that our approach handles a wide range of segmentation scenarios and outperforms state-of-the-art alternatives.
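The abstract only sketches the method at a high level (frozen text-to-image diffusion model plus a tracking mechanism), so the following is a minimal, hypothetical Python sketch of one plausible reading: per-frame features from a frozen backbone, few-shot segment prototypes built from the sample masks, and a simple temporal-consistency bias standing in for the tracking mechanism. Every name and design choice here (`extract_features`, prototype matching, the `temporal_weight` bias) is an illustrative assumption, not the paper's actual implementation.

```python
# Illustrative sketch only: the abstract does not specify the architecture.
# Assumed pipeline: frozen-backbone features per frame, few-shot prototypes
# from the sample masks, and a crude temporal bias as the "tracking" term.
import numpy as np


def extract_features(frame: np.ndarray, dim: int = 64) -> np.ndarray:
    """Stand-in for diffusion-backbone features, one vector per pixel.
    A real system would run the frame through a frozen pre-trained
    text-to-image diffusion model and read out intermediate activations."""
    h, w, _ = frame.shape
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2**32))
    return rng.standard_normal((h, w, dim)).astype(np.float32)


def build_prototypes(samples, dim=64):
    """Average features over each labelled segment in the sample images.
    Granularity follows the masks, so any number of segments is supported."""
    sums, counts = {}, {}
    for image, mask in samples:
        feats = extract_features(image, dim)
        for seg_id in np.unique(mask):
            region = feats[mask == seg_id]            # (N, dim) pixels of this segment
            sums[seg_id] = sums.get(seg_id, 0.0) + region.sum(axis=0)
            counts[seg_id] = counts.get(seg_id, 0) + len(region)
    ids = sorted(sums)
    protos = np.stack([sums[i] / counts[i] for i in ids])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8
    return ids, protos


def segment_video(frames, samples, dim=64, temporal_weight=0.3):
    """Label every pixel by cosine similarity to the prototypes, biased
    toward the previous frame's labels so masks stay consistent over time."""
    ids, protos = build_prototypes(samples, dim)
    prev, out = None, []
    for frame in frames:
        feats = extract_features(frame, dim)
        feats /= np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8
        scores = feats @ protos.T                     # (H, W, num_segments)
        if prev is not None:
            # Encourage per-pixel label persistence across frames.
            onehot = (prev[..., None] == np.array(ids)).astype(np.float32)
            scores = scores + temporal_weight * onehot
        labels = np.array(ids)[scores.argmax(axis=-1)]
        out.append(labels)
        prev = labels
    return out


# Example: two labelled sample images, then a short clip.
# samples = [(img0, mask0), (img1, mask1)]
# masks = segment_video(frames, samples)
```

The additive `temporal_weight` bias is only a toy stand-in; the paper's actual tracking mechanism is not described in the abstract.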
Community
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree (2024)
- Text4Seg: Reimagining Image Segmentation as Text Generation (2024)
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation (2024)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide (2024)
- High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity (2024)