MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
Abstract
Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of individual components within the concept. In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure specific components when personalizing visual concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of the concept and the component. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for desired visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.
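To make the DM-Deg idea from the abstract a bit more concrete, here is a minimal, hedged sketch of what "dynamically perturbing undesired visual semantics" could look like in PyTorch. The function name `dm_deg`, the use of a precomputed mask over undesired regions, the choice of additive Gaussian noise, and the linear decay schedule are all illustrative assumptions, not the paper's exact formulation.

```python
import torch

def dm_deg(image, undesired_mask, step, total_steps, max_sigma=1.0):
    """Illustrative sketch of Dynamic Masked Degradation (DM-Deg).

    Perturbs only the undesired regions of a reference image with Gaussian
    noise whose intensity is scheduled over training, so that the model
    gradually stops absorbing those unwanted visual semantics.

    image:          (C, H, W) tensor in [-1, 1]
    undesired_mask: (1, H, W) tensor, 1 where semantics are unwanted
    step:           current fine-tuning step
    total_steps:    total number of fine-tuning steps
    max_sigma:      assumed peak noise scale (hypothetical default)
    """
    # Assumed schedule: noise intensity decays linearly over training.
    sigma = max_sigma * (1.0 - step / total_steps)
    noise = torch.randn_like(image) * sigma
    # Degrade only the masked (undesired) region; keep the target region intact.
    return image * (1 - undesired_mask) + (image + noise) * undesired_mask
```

In a tuning-based personalization loop, the degraded reference image would then replace the raw reference image as the reconstruction target; see the paper and code repository for the actual mechanism.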
Community
We present MagicTailor to enable component-controllable personalization, a newly formulated task aiming to reconfigure specific components of concepts during personalization.
Page: https://correr-zhou.github.io/MagicTailor/
Paper: https://arxiv.org/pdf/2410.13370
Code: https://github.com/Correr-Zhou/MagicTailor
Why didn't you compare your results with InstantID, UniPortrait, etc.? Your table makes little sense, since the other methods were proposed a long time ago.
Hi, thanks for your comments. : )
- The methods you mentioned focus on the domain of human faces for the vanilla personalization task, which is quite different from our setting, so they cannot be adapted to our task for a meaningful comparison.
- Our method follows a widely adopted tuning-based paradigm, which is still considered a worthwhile technical solution. In light of this, we have compared our method with recent SOTA tuning-based methods, especially those capable of handling fine-grained visual elements, e.g., Break-A-Scene (SIGGRAPH Asia '23) and CLiC (CVPR '24).
It's always fascinating to see projects like this in 2D image generation/modification, since there's still so much to explore in this field. Can't wait to try your code!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder (2024)
- CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization (2024)
- Learning to Customize Text-to-Image Diffusion In Diverse Context (2024)
- TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation (2024)
- StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation (2024)