MagicTailor:
Component-Controllable Personalization
in Text-to-Image Diffusion Models

arXiv Preprint

CUHK · SIAT, CAS · NUS · Zhejiang Lab · Shanghai AI Lab
[Figure: teaser]

We present MagicTailor to enable component-controllable personalization, a newly formulated task aiming to reconfigure specific components of concepts during personalization.

Abstract

Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of individual components within the concept.

In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure specific components when personalizing visual concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of the concept and component.

To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for desired visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.

Component-Controllable Personalization

[Figure: component-controllable personalization]

(a) Illustration of personalization, demonstrating how text-to-image (T2I) diffusion models can learn and reproduce a visual concept from given reference images.

(b) Illustration of component-controllable personalization, depicting a newly formulated task that aims to modify a specific component of a visual concept during personalization.

(c) Example images generated by MagicTailor, showcasing the effectiveness of the proposed MagicTailor, a novel framework that adapts T2I diffusion models for component-controllable personalization.

Challenges

[Figure: challenges]

(a) Semantic pollution: (i) Undesired visual elements may inadvertently disturb the personalized concept. (ii) A simple mask-out strategy is ineffective and causes unintended compositions, whereas (iii) our DM-Deg effectively suppresses unwanted visual semantics, preventing such pollution.

(b) Semantic imbalance: (i) Simultaneously learning the concept and component can lead to imbalance, resulting in concept or component distortion (here we present a case for the former). (ii) Our DS-Bal ensures balanced learning, enhancing personalization performance.

MagicTailor Pipeline

[Figure: MagicTailor pipeline]

Using reference images as input, MagicTailor fine-tunes a T2I diffusion model with low-rank adaptation (LoRA) to learn both the target concept and component, enabling the generation of images that seamlessly integrate the component into the concept.
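
To make this setup concrete, here is a minimal sketch of LoRA fine-tuning for a T2I diffusion model with the standard denoising objective, written with the diffusers and peft libraries. It is an illustration only, not the official MagicTailor code; the base model, LoRA rank, and target modules are assumptions.

# Minimal sketch of the LoRA fine-tuning setup (illustrative assumptions,
# not the official MagicTailor code).
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Freeze everything, then attach trainable low-rank adapters to the
# U-Net's attention projections.
for m in (unet, vae, text_encoder):
    m.requires_grad_(False)
unet.add_adapter(LoraConfig(r=8, lora_alpha=8,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

def denoising_loss(images, token_ids):
    # Standard epsilon-prediction objective on the reference images.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    text_emb = text_encoder(token_ids)[0]
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)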

We introduce Dynamic Masked Degradation (DM-Deg), a novel technique for dynamically perturbing undesired visual semantics. This approach helps suppress the model's sensitivity to irrelevant visual details while preserving the overall visual context, thereby effectively mitigating semantic pollution.
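
The snippet below sketches the general idea of DM-Deg as described here, not the released implementation: random noise is injected only into regions outside the concept/component masks, with an intensity that decays over training so undesired semantics fade while the overall visual context is retained. The exponential schedule and the max_sigma and gamma hyperparameters are our assumptions.

# Hedged sketch of the DM-Deg idea (not the released implementation).
import math
import torch

def dm_deg(image, mask, step, total_steps, max_sigma=1.0, gamma=5.0):
    # image: (B, C, H, W) reference images scaled to [-1, 1]
    # mask:  (B, 1, H, W) binary mask, 1 = desired concept/component pixels
    # Assumed exponentially decaying noise intensity over training.
    sigma = max_sigma * math.exp(-gamma * step / total_steps)
    noise = torch.randn_like(image) * sigma
    # Perturb only the undesired (mask == 0) regions; desired pixels
    # pass through untouched so their semantics are learned cleanly.
    return image * mask + (image + noise).clamp(-1.0, 1.0) * (1.0 - mask)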

Moreover, we employ Dual-Stream Balancing (DS-Bal), a dual-stream learning paradigm that balances the learning of desired visual semantics, to tackle semantic imbalance. The online denoising U-Net performs sample-wise min-max optimization, while the momentum denoising U-Net applies selective preserving regularization, ensuring more faithful personalization.
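
As a rough illustration (again a sketch under stated assumptions, not the official code), the snippet below reuses the LoRA-adapted unet and denoising setup from the sketch above: the online U-Net is optimized on the sample with the largest denoising loss, while the remaining samples are pulled toward a frozen momentum (EMA) copy of the U-Net. The regularization weight lam and EMA rate m are hypothetical.

# Hedged sketch of the DS-Bal paradigm (not the official code);
# `unet` is the LoRA-adapted online U-Net from the sketch above.
import copy
import torch
import torch.nn.functional as F

momentum_unet = copy.deepcopy(unet).requires_grad_(False)  # momentum stream

def ds_bal_loss(noisy, t, text_emb, noise, lam=1.0):
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))
    hardest = per_sample.argmax()  # sample-wise min-max: train the hardest sample
    loss = per_sample[hardest]
    # Selective preserving regularization on the remaining samples:
    # keep their predictions close to the momentum U-Net's.
    keep = torch.arange(per_sample.numel(), device=pred.device) != hardest
    if keep.any():
        with torch.no_grad():
            ref = momentum_unet(noisy[keep], t[keep],
                                encoder_hidden_states=text_emb[keep]).sample
        loss = loss + lam * F.mse_loss(pred[keep], ref)
    return loss

@torch.no_grad()
def update_momentum(m=0.999):
    # EMA update of the momentum stream after each optimizer step.
    for p_m, p in zip(momentum_unet.parameters(), unet.parameters()):
        p_m.mul_(m).add_(p, alpha=1 - m)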

Qualitative Results

[Figures: qualitative results]

We present images generated by MagicTailor and state-of-the-art (SOTA) personalization methods across various domains. MagicTailor generally achieves promising text alignment, strong identity fidelity, and high generation quality.

Quantitative Results

[Figure: quantitative results]

We compare MagicTailor with SOTA personalization methods using automatic metrics (CLIP-T, CLIP-I, DINO, and DreamSim) and a user study (human preferences on text alignment, identity fidelity, and generation quality). The best results are marked in bold. MagicTailor achieves superior performance in this challenging task.
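
For reference, CLIP-T is commonly computed as the cosine similarity between CLIP embeddings of a generated image and its prompt. The minimal sketch below uses the transformers library; the paper's exact evaluation protocol may differ.

# Minimal sketch of a CLIP-T-style score (image-text cosine similarity).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()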

Further Applications

[Figures: further applications]

MagicTailor also has the potential to enable a variety of further applications:

(a) Decoupled generation. MagicTailor can also generate the target concept and component separately, enriching the range of possible combinations.

(b) Controlling multiple components. MagicTailor shows the potential to handle more than one component, highlighting its effectiveness.

(c) Enhancing other generative tools. MagicTailor can conveniently collaborate with a variety of generative tools that focus on other tasks, equipping them with an additional ability to control the concept’s component in their pipelines.

Please refer to our paper and code for more technical details. ❤️

BibTeX


@article{zhou2024magictailor,
  title={MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models},
  author={Zhou, Donghao and Huang, Jiancheng and Bai, Jinbin and Wang, Jiaze and Chen, Hao and Chen, Guangyong and Hu, Xiaowei and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2410.13370},
  year={2024}
}