DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model

1IIIT Guwahati, 2Hugging Face, 3National Yang Ming Chiao Tung University, 4IBM Research
*Indicates Equal Contribution

Unraveling Textual Descriptions into Artistic Creations


Abstract

In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements. While the low-rank adaptation (LoRA) module within DreamBooth offers a reduction in trainable parameters, it introduces a pronounced sensitivity to hyperparameters, leading to a compromise between parameter efficiency and the quality of T2I personalized image synthesis. Addressing these constraints, we introduce DiffuseKronA, a novel Kronecker product-based adaptation module that not only significantly reduces the parameter count by up to 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively, but also enhances the quality of image synthesis. Crucially, DiffuseKronA mitigates the issue of hyperparameter sensitivity, delivering consistent high-quality generations across a wide range of hyperparameters, thereby diminishing the necessity for extensive fine-tuning. Evaluated against diverse and complex input images and text prompts, DiffuseKronA consistently outperforms existing models, producing diverse images of higher quality with improved fidelity and a more accurate color distribution of objects, thus presenting a substantial advancement in the field of T2I generative modeling.


From Text to Your Dream Images: An Overview of an Efficient Diffusion Model


The main idea of DiffuseKronA is to leverage the Kronecker product to parameterize the updates to the attention-layer weight matrices of the UNet. The Kronecker product is a matrix operation that captures structured relationships and pairwise interactions between the elements of two matrices, as follows:
$$ A \otimes B=\left[\begin{array}{ccc} a_{1,1} B & \cdots & a_{1,a_2} B \\ \vdots & \ddots & \vdots \\ a_{a_1, 1} B & \cdots & a_{a_1, a_2} B \end{array}\right]$$
In contrast to the low-rank decomposition in LoRA, the Kronecker adapter in DiffuseKronA offers a higher-rank approximation with a lower parameter count and greater flexibility, where \(A\) and \(B\) are the trainable Kronecker factors and \(\otimes\) denotes the Kronecker product:
$$W_{\text{fine-tuned}} = W_{\text{pre-trained}} + \Delta W, \qquad \Delta W = A \otimes B$$
The Kronecker adapter also reduces computational cost by applying the update through the equivalent matrix-vector multiplication \((A \otimes B)\,x = \gamma\!\left(B\,\eta_{b_2 \times a_2}(x)\,A^{\top}\right)\), where \(\eta_{b_2 \times a_2}\) reshapes the vector \(x\) into a \(b_2 \times a_2\) matrix, \(\gamma\) vectorizes the result back into a vector, and \(\top\) denotes the transpose.
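To make the adapter concrete, here is a minimal PyTorch sketch of a Kronecker adapter wrapped around a frozen linear layer. It is an illustrative reading of the equations above rather than the authors' released implementation; the class name, factor shapes, and initialization scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KroneckerAdapter(nn.Module):
    """Illustrative Kronecker adapter: W_fine-tuned = W_pre-trained + A ⊗ B.

    The frozen weight has shape (a1*b1, a2*b2); the trainable factors are
    A with shape (a1, a2) and B with shape (b1, b2), so kron(A, B) matches it.
    """

    def __init__(self, base_linear: nn.Linear, a1: int, a2: int):
        super().__init__()
        out_features, in_features = base_linear.weight.shape
        assert out_features % a1 == 0 and in_features % a2 == 0
        b1, b2 = out_features // a1, in_features // a2

        self.base = base_linear
        self.base.weight.requires_grad_(False)              # W_pre-trained stays frozen
        self.A = nn.Parameter(torch.zeros(a1, a2))          # zero init so ΔW starts at 0
        self.B = nn.Parameter(torch.randn(b1, b2) * 1e-3)   # small random init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = torch.kron(self.A, self.B)                 # ΔW = A ⊗ B
        return self.base(x) + F.linear(x, delta_w)           # (W + ΔW) x, bias kept in base
```

In practice the identity \((A \otimes B)\,x = \gamma\!\left(B\,\eta_{b_2 \times a_2}(x)\,A^{\top}\right)\) can be used to avoid materializing \(\Delta W\); the explicit `torch.kron` above is kept only for readability.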

Unlocking the Optimal Configurations

In our research, we set out to answer the following questions to refine our model (a hypothetical configuration sketch covering these knobs follows the list):
  1. What is the ideal number of Kronecker factors?
  2. What is the optimal number of training steps?
  3. What is the optimal learning rate?
  4. What are the most effective modules for fine-tuning the model?
  5. What is the impact of the number of training images?
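As a rough illustration of what these knobs look like in practice, the sketch below gathers them into a single fine-tuning configuration. All values are placeholders rather than the paper's reported optima, and the module names assume the attention projection naming used in diffusers-style UNets.

```python
from dataclasses import dataclass, field

@dataclass
class DiffuseKronAConfig:
    """Hypothetical fine-tuning configuration covering the knobs listed above.
    Every value is an illustrative placeholder, not a reported optimum."""
    kron_factor_shape: tuple = (64, 64)     # (a1, a2): shape of Kronecker factor A
    max_train_steps: int = 500              # number of fine-tuning steps
    learning_rate: float = 5e-4             # optimizer learning rate
    target_modules: list = field(
        default_factory=lambda: ["to_q", "to_k", "to_v", "to_out.0"]
    )                                       # attention projections adapted in the UNet
    num_train_images: int = 4               # number of subject images used for training
```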

Superior Fidelity and Colour Distribution

Our approach consistently produces images of superior fidelity compared to LoRA-DreamBooth, as illustrated below. Notably, the \(\textit{clock}\) generated by our method faithfully reproduces intricate details, such as the exact depiction of the numeral \(3\), mirroring the original image, whereas the output from LoRA-DreamBooth struggles to achieve such high fidelity. Additionally, our method demonstrates improved color distribution in the generated images, a feature clearly evident in the \(\textit{RC Car}\) images below. LoRA-DreamBooth, moreover, struggles to maintain fidelity to the numeral \(1\) on the chest of the sitting toy.

Text Alignment

Our method comprehends the intricacies and complexities of text prompts provided as input, producing images that align with the given text prompts, as depicted below. The generated image of the \(\textit{character}\) in response to the prompt exemplifies the meticulous attention our method pays to detail. It elegantly captures the presence of a shop in the background, a bowl with noodles in front of the character, and accompanying soup bowls. In contrast, LoRA-DreamBooth struggles to generate an image that aligns seamlessly with the complex input prompt. Our method not only generates images that align with text but is also proficient in producing a diverse range of images for a given input.

Superior Stability

DiffuseKronA produces images that closely align with the input images across a wide range of learning rates optimized for our approach. In contrast, LoRA-DreamBooth fails to preserve the characteristics of the input images even within its own optimal range. Here, the optimal learning rates for both methods are determined through extensive experimentation, and observations made while fine-tuning LoRA-DreamBooth are reflected in the figure below. The images of the dog generated by our method maintain a high degree of similarity to the input images throughout the optimal range, while LoRA-DreamBooth struggles to perform at a comparable level.

One-shot Image Generation

Even from a single training image, the generated images are clear, well-drawn, and closely match their text prompts. For instance, in the figure below, the image for \(\textit{A [V] logo}\) is a yellow smiley face with hands. The \(\textit{made as a coin}\) prompt results in a grey ghost with a white border, demonstrating the model's ability to incorporate abstract concepts. The \(\textit{futuristic neon glow}\) and \(\textit{made with watercolours}\) prompts yield a pink and a yellow octopus respectively, showcasing the model's versatility in applying different artistic styles. Its ability to generate a guitar-playing octopus on a grey notebook from the prompt \(\textit{sticker on a notebook}\) further illustrates these capabilities.

How Efficient is DiffuseKronA?

Owing to the structure of the Kronecker adapter, DiffuseKronA achieves a substantial reduction in trainable parameters while still generating high-fidelity images, a combination that LoRA layers do not match. The trainable parameter counts in the table below make this clear: DiffuseKronA is \(\sim 35\%\) more parameter-efficient than LoRA-DreamBooth. Furthermore, a reduced number of parameters results in a smaller fine-tuning module, lowering the overall storage requirements.
Comparison of LoRA-DreamBooth vs. DiffuseKronA.

Backbone | Method          | Train. time | # Trainable params | Module size
SDXL     | DiffuseKronA    | ~40 min     | 3.8 M              | 14.95 MB
SDXL     | LoRA-DreamBooth | ~38 min     | 5.8 M              | 22.32 MB
SD       | DiffuseKronA    | ~5.52 min   | 0.52 M             | 2.1 MB
SD       | LoRA-DreamBooth | ~5.3 min    | 1.09 M             | 4.3 MB
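As a back-of-the-envelope view of where these savings come from, the snippet below compares the trainable parameters of a LoRA update and a Kronecker update for a single attention weight matrix. The layer shape, LoRA rank, and Kronecker factor sizes are assumed for illustration and are not the exact configurations behind the table above.

```python
import torch

d_out, d_in = 1280, 1280   # assumed attention projection shape (illustrative only)

# LoRA: ΔW = U @ V with U of shape (d_out, r) and V of shape (r, d_in)
r = 4
lora_params = r * (d_out + d_in)                       # 10,240

# Kronecker adapter: ΔW = kron(A, B) with A of shape (a1, a2)
# and B of shape (d_out // a1, d_in // a2)
a1, a2 = 16, 16
kron_params = a1 * a2 + (d_out // a1) * (d_in // a2)   # 256 + 6,400 = 6,656

print(f"LoRA trainable params:      {lora_params}")
print(f"Kronecker trainable params: {kron_params}")    # ~35% fewer in this toy setting

# Both updates reconstruct a full (d_out, d_in) matrix.
full = torch.kron(torch.zeros(a1, a2), torch.zeros(d_out // a1, d_in // a2))
assert full.shape == (d_out, d_in)
```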

State-of-the-art Comparison

We compare DiffuseKronA with six related methods: \(\textbf{LoRA-DreamBooth}\), \(\textbf{DreamBooth}\), \(\textbf{LoRA-SVDiff}\), \(\textbf{SVDiff}\), \(\textbf{Custom Diffusion}\), and \(\textbf{Textual Inversion}\). We maintain the original settings of all these methods to ensure a fair comparison. As shown below, DiffuseKronA generates images that are highly aligned with the input images and consistently incorporates the features mentioned in the input text prompt. The better fidelity and in-depth understanding of input text prompts are attributed to the structure-preserving ability and greater expressiveness of Kronecker product-based adaptation. In contrast, the images generated by LoRA-DreamBooth often require extensive fine-tuning to achieve the desired results, and methods like Custom Diffusion require more parameters to fine-tune the model.

Final takeaways

  • Parameter Efficiency: A minimum 35% reduction in trainable parameters. By changing the Kronecker factors, we can achieve up to a 75% reduction with results comparable to LoRA-DreamBooth.
  • Enhanced Stability: Our method is more stable than LoRA-DreamBooth. Stability here refers to how much the generated images vary across different learning rates and Kronecker factors/ranks; the larger variation of LoRA-DreamBooth makes it harder to fine-tune.
  • Text Alignment and Fidelity: On average, DiffuseKronA captures subject semantics better and handles long, contextual prompts more faithfully.
  • Interpretability: DiffuseKronA leverages the Kronecker product to capture structured relationships in attention-weight matrices; its more controllable decomposition also makes it more interpretable.

  • All in all, DiffuseKronA outperforms LoRA-DreamBooth in terms of visual quality, text alignment, fidelity, parameter efficiency, and stability.

BibTeX


      @misc{marjit2023diffusekrona,
        title={DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model}, 
        author={Shyam Marjit and Harshit Singh and Nityanand Mathur and Sayak Paul and Chia-Mu Yu and Pin-Yu Chen},
        year={2023},
        eprint={},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }