DiffuseKronA

A Parameter Efficient Fine-tuning Method
for Personalized Diffusion Models

1IIIT Guwahati, 2University of Maryland (College Park), 3Smallest.ai,
4Hugging Face, 5NYCU Taiwan, 6IBM Research
*Equal Contribution
WACV 2025

Abstract

In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements. While the low-rank adaptation (LoRA) module within DreamBooth offers a reduction in trainable parameters, it introduces a pronounced sensitivity to hyperparameters, leading to a compromise between parameter efficiency and the quality of T2I personalized image synthesis.

Addressing these constraints, we introduce DiffuseKronA, a novel Kronecker product-based adaptation module that not only significantly reduces the parameter count by up to 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively, but also enhances the quality of image synthesis. Crucially, DiffuseKronA mitigates the issue of hyperparameter sensitivity, delivering consistent high-quality generations across a wide range of hyperparameters, thereby diminishing the necessity for extensive fine-tuning.

Evaluated against diverse and complex input images and text prompts, DiffuseKronA consistently outperforms existing models, producing diverse images of higher quality with improved fidelity and a more accurate color distribution of objects, thus presenting a substantial advancement in the field of T2I generative modeling.


From Text to Your Dream Images: An Overview of an Efficient Diffusion Model


The main idea of DiffuseKronA is to leverage the Kronecker product to decompose the weight-update matrices of the attention layers in the UNet model. The Kronecker product is a matrix operation that captures structured relationships and pairwise interactions between the elements of two matrices, as follows:
$$ A \otimes B=\left[\begin{array}{ccc} a_{1,1} B & \cdots & a_{1,a_2} B \\ \vdots & \ddots & \vdots \\ a_{a_1, 1} B & \cdots & a_{a_1, a_2} B \end{array}\right]$$
In contrast to the low-rank decomposition in LoRA, the Kronecker Adapter in DiffuseKronA offers a higher-rank approximation with a lower parameter count and greater flexibility, such that \(W_{\text{pre-trained}}+\Delta W = W_{\text{pre-trained}} + A \otimes B\), where A and B are the Kronecker factors and ⊗ denotes the Kronecker product. The Kronecker Adapter reduces computational cost by using the following equivalent matrix-vector multiplication: \( (A \otimes B) x=\gamma\left(B\, \eta_{b_2 \times a_2}(x)\, A^{\top}\right)\), where \(\eta_{b_2 \times a_2}\) reshapes the vector \(x\) into a \(b_2 \times a_2\) matrix, \(\gamma\) vectorizes the result back into a vector, and \(\top\) denotes the transpose.
$$W_{\text{fine-tuned}}=W_{\text{pre-trained}}+\Delta W, \qquad \Delta W = A \otimes B$$
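The efficient matrix-vector identity above can be checked numerically. The sketch below uses NumPy with illustrative factor shapes (the dimensions are assumptions for demonstration, not the paper's actual adapter configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative factor shapes (assumptions, not the paper's settings):
# A is a1 x a2, B is b1 x b2, so Delta W = A ⊗ B is (a1*b1) x (a2*b2).
a1, a2, b1, b2 = 4, 8, 16, 8
A = rng.standard_normal((a1, a2))
B = rng.standard_normal((b1, b2))

x = rng.standard_normal(a2 * b2)

# Naive: materialize the full Kronecker product, then multiply.
y_naive = np.kron(A, B) @ x

# Efficient: (A ⊗ B) x = gamma(B eta_{b2 x a2}(x) A^T), where eta
# reshapes x into a b2 x a2 matrix (column-major) and gamma vectorizes
# the resulting b1 x a1 matrix back into a vector.
X = x.reshape((b2, a2), order="F")             # eta_{b2 x a2}(x)
y_fast = (B @ X @ A.T).reshape(-1, order="F")  # gamma(B eta(x) A^T)

assert np.allclose(y_naive, y_fast)
```

The efficient path never materializes the \((a_1 b_1) \times (a_2 b_2)\) matrix \(A \otimes B\), which is what makes the adapter cheap at both training and inference time.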

Unlocking the Optimal Configurations

In our research, we addressed the following questions to identify the optimal configuration of our model:

👻 DiffuseKronA vs. LoRA-DreamBooth

(1) Superior Fidelity

▷ Our approach consistently produces images of superior fidelity compared to LoRA-DreamBooth.
▷ Notably, the clock generated by our method faithfully reproduces intricate details, such as the exact depiction of the numeral 3, mirroring the original image. In contrast, the output from LoRA-DreamBooth struggles to achieve such high fidelity.
▷ Additionally, our method demonstrates improved color distribution in the generated images, a feature clearly evident in the RC Car image, whereas LoRA-DreamBooth struggles to maintain fidelity to the numeral 1 on the chest of the sitting toy.

(2) Text Alignment

▷ Our method comprehends the intricacies of complex input text prompts, producing images that align closely with them, as depicted below.
▷ The generated image of the character in response to the prompt exemplifies the meticulous attention our method pays to detail. It elegantly captures the presence of "a shop" in the background, "a bowl with noodles" in front of the character, and accompanying soup bowls.
▷ In contrast, LoRA-DreamBooth struggles to generate an image that aligns seamlessly with the complex input prompt.

(3) Superior Stability

▷ DiffuseKronA produces images that closely align with the input images across a wide range of learning rates, which are specifically optimized for our approach.
▷ In contrast, LoRA-DreamBooth fails to preserve the subject of the input images even within its optimal range. Here, optimal learning rates are determined through extensive experimentation.
▷ The generated images of the dog by our method maintain a high degree of similarity to the input images throughout its optimal range, while LoRA-DreamBooth struggles to perform at a comparable level.

(4) One-shot Image Generation

▷ The one-shot generated images are high-quality and accurately represent the text prompts. For instance, the image for the A [V] logo is a yellow smiley face with hands, and the "made as a coin" prompt results in a grey ghost with a white border, demonstrating the model's ability to incorporate abstract concepts.
▷ The "futuristic neon glow" and "made with watercolours" prompts resulted in a pink and a yellow octopus respectively, showcasing the model's versatility in applying different artistic styles.
▷ The model's ability to generate an image of a guitar-playing octopus on a grey notebook from the prompt "sticker on a notebook" is a testament to its advanced capabilities.

For more such generated images, please visit our gallery!


How Efficient is DiffuseKronA?

  • Owing to the compact structure of the Kronecker adapter, DiffuseKronA reduces the number of trainable parameters while still generating high-fidelity images, a balance that LoRA layers struggle to match.
  • The trainable parameter counts in the table below clearly reflect this: DiffuseKronA uses \(\sim 35\%\) fewer parameters than LoRA-DreamBooth.
  • Furthermore, a reduced number of parameters results in a smaller fine-tuning module size, consequently lowering the overall storage requirements.
Comparison of LoRA-DreamBooth vs. DiffuseKronA.

| Backbone | Model           | Train. Time | # Param | Model Size |
|----------|-----------------|-------------|---------|------------|
| SDXL     | DiffuseKronA    | ~40 min     | 3.8 M   | 14.95 MB   |
| SDXL     | LoRA-DreamBooth | ~38 min     | 5.8 M   | 22.32 MB   |
| SD       | DiffuseKronA    | ~5.52 min   | 0.52 M  | 2.1 MB     |
| SD       | LoRA-DreamBooth | ~5.3 min    | 1.09 M  | 4.3 MB     |
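To see where the parameter savings come from, consider a back-of-the-envelope comparison for a single weight update of shape \(d_{\text{out}} \times d_{\text{in}}\). The dimensions, LoRA rank, and Kronecker factor shapes below are illustrative assumptions, not the paper's actual settings:

```python
# Illustrative dimensions (assumptions, not the paper's configuration).
d_out, d_in = 640, 640
r = 4  # hypothetical LoRA rank

# LoRA: Delta W = B A, with B (d_out x r) and A (r x d_in).
lora_params = r * (d_out + d_in)

# Kronecker adapter: Delta W = A ⊗ B, with A (a1 x a2) and
# B (d_out/a1 x d_in/a2), so the full-size update is recovered
# without a low-rank bottleneck on Delta W.
a1, a2 = 16, 16
kron_params = a1 * a2 + (d_out // a1) * (d_in // a2)

print(f"LoRA: {lora_params} params, Kronecker: {kron_params} params")
```

With these assumed shapes the Kronecker factorization stores noticeably fewer values than the LoRA pair while producing a full-size, higher-rank update; the exact savings depend on the chosen factor shapes and rank.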

State-of-the-art Comparison

  • We compare DiffuseKronA with six related methods: \(\textbf{LoRA-DreamBooth}\), \(\textbf{DreamBooth}\), \(\textbf{LoRA-SVDiff}\), \(\textbf{SVDiff}\), \(\textbf{Custom Diffusion}\), and \(\textbf{Textual Inversion}\). We maintain the original settings of all these methods to ensure a fair comparison.
  • As shown below, our DiffuseKronA generates images that are highly aligned with the input images and consistently incorporates features mentioned in the input text prompt. The better fidelity and in-depth understanding of input text prompts are attributed to the structure-preserving ability and greater expressiveness of Kronecker product-based adaptation.

Final Takeaways (What You Shouldn’t Miss!)

BibTeX


    @InProceedings{Marjit_2025_WACV,
        author    = {Marjit, Shyam and Singh, Harshit and Mathur, Nityanand and Paul, Sayak and Yu, Chia-Mu and Chen, Pin-Yu},
        title     = {DiffuseKronA: A Parameter Efficient Fine-Tuning Method for Personalized Diffusion Models},
        booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
        month     = {February},
        year      = {2025},
        pages     = {3529-3538}
    }