Left panel: Naive DFT. Right panel: Adjusted DFT, which also standardizes \(A_i\) to 1.
Comparison of naive and adjusted DFT of the diffusion model. Gray solid lines denote the sampling process. Red dashed lines highlight the gradient computation w.r.t. the model parameters \(\boldsymbol{\theta}\). Variables \(\boldsymbol{z}_{t}\) and \(\boldsymbol{\epsilon}^{(t)}\) represent the data and the noise prediction at time step \(t\). \(\textrm{D}_i\) and \(\textrm{I}_i\) denote the direct and indirect gradient paths between data of adjacent time steps. For instance, at \(t=3\), naive DFT computes the exact gradient \(-A_3\boldsymbol{B}_3\frac{\partial\boldsymbol{\epsilon}^{(3)}}{\partial\boldsymbol{\theta}}\) (defined in Eq. 9 of the paper), which involves other time steps' noise predictions (through the gradient paths \(\textrm{I}_1\textrm{I}_2\textrm{I}_3\textrm{I}_4\textrm{I}_5\), \(\textrm{I}_1\textrm{I}_2\textrm{D}_2\textrm{I}_5\), and \(\textrm{D}_1\textrm{I}_3\textrm{I}_4\textrm{I}_5\)). Adjusted DFT instead uses an adjusted gradient that removes the coupling with other time steps and standardizes \(A_i\) to 1, for more effective finetuning.
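To make the adjustment concrete, below is a minimal PyTorch sketch on a toy linear sampling chain \(\boldsymbol{z}_{t-1}=a_t\boldsymbol{z}_t+b_t\boldsymbol{\epsilon}^{(t)}\); the network, schedule, and loss are illustrative placeholders, not the paper's implementation. Detaching the U-Net input removes the indirect paths \(\textrm{I}_i\) (no network Jacobians across time steps), and a forward-identity gradient rescale divides out the accumulated chain coefficient, standardizing \(A_t\) to 1.

import torch
import torch.nn as nn

class ToyEpsNet(nn.Module):
    """Stand-in for the noise-prediction U-Net."""
    def __init__(self, dim=8):
        super().__init__()
        self.fc = nn.Linear(dim + 1, dim)

    def forward(self, z, t):
        t_col = torch.full_like(z[:, :1], float(t))
        return self.fc(torch.cat([z, t_col], dim=-1))

def scale_grad(x, s):
    # Identity in the forward pass; multiplies the backward gradient by s.
    return x.detach() + s * (x - x.detach())

def sample_adjusted(net, z_T, a, b):
    # Toy chain z_{t-1} = a[t] * z_t + b[t] * eps^(t) for t = T, ..., 1.
    # In z_0, eps^(t) carries the accumulated coefficient
    # A_t = a[1] * ... * a[t-1] (times b[t]); scale_grad cancels A_t so it
    # acts as 1 in the backward pass.
    T = a.shape[0] - 1
    cp = torch.cumprod(a, dim=0)                # cp[t] = a[1] * ... * a[t] (a[0] = 1)
    z = z_T
    for t in range(T, 0, -1):
        eps = net(z.detach(), t)                # detach input: keep direct path D_t only
        eps = scale_grad(eps, 1.0 / cp[t - 1])  # standardize A_t to 1
        z = a[t] * z + b[t] * eps
    return z

torch.manual_seed(0)
T, dim = 10, 8
a = torch.full((T + 1,), 0.98); a[0] = 1.0      # toy schedule; a[0] is a placeholder
b = torch.full((T + 1,), 0.05)
net = ToyEpsNet(dim)
x0 = sample_adjusted(net, torch.randn(4, dim), a, b)
loss = x0.pow(2).mean()                         # placeholder for the fairness loss
loss.backward()                                 # per-step terms, each decoupled and with A_t = 1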
Left panel: training loss for minimizing the average CLIP & DINO similarity. Right panel: estimated gradient scale at different time steps.
The left figure plots the training loss during DFT with three distinct gradients; each is reported over 3 random runs. The right figure estimates the scale of these gradients at different time steps; the mean and \(90\%\) CI are computed from 20 random runs. Naive DFT uses the exact gradient, whose norm is illustrated by the \(|\boldsymbol{R}_tA_t\boldsymbol{B}_t\frac{\partial\boldsymbol{\epsilon}^{(t)}}{\partial\boldsymbol{\theta}}|\) entry in the right figure. The proposed adjusted DFT is denoted by the "ours" entry.
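The mean-and-CI computation behind the right panel is standard; a small NumPy sketch, with simulated stand-in values for the logged gradient norms:

import numpy as np

runs, T = 20, 50                                 # 20 random runs, T time steps
rng = np.random.default_rng(0)
# Stand-in for the per-run, per-step gradient norms logged during sampling.
grad_norms = rng.lognormal(mean=0.0, sigma=0.5, size=(runs, T))

mean = grad_norms.mean(axis=0)                   # mean curve per time step
lo, hi = np.percentile(grad_norms, [5, 95], axis=0)  # 90% CI band per time step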
|
Representation of gender (the left figure) and race (the right four figures) in images generated using 50 occupational test prompts (x-axis). The green horizontal lines denote the desired target distribution. |
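Concretely, "representation" here is the per-prompt frequency of each group among the generated images, and its gap to the target can be summarized by a total variation distance. A small sketch with hypothetical classifier predictions (real values would come from the gender/race classifiers run on the generated images):

target = {"male": 0.5, "female": 0.5}           # desired target distribution
labels = ["male"] * 38 + ["female"] * 12        # hypothetical labels for 50 images of one prompt

freq = {g: labels.count(g) / len(labels) for g in target}
tv_gap = 0.5 * sum(abs(freq[g] - target[g]) for g in target)
print(freq, f"TV gap to target = {tv_gap:.2f}")  # {'male': 0.76, 'female': 0.24} 0.26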
|
Generated images for templated prompts with unseen occupations, using the original SD (left) and the debiased SD (right). For every image, the first color-coded bar denotes the predicted gender: blue for male and red for female. The second denotes the predicted race: green for WMELH, orange for Asian, black for Black, and brown for Indian. Bar height represents prediction confidence. Bounding boxes denote detected faces. For the same prompt, images with the same number label are generated using the same noise.
Generated images for non-templated occupational prompts, using the original SD (left) and the debiased SD (right). For every image, the first color-coded bar denotes the predicted gender: blue for male and red for female. The second denotes the predicted race: green for WMELH, orange for Asian, black for Black, and brown for Indian. Bar height represents prediction confidence. Bounding boxes denote detected faces. For the same prompt, images with the same number label are generated using the same noise.
A salient feature of our method is its flexibility: users can specify the desired target distribution. To demonstrate this, we show that our method can adjust the age distribution to a 75% young and 25% old ratio while simultaneously debiasing gender and race. The right figure shows that the original SD displays marked occupational age bias; for example, it associates "senator" solely with old individuals, followed by custodian, butcher, and inventor. Our method achieves approximately 25% representation of old individuals for most occupations. As the table below shows, it neither undermines the effectiveness of debiasing gender and race nor degrades the quality of the generated images.
Generated images using the original SD (left) and the debiased SD (right). In this figure, the color-coded bar denotes age: red for young and blue for old. Bar height represents prediction confidence. Bounding boxes denote detected faces. For the same prompt, images with the same number label are generated using the same noise. We do not annotate gender and race for visual clarity.
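As a sketch of what specifying a target distribution looks like, the per-attribute targets in this experiment can be written as plain probability distributions; the attribute names and layout below are illustrative, not the paper's API:

# Per-attribute target distributions; age is deliberately non-uniform.
targets = {
    "gender": {"male": 0.5, "female": 0.5},
    "race": {"WMELH": 0.25, "Asian": 0.25, "Black": 0.25, "Indian": 0.25},
    "age": {"young": 0.75, "old": 0.25},
}
# Each target must be a valid probability distribution.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in targets.values())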
Our method is scalable. It can debias multiple concepts at once, such as occupations, sports, and personal descriptors, by expanding the set of prompts used for finetuning.
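For instance, the finetuning prompt set can be grown by templating over word lists for each concept; the templates and lists below are illustrative placeholders, not the paper's exact prompts:

occupations = ["doctor", "custodian", "senator"]
sports = ["playing basketball", "swimming"]
descriptors = ["a confident person", "an attractive person"]

prompts = (
    [f"a photo of the face of a {w}" for w in occupations]
    + [f"a photo of a person {w}" for w in sports]
    + [f"a photo of {w}" for w in descriptors]
)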
@inproceedings{shen2023finetuning,
title={Finetuning Text-to-Image Diffusion Models for Fairness},
author={Xudong Shen and Chao Du and Tianyu Pang and Min Lin and Yongkang Wong and Mohan Kankanhalli},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=hnrB5YHoYu}
}