Using 20 inference steps |
"an oil painting of a snowy mountain village" |
"a man wearing a hat" |
"a rocket ship" |
Using 10 inference steps |
"an oil painting of a snowy mountain village" |
"a man wearing a hat" |
"a rocket ship" |
Overall, the DeepFloyd model does a good job of generating images that align with the specified prompts. However, its outputs are not especially realistic or high-quality. All of the images look artificial; even for a realistic prompt like "a man wearing a hat," the result reads as an art piece rather than a photograph. We also see that quality drops with fewer inference steps: for the man wearing a hat, detail decreases when we reduce the step count, and the lighting effects in the 10-step image are less pronounced than in the 20-step image.
I am using a seed of 180 for the project.
We implement the forward function, which adds a certain amount of noise to an image. The amount of noise added increases depending on the timestep we are on, and this is dictated by the alphas_cumprod associated with the scheduler of the model. The formula used to compute the noisy image is:
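In symbols, with $\bar{\alpha}_t$ taken from alphas_cumprod, this is the standard DDPM forward process:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

A minimal sketch in code, assuming alphas_cumprod is a 1-D tensor indexed by timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    # Noise a clean image for timestep t: more noise for larger t.
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```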
We show some results for the Campanile noised according to t=250, 500, and 750 timesteps.
Campanile |
Noisy Campanile (t=250) |
Noisy Campanile (t=500) |
Noisy Campanile (t=750) |
We attempt to use the classical Gaussian blur method to denoise our images. We use a kernel_size of 5 and a sigma of 2. The intuition behind this is that Gaussian blurring should attenuate the high-frequency noise. However, this does not provide good results.
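A minimal sketch of this baseline, assuming the noisy images are PyTorch image tensors:

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    # Blur away high-frequency noise (and, unavoidably, high-frequency detail).
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```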
Noisy Campanile (t=250) |
Noisy Campanile (t=500) |
Noisy Campanile (t=750) |
Gaussian Blur Denoising (t=250) |
Gaussian Blur Denoising (t=500) |
Gaussian Blur Denoising (t=750) |
We use the pre-trained Stage 1 DeepFloyd model to do one-step denoising. One-step denoising works by estimating the noise in a noisy image and then removing that estimated noise. Removal is done with the following formula: given x_t and a prediction for epsilon, we recover an estimate of x_0.
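$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}$$

This is just the forward-process equation rearranged for $x_0$.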
We look at the results of one-step denoising for the Campanile noised at different levels (higher t = more noise). One-step denoising does better with lower noise added to the input image, which makes sense because it is easier to extract the signal of the original image if less noise is added.
Original Campanile |
Noisy Campanile (t=250) |
Noisy Campanile (t=500) |
Noisy Campanile (t=750) |
One-Step Denoised (t=250) |
One-Step Denoised (t=500) |
One-Step Denoised (t=750) |
Now, we switch to iterative denoising. We start with a noisy image and apply our UNet to predict the added noise. We remove only some of this noise, producing a slightly less noisy image rather than jumping straight to a prediction of the clean image. We repeat this process until we end up with a denoised image. The following formula shows how we denoise the image at each timestep.
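In our notation, with $t'$ the next (less noisy) timestep, $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ the current clean-image estimate, and $v_\sigma$ added noise, the update is the standard DDPM posterior mean:

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\,x_t + v_\sigma$$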
In our implementation, we start with the noisy image at timestep t=990 and denoise in strides of 30 timesteps until we reach timestep t=0.
We run iterative denoising on a noisy version of the Campanile. From right to left, we see that iterative denoising gradually denoises the image.
Noisy Campanile |
Noisy Campanile |
Noisy Campanile |
Noisy Campanile |
Noisy Campanile |
We also compare to the one-step denoised Campanile and find that iterative denoising recovers more of the details. Iterative denoising also does far better than classical Gaussian denoising.
Original |
Iteratively Denoised |
One-Step Denoised |
Gaussian Blurred |
We can generate images from scratch by sampling from the diffusion model starting from completely random noise (i.e., i_start = 0), so that the model denoises pure noise. Below are the results of this process for the prompt "a high quality photo".
Sample 1 |
Sample 2 |
Sample 3 |
Sample 4 |
Sample 5 |
However, we find that the photos in the previous part are not very high-quality at all. Classifier-free guidance (CFG) can help us generate images that align more closely with the prompt and, in this case, have higher quality. We compute noise estimates from the UNet using both the conditional prompt ("a high quality photo") and the unconditional prompt (""), and then compute the noise estimate used for denoising with the following formula:
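$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$

where $\epsilon_c$ and $\epsilon_u$ are the conditional and unconditional noise estimates and $\gamma$ is the CFG scale ($\gamma > 1$ extrapolates past the conditional estimate).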
In our implementation, we use a CFG scale of gamma = 7.
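As a sketch in code (the unet call signature here is illustrative, not DeepFloyd's exact API):

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7):
    # Two UNet passes per step: one conditioned on the prompt, one on "".
    eps_cond = unet(x_t, t, cond_embeds)
    eps_uncond = unet(x_t, t, uncond_embeds)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```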
Sample 1 with CFG |
Sample 2 with CFG |
Sample 3 with CFG |
Sample 4 with CFG |
Sample 5 with CFG |
We find that the generated images are of higher quality.
Here, we explore the SDEdit algorithm. We take an original image, noise it to different levels, and then denoise with iterative denoising and CFG with the prompt "a high quality photo". We find that the more noise is added to the original image (i.e., the lower the value of i_start), the more the denoised image differs from the original. This makes sense: if little noise is added, iterative denoising is more likely to produce a result that lies close to the original image on the image manifold.
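In sketch form, with forward and iterative_denoise the routines from the earlier parts (names are ours):

```python
def sdedit(im, i_start, strided_timesteps, prompt_embeds):
    # Noise a real image partway, then denoise it back onto the image manifold.
    t = strided_timesteps[i_start]  # larger i_start => less noise added
    x_t = forward(im, t)
    return iterative_denoise(x_t, i_start, prompt_embeds)
```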
In the following examples that we ran, we see the images look more and more like the original image as we start with higher values of i_start (i.e., less noisy images).
SDEdit with i_start = 1 |
SDEdit with i_start = 3 |
SDEdit with i_start = 5 |
SDEdit with i_start = 7 |
SDEdit with i_start = 10 |
SDEdit with i_start = 20 |
Original |
SDEdit with i_start = 1 |
SDEdit with i_start = 3 |
SDEdit with i_start = 5 |
SDEdit with i_start = 7 |
SDEdit with i_start = 10 |
SDEdit with i_start = 20 |
Original |
SDEdit with i_start = 1 |
SDEdit with i_start = 3 |
SDEdit with i_start = 5 |
SDEdit with i_start = 7 |
SDEdit with i_start = 10 |
SDEdit with i_start = 20 |
Original |
We apply the SDEdit algorithm to a web image of Calvin and Hobbes and to my hand-drawn sketches of a monkey and mountains. We see a similar pattern: the generated images look more and more like the original for higher values of i_start.
Calvin and Hobbes |
Calvin and Hobbes |
Calvin and Hobbes |
Calvin and Hobbes |
Calvin and Hobbes |
Calvin and Hobbes |
Original |
Monkey |
Monkey |
Monkey |
Monkey |
Monkey |
Monkey |
Monkey |
Mountains |
Mountains |
Mountains |
Mountains |
Mountains |
Mountains |
Mountains |
We explore inpainting, which lets us generate just one part of an image. At every timestep, we use the following formula to replace everything outside our mask m with the original image, properly noised for that timestep.
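$$x_t \leftarrow \mathbf{m}\,x_t + (1 - \mathbf{m})\,\text{forward}(x_{\text{orig}}, t)$$

where $\mathbf{m}$ is 1 inside the region to generate and 0 elsewhere.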
We experiment with inpainting on the Campanile, the Sun, and the Empire State Building.
Campanile |
Mask |
Hole to Fill |
Campanile Inpainted |
Sun |
Mask |
Hole to Fill |
Sun Inpainted |
Empire State |
Mask |
Hole to Fill |
Empire State Inpainted |
Now, we follow the SDEdit algorithm but with a text prompt. We noise the original images as in SDEdit. The noisier the starting image (i.e., the lower the value of i_start), the more the result reflects the text prompt rather than the original image, since CFG can guide the generation over more timesteps.
We experiment with this technique on the images below.
Rocket Ship |
Rocket Ship |
Rocket Ship |
Rocket Ship |
Rocket Ship |
Rocket Ship |
Original |
Dog |
Dog |
Dog |
Dog |
Dog |
Dog |
Original |
Waterfalls |
Waterfalls |
Waterfalls |
Waterfalls |
Waterfalls |
Waterfalls |
Original |
Now, we try to create visual anagrams that look like two different images depending on whether they are viewed right-side up or upside down. This is done by computing a regular CFG noise estimate with one prompt, and also computing the CFG noise estimate of the flipped image with the other prompt and then flipping that estimate back. We average the two noise estimates to get the estimate used for iterative denoising. In formulas:
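$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big)$$
$$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$

where $p_1, p_2$ are the two prompts, flip is a vertical flip, and each UNet call is itself a CFG noise estimate.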
Here are some results along with the two prompts used.
Old Man & Campfire |
"an oil painting |
"an oil painting of people |
Dog & Hipster Barista |
"a photo of a dog" |
"a photo of a hipster barista" |
Village & Campfire |
"an oil painting of a snowy |
"an oil painting of people |
We can also create hybrid images using a technique similar to visual anagrams. Instead of flipping the noise estimates, we apply lowpass and highpass filters to them: we get the two CFG noise estimates for the two prompts, apply a lowpass filter to one and a highpass filter to the other, and add the results together to obtain the noise estimate used for iterative denoising. In formulas:
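$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \qquad \epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$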
In our implementation, we use a Gaussian filter of kernel_size=33 and sigma=2 for computing the lowpass and highpass results.
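A sketch of the filtering, assuming the noise estimates are image-shaped tensors:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps1, eps2, kernel_size=33, sigma=2.0):
    # Low frequencies from one prompt's estimate, high frequencies from the other's.
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```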
Here are some results.
Hybrid image 1 |
Hybrid image 2 |
Hybrid image 3 |
We implement the UNet using simple operations like Conv, DownConv, UpConv, Flatten, Unflatten, and Concat, and composed operations like ConvBlock, DownBlock, and UpBlock. Here is the architecture we use:
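As a sketch of the composed operations (channel widths, normalization, and kernel choices here are illustrative, not a full specification):

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two shape-preserving conv layers.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class DownBlock(nn.Module):
    # Halve spatial resolution, then refine with a ConvBlock.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            ConvBlock(out_ch, out_ch),
        )
    def forward(self, x):
        return self.net(x)

class UpBlock(nn.Module):
    # Double spatial resolution, then refine with a ConvBlock.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            ConvBlock(out_ch, out_ch),
        )
    def forward(self, x):
        return self.net(x)
```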
We train a denoiser to predict the clean image from a noisy image. To get training data, we take the MNIST digit dataset and add noise to each image. We can add varying levels of noise, dictated by the sigma parameter, as shown below.
Varying levels of noise on MNIST digits |
We eventually train just for sigma = 0.5.
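In code, the noising is a one-liner (sketch):

```python
import torch

def add_noise(x, sigma=0.5):
    # z = x + sigma * N(0, I); larger sigma means a harder denoising problem.
    return x + sigma * torch.randn_like(x)
```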
We train the UNet on the MNIST training set, generating noisy images with sigma=0.5. We use a hidden dimension of 128, the Adam optimizer with a learning rate of 1e-4, and a batch size of 256, and we train for 5 epochs. Here are our training losses over the 5 epochs:
Training loss curve for single-step denoising unconditional UNet |
Then, we test our model on digits from the test set and qualitatively assess how well it recovers images noised with sigma=0.5. It does reasonably well. We test the model both after 1 epoch of training and after all 5 epochs; the 5-epoch model does a little better.
Results on digits from the test set after 1 epoch of training |
Results on digits from the test set after 5 epochs of training |
We also look at how well our denoising model does for values of sigma it was not trained on. From the results below, it does decently at other noise levels, except for sigma=1.0.
Results on digits from the test set with varying noise levels |
Now, we implement DDPM. To do so, we first inject the timestep t into our architecture, using linear layers composed into an FCBlock. We add these FCBlocks into the architecture as follows:
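A sketch of the FCBlock itself (the exact injection points follow the diagram):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    # Maps a conditioning signal (e.g. t normalized to [0, 1]) to a feature
    # vector that modulates a decoder stage.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )
    def forward(self, x):
        return self.net(x)
```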
We train the time-conditioned UNet with the following algorithm. Instead of predicting the clean image from a noisy image, we now aim to predict how much noise was added. We take clean images, noise them, predict the added noise, and compute the loss against the actual noise.
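One training step then looks roughly like this (a sketch; we assume t is normalized to [0, 1] before being fed to the network):

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0, alphas_cumprod, T):
    t = torch.randint(1, T, (x0.shape[0],))              # random timestep per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)     # regress onto the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```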
We train with hidden dimension=64, batch_size=128, and over 20 epochs. We also use an Adam optimizer with initial learning rate=1e-3 and use an exponential learning rate decay scheduler that we step every epoch.
Here are our training losses.
Training loss curve for time-conditioned UNet |
Once we train our time-conditioned UNet, we can sample from it. We use the following algorithm to accomplish iterative denoising.
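In sketch form (variable names ours; the update mirrors the Part A formula, with sqrt(beta_t) noise added at each step):

```python
import torch

@torch.no_grad()
def sample(unet, n, T, alphas, alphas_cumprod, betas):
    x = torch.randn(n, 1, 28, 28)                        # start from pure noise
    for t in range(T - 1, 0, -1):
        eps = unet(x, torch.full((n,), t / T))           # predicted noise
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = (a_bar_prev.sqrt() * betas[t] / (1 - a_bar)) * x0_hat \
            + (alphas[t].sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x \
            + betas[t].sqrt() * z
    return x
```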
We sample results from our model after training for 5 epochs and training for 20 epochs. We see that after training for 20 epochs, our model produces more coherent images of digits.
Sampling results from the 5th epoch for time-conditioned UNet |
Sampling results from the 20th epoch for time-conditioned UNet |
Now, we add class-conditioning to our time-conditioned UNet so that we can generate an image of a specific digit. To do so, we inject the class into the architecture at places similar to where we injected the timestep. We fetch the classes from the MNIST dataset and pass a one-hot vector representing the class into our network. With probability p_uncond = 0.1, we zero out this class-conditioning vector so that our UNet also works without being conditioned on the class. Here is the algorithm:
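The conditioning-dropout step, as a sketch:

```python
import torch
import torch.nn.functional as F

def class_vector(labels, p_uncond=0.1, num_classes=10):
    # One-hot class vector, zeroed out with probability p_uncond so the UNet
    # also learns the unconditional distribution.
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0]) >= p_uncond).float().unsqueeze(1)
    return c * keep
```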
We train with the same hyperparameters as before. Here is the training loss curve:
Training loss curve for class-conditioned UNet |
Once we train our class-conditioned UNet, we can sample from it. We use CFG so that our outputs look more like the desired class: the unconditional input is a zeroed-out class vector, while the conditional input is the one-hot vector of the desired class. Here is the algorithm we follow:
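Each sampling step combines two passes exactly as in Part A's CFG (sketch):

```python
import torch

def guided_noise(unet, x, t_norm, c_onehot, gamma=5):
    # Unconditional pass: zeroed class vector. Conditional pass: one-hot class.
    eps_uncond = unet(x, t_norm, torch.zeros_like(c_onehot))
    eps_cond = unet(x, t_norm, c_onehot)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```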
We use a guidance scale of gamma = 5 when sampling. We see that sampling from both the 5th and 20th epoch produces good results, with the 20th epoch's generated images being slightly higher-quality and sharper.
Sampling results from the 5th epoch for class-conditioned UNet |
Sampling results from the 20th epoch for class-conditioned UNet |