Project 5A: The Power of Diffusion Models

Part 0: Setup

Using 20 inference steps

"an oil painting of a snowy mountain village"

"a man wearing a hat"

"a rocket ship"

Using 10 inference steps

"an oil painting of a snowy mountain village"

"a man wearing a hat"

"a rocket ship"

Overall, the DeepFloyd model does a good job of generating images that align with the specified prompt, but its outputs are not especially realistic or high quality. All of the images look artificial; even for a realistic prompt like "a man wearing a hat", the result looks like an art piece rather than real life. We also see that quality drops with fewer inference steps: for the man wearing a hat, detail decreases when we reduce the number of inference steps -- the lighting effects in the 10-step result are noticeably weaker than in the 20-step result.

I am using a seed of 180 for the project.

Part 1: Sampling Loops

1.1: Implementing the Forward Process

We implement the forward function, which adds a certain amount of noise to an image. The amount of noise added increases with the timestep, as dictated by the alphas_cumprod of the model's scheduler. The formula used to compute the noisy image is:
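x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

Here \bar\alpha_t is alphas_cumprod[t], x_0 is the clean image, and \epsilon is fresh Gaussian noise (this is the standard DDPM forward-process equation, reproduced here for reference).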

We show some results for the Campanile noised according to t=250, 500, and 750 timesteps.

Campanile
at t=0 (not noisy)

Noisy Campanile
at t=250

Noisy Campanile
at t=500

Noisy Campanile
at t=750

1.2 Classical Denoising

We attempt to use the classical Gaussian blur method to denoise our images. We use a kernel_size of 5 and a sigma of 2. The intuition behind this is that Gaussian blurring should attenuate the high-frequency noise. However, this does not provide good results.
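As a rough sketch, assuming the noisy image is a PyTorch tensor, this amounts to a single call to torchvision's Gaussian blur:

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy, kernel_size=5, sigma=2.0):
    # Blur away high-frequency content -- this removes some of the noise
    # but also destroys fine detail in the underlying image.
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)
```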

Noisy Campanile
at t=250

Noisy Campanile
at t=500

Noisy Campanile
at t=750

Gaussian Blur Denoising
at t=250

Gaussian Blur Denoising
at t=500

Gaussian Blur Denoising
at t=750

1.3 One-Step Denoising

We use the pre-trained Stage 1 DeepFloyd model to do one-step denoising. One-step denoising works by estimating the noise in a noisy image and then removing that estimated noise. Given x_t and a prediction for epsilon, we recover an estimate of x_0 with the following formula:
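\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}}

This is just the forward-process equation from 1.1 rearranged to solve for x_0, with \hat\epsilon the UNet's noise estimate.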

We look at the results of one-step denoising for the Campanile noised at different levels (higher t = more noise). One-step denoising does better with lower noise added to the input image, which makes sense because it is easier to extract the signal of the original image if less noise is added.

Original
Campanile

Noisy Campanile
at t=250

Noisy Campanile
at t=500

Noisy Campanile
at t=750

One-Step Denoised
Campanile at t=250

One-Step Denoised
Campanile at t=500

One-Step Denoised
Campanile at t=750

1.4 Iterative Denoising

Now, we switch to iterative denoising. We start with a noisy image and apply our UNet to predict the added noise. Rather than jumping straight to a prediction of the clean image, we remove only some of this noise, ending up with a slightly less noisy image. We repeat the process until we end up with a denoised image. The following formula shows how we iteratively denoise the image at each timestep.
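Writing t' for the next (less noisy) timestep in our schedule, with \alpha_t = \bar\alpha_t / \bar\alpha_{t'} and \beta_t = 1 - \alpha_t, the standard DDPM update is:

x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma

where x_0 is the current clean-image estimate (computed as in 1.3) and v_\sigma is random noise scaled by the predicted variance.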

In our implementation, we start with the noisy image at timestep t=990 and denoise in strides of 30 timesteps until we reach t=0.

We run iterative denoising on a noisy version of the Campanile. From right to left, we see that iterative denoising gradually denoises the image.

Noisy Campanile
at t=90

Noisy Campanile
at t=240

Noisy Campanile
at t=390

Noisy Campanile
at t=540

Noisy Campanile
at t=690

We also compare against the one-step denoised Campanile and find that iterative denoising recovers more of the detail. Iterative denoising also does far better than classical Gaussian-blur denoising.

Original
Campanile

Iteratively Denoised
Campanile

One-Step Denoised
Campanile

Gaussian Blurred
Campanile

1.5 Diffusion Model Sampling

We can generate images from scratch by starting from pure random noise (i.e., i_start = 0) and running the full iterative denoising loop. Below are the results of this process for the prompt "a high quality photo".

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

1.6 Classifier-Free Guidance

However, we find that the images from the previous part are not very high quality. Classifier-free guidance (CFG) helps us generate images that align more closely with the prompt and, in this case, are of higher quality. We compute noise estimates from the UNet using both the conditional prompt ("a high quality photo") and the unconditional (empty) prompt "", and then combine them into the noise estimate used for denoising with the following formula:
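\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)

where \epsilon_c and \epsilon_u are the conditional and unconditional noise estimates and \gamma is the CFG scale (\gamma = 1 recovers ordinary conditional generation; \gamma > 1 strengthens the guidance).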

In our implementation, we use a CFG scale of gamma = 7.

Sample 1 with CFG

Sample 2 with CFG

Sample 3 with CFG

Sample 4 with CFG

Sample 5 with CFG

We find that the generated images are of higher quality.

1.7 Image-to-image translation

Here, we explore the SDEdit algorithm. We take an original image, noise it to different levels, and then denoise with iterative denoising and CFG using the prompt "a high quality photo". We find that the more noise is added to the original image (i.e., the lower the value of i_start), the more the denoised image differs from the original. This makes sense: if little noise is added, iterative denoising is more likely to land on a point of the image manifold that is close to the original image.
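A minimal sketch of the procedure, assuming the forward function from 1.1 and an iterative_denoise helper wrapping the CFG loop from 1.6 (these names are assumptions):

```python
def sdedit(original, i_start, prompt_embeds, strided_timesteps):
    # Smaller i_start = start earlier in the schedule = more noise added.
    t = strided_timesteps[i_start]
    noisy = forward(original, t)      # noise the original image to level t
    # Denoise back toward the image manifold, guided by the prompt via CFG.
    return iterative_denoise(noisy, i_start, prompt_embeds)
```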

In the following examples, the outputs look more and more like the original image as i_start increases (i.e., as we start from less noisy images).

SDEdit with i_start = 1
given campanile

SDEdit with i_start = 3
given campanile

SDEdit with i_start = 5
given campanile

SDEdit with i_start = 7
given campanile

SDEdit with i_start = 10
given campanile

SDEdit with i_start = 20
given campanile

Original
Campanile

SDEdit with i_start = 1
given nuggets

SDEdit with i_start = 3
given nuggets

SDEdit with i_start = 5
given nuggets

SDEdit with i_start = 7
given nuggets

SDEdit with i_start = 10
given nuggets

SDEdit with i_start = 20
given nuggets

Original
Nuggets

SDEdit with i_start = 1
given Yosemite

SDEdit with i_start = 3
given Yosemite

SDEdit with i_start = 5
given Yosemite

SDEdit with i_start = 7
given Yosemite

SDEdit with i_start = 10
given Yosemite

SDEdit with i_start = 20
given Yosemite

Original
Yosemite

1.7.1 Editing Hand-Drawn and Web Images

We apply the SDEdit algorithm to a web image of Calvin and Hobbes and to my hand-drawn sketches of a monkey and of mountains. We see a similar result: the generated images look more and more like the original image for higher values of i_start.

Calvin and Hobbes
at i_start=1

Calvin and Hobbes
at i_start=3

Calvin and Hobbes
at i_start=5

Calvin and Hobbes
at i_start=7

Calvin and Hobbes
at i_start=10

Calvin and Hobbes
at i_start=20

Original
Calvin and Hobbes

Monkey
at i_start=1

Monkey
at i_start=3

Monkey
at i_start=5

Monkey
at i_start=7

Monkey
at i_start=10

Monkey
at i_start=20

Monkey
Sketch

Mountains
at i_start=1

Mountains
at i_start=3

Mountains
at i_start=5

Mountains
at i_start=7

Mountains
at i_start=10

Mountains
at i_start=20

Mountains
Sketch

1.7.2 Inpainting

We explore inpainting, which lets us regenerate just one part of the image. At every timestep, we use the following formula to replace everything outside the mask m with the original image, noised appropriately for that timestep.
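x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{orig}, t)

where m is 1 inside the region we want to regenerate and 0 outside, and forward(x_{orig}, t) is the forward process from 1.1 applied to the original image.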

We experiment with inpainting on the Campanile, the Sun, and the Empire State Building.

Campanile

Mask

Hole to Fill

Campanile Inpainted

Sun

Mask

Hole to Fill

Sun Inpainted

Empire State

Mask

Hole to Fill

Empire State Inpainted

1.7.3 Text-Conditional Image-to-image Translation

Now, we follow the SDEdit algorithm but guide the denoising with a text prompt. We noise the original images as before. The noisier the starting image (i.e., the lower the value of i_start), the more the result resembles the text prompt rather than the original image, since CFG can guide it over more timesteps.

We experiment with this technique on the Campanile with a rocket ship prompt, on the Sun with a dog prompt, and on Yosemite with a waterfalls prompt.

Rocket Ship
at noise level 1

Rocket Ship
at noise level 3

Rocket Ship
at noise level 5

Rocket Ship
at noise level 7

Rocket Ship
at noise level 10

Rocket Ship
at noise level 20

Original
Campanile

Dog
at noise level 1

Dog
at noise level 3

Dog
at noise level 5

Dog
at noise level 7

Dog
at noise level 10

Dog
at noise level 20

Original
Sun

Waterfalls
at noise level 1

Waterfalls
at noise level 3

Waterfalls
at noise level 5

Waterfalls
at noise level 7

Waterfalls
at noise level 10

Waterfalls
at noise level 20

Original
Yosemite

1.8 Visual Anagrams

Now, we create visual anagrams that look like two different images depending on whether they are viewed right-side up or upside down. To do this, we compute a regular CFG noise estimate for the first prompt on the image, and a second CFG noise estimate for the other prompt on the flipped image, which we then flip back. We average the two noise estimates to obtain the estimate used for iterative denoising. In formulas:
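\epsilon_1 = \text{CFG}(x_t, t, p_1), \qquad \epsilon_2 = \text{flip}\big(\text{CFG}(\text{flip}(x_t), t, p_2)\big), \qquad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2}

where CFG(\cdot) denotes the classifier-free-guided noise estimate from 1.6, p_1 and p_2 are the two prompts, and flip(\cdot) flips the image vertically.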

Here are some results along with the two prompts used.

Old Man & Campfire

"an oil painting
of an old man"

"an oil painting of people
around a campfire"

Dog & Hipster Barista

"a photo of a dog"

"a photo of a hipster barista"

Village & Campfire

"an oil painting of a snowy
mountain village"

"an oil painting of people
around a campfire"

1.9 Hybrid Images

We can also create hybrid images using a technique similar to visual anagrams. Here, instead of flipping, we apply lowpass and highpass filters to the noise estimates. We compute a CFG noise estimate for each of the two prompts, apply a lowpass filter to one and a highpass filter to the other, and add the results together to obtain the noise estimate used for iterative denoising. In formulas:
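\epsilon_1 = \text{CFG}(x_t, t, p_1), \qquad \epsilon_2 = \text{CFG}(x_t, t, p_2), \qquad \epsilon = f_{lowpass}(\epsilon_1) + f_{highpass}(\epsilon_2)

where p_1 is the prompt that should dominate from far away (low frequencies) and p_2 the prompt that should dominate up close (high frequencies).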

In our implementation, we use a Gaussian filter of kernel_size=33 and sigma=2 for computing the lowpass and highpass results.
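A minimal sketch of this low-/high-pass split, assuming the two CFG noise estimates eps1 and eps2 are PyTorch tensors (the names here are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(eps1, eps2, kernel_size=33, sigma=2.0):
    # Lowpass of the first prompt's noise estimate...
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    # ...plus the highpass (original minus lowpass) of the second prompt's estimate.
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```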

Here are some results.

Hybrid image of a
skull and a waterfall

Hybrid image of a
dog and an old man

Hybrid image of
waterfalls and an old man

Project 5B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

We implement the UNet using simple operations like Conv, DownConv, UpConv, Flatten, Unflatten, and Concat, and composed operations like ConvBlock, DownBlock, and UpBlock. Here is the architecture we use:

1.2 Using the UNet to Train a Denoiser

We can train a denoiser to predict the clean image from a noisy one. To get training data, we take the MNIST digit dataset and add noise to each image. We can add varying levels of noise, dictated by the sigma parameter, as shown below.

Varying levels of noise on MNIST digits
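A minimal sketch of this noising operation, assuming the images are normalized tensors:

```python
import torch

def add_noise(x, sigma):
    # Add zero-mean Gaussian noise with standard deviation sigma.
    return x + sigma * torch.randn_like(x)
```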

We eventually train just for sigma = 0.5.

1.2.1 Training

We train the UNet on the MNIST training dataset, generating noisy images with sigma=0.5. We use a hidden dimension of 128 in our architecture, the Adam optimizer with a learning rate of 1e-4, and a batch_size of 256, training for 5 epochs. Here are our training losses over 5 epochs:

Training loss curve for single-step denoising unconditional UNet

Then, we test our model on digits from the test set and qualitatively examine how well it recovers images noised with sigma=0.5. It does reasonably well. We test both the model after 1 epoch of training and the model after all 5 epochs -- the 5-epoch model does a little better.

Results on digits from the test set after 1 epoch of training

Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

We also look at how well our denoising model does for other values of sigma that it was not trained for. From the below results, it appears that the denoising model does decently for other values of sigma, except for sigma=1.0.

Results on digits from the test set with varying noise levels

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

Now, we implement DDPM. To do so, we first inject the timestep t into our architecture. We use linear layers composed into an FCBlock, and add these FCBlocks to our architecture as follows:
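As a rough illustration (the exact layer sizes and injection points follow the architecture diagram, not this sketch), an FCBlock can be written as two linear layers with a GELU in between:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP whose output is broadcast-added to an intermediate feature map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t: (batch, in_dim) conditioning input, e.g. the normalized timestep t / T.
        return self.net(t)
```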

2.2 Training the UNet

We train the time-conditioned UNet with the following algorithm. Now, instead of predicting the clean image from a noisy image, we aim to predict the noise that was added to it. In each training step, we take clean images, noise them at a random timestep, predict the added noise, and compute the loss against the actual noise.
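A hedged sketch of one training step (the unet call signature, the normalization of t, and the name alphas_cumprod are assumptions about our implementation):

```python
import torch
import torch.nn.functional as F

def train_step(unet, x, alphas_cumprod, T, optimizer):
    # Sample a random timestep per image and the corresponding cumulative alpha.
    t = torch.randint(1, T + 1, (x.shape[0],), device=x.device)
    a_bar = alphas_cumprod[t - 1].view(-1, 1, 1, 1)
    # Forward process: noise the clean batch with fresh Gaussian noise.
    eps = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps
    # Predict the added noise and regress against it.
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```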

We train with hidden dimension=64, batch_size=128, and over 20 epochs. We also use an Adam optimizer with initial learning rate=1e-3 and use an exponential learning rate decay scheduler that we step every epoch.

Here are our training losses.

Training loss curve for time-conditioned UNet

2.3 Sampling from the UNet

Once we train our time-conditioned UNet, we can sample from it. We use the following algorithm to accomplish iterative denoising.
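Concretely, starting from pure noise x_T \sim \mathcal{N}(0, I), each step applies the standard DDPM update:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z

with z \sim \mathcal{N}(0, I) for t > 1 and z = 0 at the final step.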

We sample results from our model after training for 5 epochs and training for 20 epochs. We see that after training for 20 epochs, our model produces more coherent images of digits.

Sampling results from the 5th epoch for time-conditioned UNet

Sampling results from the 20th epoch for time-conditioned UNet

2.4 Adding Class-Conditioning to UNet

Now, we add class-conditioning to our time-conditioned UNet so that we can generate an image of a specific digit. To do so, we inject the class into the architecture at similar places to where we injected the timestep. We also fetch the class labels from the MNIST dataset and pass a one-hot vector representing the class into our network. With probability p_uncond = 0.1, we zero out this class-conditioning vector so that our UNet also learns to work without class conditioning. Here is the algorithm:
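A minimal sketch of this class dropout, assuming c is a one-hot tensor of shape (batch, 10):

```python
import torch

def drop_classes(c, p_uncond=0.1):
    # Zero out the entire one-hot vector for a random subset of the batch,
    # so the UNet also learns an unconditional noise estimate.
    keep = (torch.rand(c.shape[0], 1, device=c.device) >= p_uncond).float()
    return c * keep
```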

We train with the same hyperparameters as before. Here is the training loss curve:

Training loss curve for class-conditioned UNet

2.5 Sampling from the Class-Conditioned UNet

Once we train our class-conditioned UNet, we can sample from it. We use CFG so that our outputs look more like the desired class: the unconditional input is a zeroed-out class vector, while the conditional input is the one-hot vector of the desired class. Here is the algorithm we follow:
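A minimal sketch of the guided noise estimate used at each sampling step (the unet call signature here is an assumption):

```python
import torch

def cfg_noise(unet, x_t, t_norm, c, gamma=5.0):
    eps_uncond = unet(x_t, t_norm, torch.zeros_like(c))  # zeroed-out class vector
    eps_cond = unet(x_t, t_norm, c)                      # one-hot class vector
    return eps_uncond + gamma * (eps_cond - eps_uncond)  # guided noise estimate
```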

We use a guidance scale of gamma = 5 when sampling. We see that sampling from both the 5th and 20th epoch produces good results, with the 20th epoch's generated images being slightly higher-quality and sharper.

Sampling results from the 5th epoch for class-conditioned UNet

Sampling results from the 20th epoch for class-conditioned UNet