CS180 Project 5a: The Power of Diffusion Models

Part 1: Implementing the Forward Process

The forward process takes a clean image and adds noise to it. This is equivalent to the following:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$$

where \( \epsilon \sim \mathcal{N}(0, I) \). The first square-root term scales the clean image; since \( \bar{\alpha}_t \) decreases as t increases, the resulting image is noisier for larger t. The second term adds Gaussian noise. Epsilon must be the same size as the original image, with each pixel sampled independently from a standard normal distribution.
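As a concrete reference, here is a minimal PyTorch sketch of the forward process above. The function and variable names (`forward`, `alphas_cumprod`) are my assumptions, following common DDPM code, not necessarily the project's reference implementation.

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t via the forward process.

    alphas_cumprod: precomputed cumulative product of alphas, one scalar
    per timestep (an assumption about how the schedule is stored).
    """
    alpha_bar = alphas_cumprod[t]
    # epsilon has the same shape as x0; every pixel is an independent N(0, 1) draw
    epsilon = torch.randn_like(x0)
    return torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * epsilon
```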

Part 2: Classical Denoising

Gaussian blurring removes high-frequency components (which noise mostly consists of) by smoothing the image, with nearer pixels receiving larger weight. I used a sigma of 1.5 with a kernel size of 5. The results are below, but we see this isn't effective: at a high level, Gaussian blur is just a weighted average, so when the image is very noisy, averaging noisy pixels still leaves a lot of noise, since less and less of the original image remains.
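For reference, a one-line sketch using torchvision's `gaussian_blur` with the parameters quoted above (`noisy_image` is assumed to be the noised tensor from Part 1):

```python
from torchvision.transforms.functional import gaussian_blur

# Classical denoising baseline: 5x5 Gaussian blur with sigma = 1.5
blurred = gaussian_blur(noisy_image, kernel_size=5, sigma=1.5)
```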

Part 3: Implementing One-Step Denoising

We can rearrange the equation in part 1 to estimate the original image (assuming we know what the noise is).

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} x_t - \frac{\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}} \, \epsilon$$

We can use a pre-trained diffusion model to get the noise estimate, and with the noisy image above we can obtain an estimate of the original image. The results are shown below.
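A minimal sketch of this one-step estimate, assuming a pretrained UNet `unet` that returns a noise prediction (the call signature is an assumption):

```python
# Plug the UNet's noise estimate into the rearranged equation above
alpha_bar = alphas_cumprod[t]
epsilon_hat = unet(x_t, t)  # assumed signature: noisy image + timestep
x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * epsilon_hat) / torch.sqrt(alpha_bar)
```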

Part 4: Iterative Denoising

From the previous part, we see that having the diffusion model predict the noise and then solving for the original image does a much better job of projecting the noisy image onto the natural image manifold. However, it clearly does worse as we add more noise. We can address this by breaking the problem into smaller ones and denoising iteratively: we start from the noisy image and make it progressively less noisy. We use a timestep stride of 30, which is much cheaper than a stride of 1.

\[x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma\]

where \( t' < t \) is the next (less noisy) timestep, \( \alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current clean-image estimate, and \( v_\sigma \) is predicted variance noise.
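A sketch of one update of this loop (variable names are my assumptions, mirroring the formula above):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    """One step of iterative denoising: move from timestep t to t' < t."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp
    beta_t = 1 - alpha_t
    # weighted combination of the clean-image estimate and the current noisy image
    x_tp = (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t \
         + v_sigma
    return x_tp
```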

Here are the results at every 5th iteration of the denoising loop, with the earlier (noisier) iterations shown first.

Here are the results for one-step denoising, Gaussian blur denoising, and the final result of iterative denoising.

Part 5: Diffusion Model Sampling

If we set i_start = 0 and pass in random noise, we can denoise pure noise using the iterative_denoise function from before. Here are the results. As we can see, it's hard to tell what some images even are (like the 2nd and 3rd). Even though these are 64x64 images (so the resolution is not great), we should still be able to tell what each image depicts.
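The call itself is simple; a sketch, with the shape and signature assumed:

```python
# Sampling from scratch: pure Gaussian noise, denoised from i_start = 0
noise = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(noise, i_start=0)
```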

Part 6: Classifier-Free Guidance (CFG)

\[\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\]

When estimating the noise, a higher \( \gamma \) results in an estimate that more closely follows the conditional noise, meaning the resulting image adheres more closely to the conditional text prompt. With \( \gamma > 1 \) we extrapolate past the conditional estimate, which helps the diffusion model generate higher-quality images that follow the text prompt. The results are below.
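A sketch of the CFG combination, assuming two UNet evaluations (the embedding names and the example scale are assumptions, not the spec's values):

```python
# Run the UNet once with the conditional prompt embedding and once with
# the unconditional (empty-prompt) embedding, then extrapolate
eps_cond = unet(x_t, t, cond_embedding)
eps_uncond = unet(x_t, t, uncond_embedding)
gamma = 7.0  # CFG scale; gamma > 1 pushes the estimate toward the condition
epsilon = eps_uncond + gamma * (eps_cond - eps_uncond)
```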

Part 7: Image-to-image Translation

When we add more noise to the original image, we force the model to hallucinate more, so the denoising process results in a different image. We can see the results for different i_start values, with a higher i_start meaning less noise is added. I also added an i_start of 30 for good measure.
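The procedure, sketched; `strided_timesteps`, `forward`, and `iterative_denoise` follow the earlier parts, and the exact names and noise levels here are illustrative assumptions:

```python
# Noise the original image to the level indexed by i_start, then denoise
for i_start in [1, 3, 5, 7, 10, 20, 30]:
    t = strided_timesteps[i_start]
    noisy = forward(original_image, t, alphas_cumprod)
    edited = iterative_denoise(noisy, i_start=i_start)
```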

Part 7.1: Editing Hand-Drawn and Web Images

We run the same code as in the last part, just on one image from the web and two scribbles. All three images are non-realistic, so projecting them onto the natural image manifold should show the effect more clearly.

Here are the results for our first scribble, a house.

CS180 Project 5b: Diffusion Models from Scratch

Part 1: Training a Single-Step Denoising UNet

The first part is implementing the different blocks of the Unconditional UNet:

  1. Conv: 3x3 convolution with padding=1 and stride=1 + Batch Normalization + GELU
  2. Down Conv: 3x3 convolution with padding=1 and stride=2 + Batch Normalization + GELU
  3. Up Conv: 4x4 transposed convolution (upsampling) with padding=1 and stride=2 + Batch Normalization + GELU
  4. Flatten: 7x7 average pooling + GELU
  5. Unflatten: 7x7 transposed convolution (upsampling) + Batch Normalization + GELU
  6. Concat: Channel-wise concatenation
  7. ConvBlock, DownBlock, UpBlock: All combinations of the single operations above
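A minimal PyTorch sketch of a few of these blocks follows; the channel counts and exact composition are my assumptions based on the descriptions above, not the spec's reference code:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv operations chained: (3x3 conv + BatchNorm + GELU) twice."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    """Down Conv: 3x3 convolution with stride 2 halves the spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class UpConv(nn.Module):
    """Up Conv: 4x4 transposed convolution with stride 2 doubles the size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)
```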

The encoder downsamples the images to increase the receptive field, which helps identify patterns and structural information in the noisy image. Then we upsample back to the original resolution, using skip connections to link corresponding levels of the downsampling and upsampling paths. This helps preserve details that might otherwise be lost during downsampling.

To add noise to the original images we can use the following formula: $$z = x + \sigma\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0,I).$$
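This is a one-liner in PyTorch; a sketch:

```python
def add_noise(x, sigma):
    """z = x + sigma * epsilon, with epsilon ~ N(0, I) per pixel."""
    return x + sigma * torch.randn_like(x)
```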

We can visualize the process of adding Gaussian noise below for different images.

Here is our training loss graph for a learning rate of 0.0001.

Here is our visualization of the denoising after the first epoch.

Here is our visualization of the denoising after the fifth epoch.

The model was trained with sigma=0.5, so we can see how it performs out of distribution by denoising images with sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Part 2: Training A Diffusion Model (Time-Conditioned)

In the last part, we trained a one-step denoiser, but it was only trained for sigma=0.5. We saw that out-of-distribution testing didn't go well when sigma was higher than 0.5. However, if we condition the UNet on time, a single model can denoise an image at any timestep.

We first need to introduce a scalar variable t representing the timestep; I injected it into the network in the same way as the project spec. Also, changing the UNet to predict the expected noise instead of the denoised image is an equivalent problem: the two are related through the forward-process equation, so solving for one gives the other.

Once we do that, we need to train the UNet. For each forward pass we generate a random timestep for every image in the batch, use those timesteps to add noise to the images, and pass the noisy images and timesteps as tensors to our UNet. Below is our loss graph for a learning rate of 1e-3 and a batch size of 128.
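A hedged sketch of this training loop; T = 300 and the linear beta schedule follow my reading of the spec, while `model`, `dataloader`, and `device` are assumed to exist:

```python
import torch
import torch.nn.functional as F

T = 300
betas = torch.linspace(1e-4, 0.02, T, device=device)       # assumed schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x, _ in dataloader:                        # batch size 128; labels unused
    x = x.to(device)
    # one random timestep per image in the batch
    t = torch.randint(0, T, (x.shape[0],), device=device)
    eps = torch.randn_like(x)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps
    eps_pred = model(x_t, t.float() / T)       # normalized timestep as input
    loss = F.mse_loss(eps_pred, eps)           # predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```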

We can also sample from the trained model at different epochs to see what the results look like while the model is training. To sample an image, we iteratively build it up from timestep 300 down to timestep 0. The results from epoch 5 and epoch 20 are shown below.

epoch=5
epoch=20
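For reference, a hedged sketch of the sampling loop, using the standard DDPM update; the spec's algorithm may differ in minor details:

```python
@torch.no_grad()
def sample(model, n, betas, device, T=300):
    """Start from pure noise and denoise from t = T-1 down to t = 0."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, 1, 28, 28, device=device)   # MNIST-sized noise
    for t in reversed(range(T)):
        ts = torch.full((n,), t, device=device)
        eps_pred = model(x, ts.float() / T)
        # no variance noise is added at the final step
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps_pred) \
            / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * z
    return x
```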

Part 3: Training A Diffusion Model (Class-Conditioned)

The last part had a limitation: we can't control which digit we generate. We essentially turned pure noise into random digits, and the results weren't the best for specific images. To get better results, we implement CFG just like we did in part A of this project. I implemented the class conditioning the same way as in the project specification.

Inside each training batch, we one-hot encode the labels for our model to use. Inside the forward pass of the model, we apply dropout with probability 0.1 to the class conditioning, meaning we train our model to generate both conditionally and unconditionally. A training loss curve is presented below.
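A sketch of the conditioning step inside the training loop above; the masking details are my assumptions:

```python
# One-hot encode the labels, then zero out the conditioning vector for
# ~10% of the batch so the model also learns the unconditional distribution
c = F.one_hot(labels, num_classes=10).float()            # (B, 10)
drop_mask = torch.rand(c.shape[0], device=c.device) < 0.1
c[drop_mask] = 0.0                                        # unconditional rows
eps_pred = model(x_t, t.float() / T, c)
```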

Now that our model is trained both conditionally and unconditionally, we can apply CFG. For our classes, we double the conditioning vector, with the second half being unconditional (zero) vectors, and we repeat the images and the timesteps. This gives us both the conditional and unconditional noise estimates for each image in a single pass (allowing us to effectively see what makes each digit that exact digit). The difference (noise_pred_cond - noise_pred_uncond) captures the characteristics that make that number appear, and CFG amplifies those characteristics. The results for epoch 5 and epoch 20 are shown below, after a short sketch of this batching trick.
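A sketch of the doubled-batch CFG evaluation at sampling time; the names and the example scale are assumptions:

```python
# One UNet call yields both conditional and unconditional noise estimates
c = F.one_hot(classes, num_classes=10).float()           # desired digits
c_both = torch.cat([c, torch.zeros_like(c)], dim=0)      # cond + uncond halves
x_both = x.repeat(2, 1, 1, 1)
t_both = ts.repeat(2)
eps_both = model(x_both, t_both.float() / T, c_both)
eps_cond, eps_uncond = eps_both.chunk(2, dim=0)
eps = eps_uncond + gamma * (eps_cond - eps_uncond)       # e.g. gamma = 5
```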

epoch=5 (0-4)
epoch=5 (5-9)
epoch=20 (0-4)
epoch=20 (5-9)