CS180 Project 5a: The Power of Diffusion Models

Part 1: Implementing the Forward Process

The forward process takes a clean image and adds noise to it. This is equivalent to the following:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$$

where \( \epsilon \sim \mathcal{N}(0, I) \). The first square-root term scales the clean image; since \( \bar{\alpha}_t \) decreases as t increases, the resulting image is noisier for larger t. The second term adds Gaussian noise. Epsilon must be the same size as the original image, with each pixel sampled independently from a standard normal distribution.
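As a concrete reference, here is a minimal PyTorch sketch of the forward process above. The function and variable names (`forward`, `alphas_cumprod`) are my assumptions, following common DDPM code, not necessarily the project's reference implementation.

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t via the forward process.

    alphas_cumprod: precomputed cumulative product of alphas, one scalar
    per timestep (an assumption about how the schedule is stored).
    """
    alpha_bar = alphas_cumprod[t]
    # epsilon has the same shape as x0; every pixel is an independent N(0, 1) draw
    epsilon = torch.randn_like(x0)
    return torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * epsilon
```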

Part 2: Classical Denoising

Gaussian blurring removes high-frequency components (which noise mostly consists of) by smoothing the image, with nearer pixels receiving larger weight. I used a sigma of 1.5 with a kernel size of 5. The results are below, but we see this isn't effective: at a high level, Gaussian blur is just a weighted average, so when the image is very noisy, averaging noisy pixels still leaves a lot of noise, since less and less of the original image remains.
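For reference, a one-line sketch using torchvision's `gaussian_blur` with the parameters quoted above (`noisy_image` is assumed to be the noised tensor from Part 1):

```python
from torchvision.transforms.functional import gaussian_blur

# Classical denoising baseline: 5x5 Gaussian blur with sigma = 1.5
blurred = gaussian_blur(noisy_image, kernel_size=5, sigma=1.5)
```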

Part 3: Implementing One-Step Denoising

We can rearrange the equation in part 1 to estimate the original image (assuming we know what the noise is).

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} x_t - \frac{\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}} \, \epsilon$$

We can use a pre-trained diffusion model to get the noise estimate, and with the noisy image above we can obtain an estimate of the original image. The results are shown below.
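A minimal sketch of this one-step estimate, assuming a pretrained UNet `unet` that returns a noise prediction (the call signature is an assumption):

```python
# Plug the UNet's noise estimate into the rearranged equation above
alpha_bar = alphas_cumprod[t]
epsilon_hat = unet(x_t, t)  # assumed signature: noisy image + timestep
x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * epsilon_hat) / torch.sqrt(alpha_bar)
```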

Part 4: Iterative Denoising

From the previous part, we see that having the diffusion model predict the noise and then solving for the original image does a much better job of projecting the noisy image onto the natural image manifold. However, it clearly does worse as we add more noise. We can address this by breaking the problem into smaller ones and denoising iteratively: we start from the noisy image and make it progressively less noisy. We use a timestep stride of 30, which is much cheaper than a stride of 1.

\[x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma\]

where \( t' < t \) is the next (less noisy) timestep, \( \alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current clean-image estimate, and \( v_\sigma \) is predicted variance noise.
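A sketch of one update of this loop (variable names are my assumptions, mirroring the formula above):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma):
    """One step of iterative denoising: move from timestep t to t' < t."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp
    beta_t = 1 - alpha_t
    # weighted combination of the clean-image estimate and the current noisy image
    x_tp = (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t \
         + v_sigma
    return x_tp
```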

Here are the results at every 5th iteration of the denoising loop, with the earlier (noisier) iterations shown first.

Here are the results for one-step denoising, Gaussian blur denoising, and the final result of iterative denoising.

Part 5: Diffusion Model Sampling

If we set i_start = 0 and pass in random noise, we can denoise pure noise using the iterative_denoise function from before. Here are the results. As we can see, it's hard to tell what some images even are (like the 2nd and 3rd). Even though these are 64x64 images (so the resolution is not great), we should still be able to tell what each image depicts.
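The call itself is simple; a sketch, with the shape and signature assumed:

```python
# Sampling from scratch: pure Gaussian noise, denoised from i_start = 0
noise = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(noise, i_start=0)
```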

Part 6: Classifier-Free Guidance (CFG)

\[\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\]

When estimating the noise, a higher \( \gamma \) results in an estimate that more closely follows the conditional noise, meaning the resulting image adheres more closely to the conditional text prompt. With \( \gamma > 1 \) we extrapolate past the conditional estimate, which helps the diffusion model generate higher-quality images that follow the text prompt. The results are below.
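A sketch of the CFG combination, assuming two UNet evaluations (the embedding names and the example scale are assumptions, not the spec's values):

```python
# Run the UNet once with the conditional prompt embedding and once with
# the unconditional (empty-prompt) embedding, then extrapolate
eps_cond = unet(x_t, t, cond_embedding)
eps_uncond = unet(x_t, t, uncond_embedding)
gamma = 7.0  # CFG scale; gamma > 1 pushes the estimate toward the condition
epsilon = eps_uncond + gamma * (eps_cond - eps_uncond)
```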

Part 7: Image-to-image Translation

When we add more noise to the original image, we force the model to hallucinate more, so the denoising process results in a different image. We can see the results for different i_start values, with a higher i_start meaning less noise is added. I also added an i_start of 30 for good measure.
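The procedure, sketched; `strided_timesteps`, `forward`, and `iterative_denoise` follow the earlier parts, and the exact names and noise levels here are illustrative assumptions:

```python
# Noise the original image to the level indexed by i_start, then denoise
for i_start in [1, 3, 5, 7, 10, 20, 30]:
    t = strided_timesteps[i_start]
    noisy = forward(original_image, t, alphas_cumprod)
    edited = iterative_denoise(noisy, i_start=i_start)
```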

Part 7.1: Editing Hand-Drawn and Web Images

We run the same code as in the last part, just on one image from the web and two scribbles. All three images are non-realistic, so projecting them onto the natural image manifold should show the effect more clearly.

Here are the results for our first scribble, a house.

CS180 Project 5b: Diffusion Models from Scratch

Part 1: Training a Single-Step Denoising UNet

The first part is implementing the different blocks of the Unconditional UNet:

  1. Conv: 3x3 convolution with padding=1 and stride=1 + Batch Normalization + GELU
  2. Down Conv: 3x3 convolution with padding=1 and stride=2 + Batch Normalization + GELU
  3. Up Conv: 4x4 transposed convolution (upsampling) with padding=1 and stride=2 + Batch Normalization + GELU
  4. Flatten: 7x7 average pooling + GELU
  5. Unflatten: 7x7 transposed convolution (upsampling) + Batch Normalization + GELU
  6. Concat: Channel-wise concatenation
  7. ConvBlock, DownBlock, UpBlock: All combinations of the single operations above
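A minimal PyTorch sketch of a few of these blocks follows; the channel counts and exact composition are my assumptions based on the descriptions above, not the spec's reference code:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv operations chained: (3x3 conv + BatchNorm + GELU) twice."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    """Down Conv: 3x3 convolution with stride 2 halves the spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class UpConv(nn.Module):
    """Up Conv: 4x4 transposed convolution with stride 2 doubles the size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)
```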

The encoder downsamples the images to increase the receptive field, which helps identify patterns and structural information in the noisy image. Then we upsample back to the original resolution, using skip connections to link corresponding levels of the downsampling and upsampling paths. This helps preserve details that might otherwise be lost during downsampling.

To add noise to the original images we can use the following formula: $$z = x + \sigma\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0,I).$$
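This is a one-liner in PyTorch; a sketch:

```python
def add_noise(x, sigma):
    """z = x + sigma * epsilon, with epsilon ~ N(0, I) per pixel."""
    return x + sigma * torch.randn_like(x)
```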

We can visualize the process of adding Gaussian noise below for different images.

Here is our training loss graph for a learning rate of 0.0001.

Here is our visualization of the denoising after the first epoch.

Here is our visualization of the denoising after the fifth epoch.

The model was trained with sigma=0.5, so we can see how it performs out of distribution by denoising images with sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Part 2: Training A Diffusion Model (Time-Conditioned)

In the last part, we trained a one-step denoiser, but it was only trained for sigma=0.5. We saw that out-of-distribution testing didn't go well when sigma was higher than 0.5. However, if we condition the UNet on time, a single model can denoise an image at any timestep.

We first need to introduce a scalar variable t representing the timestep; I injected it into the network in the same way as the project spec. Also, changing the UNet to predict the expected noise instead of the denoised image is an equivalent problem: the two are related through the forward-process equation, so solving for one gives the other.

Once we do that, we need to train the UNet. For each forward pass we generate a random timestep for every image in the batch, use those timesteps to add noise to the images, and pass the noisy images and timesteps as tensors to our UNet. Below is our loss graph for a learning rate of 1e-3 and a batch size of 128.
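A hedged sketch of this training loop; T = 300 and the linear beta schedule follow my reading of the spec, while `model`, `dataloader`, and `device` are assumed to exist:

```python
import torch
import torch.nn.functional as F

T = 300
betas = torch.linspace(1e-4, 0.02, T, device=device)       # assumed schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x, _ in dataloader:                        # batch size 128; labels unused
    x = x.to(device)
    # one random timestep per image in the batch
    t = torch.randint(0, T, (x.shape[0],), device=device)
    eps = torch.randn_like(x)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps
    eps_pred = model(x_t, t.float() / T)       # normalized timestep as input
    loss = F.mse_loss(eps_pred, eps)           # predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```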

We can also sample from the trained model at different epochs to see what the results look like while the model is training. To sample an image, we iteratively build it up from timestep 300 down to timestep 0. The results from epoch 5 and epoch 20 are shown below.

epoch=5
epoch=20
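For reference, a hedged sketch of the sampling loop, using the standard DDPM update; the spec's algorithm may differ in minor details:

```python
@torch.no_grad()
def sample(model, n, betas, device, T=300):
    """Start from pure noise and denoise from t = T-1 down to t = 0."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, 1, 28, 28, device=device)   # MNIST-sized noise
    for t in reversed(range(T)):
        ts = torch.full((n,), t, device=device)
        eps_pred = model(x, ts.float() / T)
        # no variance noise is added at the final step
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps_pred) \
            / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * z
    return x
```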

Part 3: Training A Diffusion Model (Class-Conditioned)

The last part had a limitation: we can't control which digit we generate. We essentially turned pure noise into random digits, and the results weren't the best for specific images. To get better results, we implement CFG just like we did in part A of this project. I implemented the class conditioning the same way as in the project specification.

Inside each training batch, we one-hot encode the labels for our model to use. Inside the forward pass of the model, we apply dropout with probability 0.1 to the class conditioning, meaning we train our model to generate both conditionally and unconditionally. A training loss curve is presented below.
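A sketch of the conditioning step inside the training loop above; the masking details are my assumptions:

```python
# One-hot encode the labels, then zero out the conditioning vector for
# ~10% of the batch so the model also learns the unconditional distribution
c = F.one_hot(labels, num_classes=10).float()            # (B, 10)
drop_mask = torch.rand(c.shape[0], device=c.device) < 0.1
c[drop_mask] = 0.0                                        # unconditional rows
eps_pred = model(x_t, t.float() / T, c)
```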

Now that our model is trained both conditionally and unconditionally, we can apply CFG. For our classes, we double the conditioning vector, with the second half being unconditional (zero) vectors, and we repeat the images and the timesteps. This gives us both the conditional and unconditional noise estimates for each image in a single pass (allowing us to effectively see what makes each digit that exact digit). The difference (noise_pred_cond - noise_pred_uncond) captures the characteristics that make that number appear, and CFG amplifies those characteristics. The results for epoch 5 and epoch 20 are shown below, after a short sketch of this batching trick.
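A sketch of the doubled-batch CFG evaluation at sampling time; the names and the example scale are assumptions:

```python
# One UNet call yields both conditional and unconditional noise estimates
c = F.one_hot(classes, num_classes=10).float()           # desired digits
c_both = torch.cat([c, torch.zeros_like(c)], dim=0)      # cond + uncond halves
x_both = x.repeat(2, 1, 1, 1)
t_both = ts.repeat(2)
eps_both = model(x_both, t_both.float() / T, c_both)
eps_cond, eps_uncond = eps_both.chunk(2, dim=0)
eps = eps_uncond + gamma * (eps_cond - eps_uncond)       # e.g. gamma = 5
```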

epoch=5 (0-4)
epoch=5 (5-9)
epoch=20 (0-4)
epoch=20 (5-9)