Generative AI

Image-to-Image Translation with DeciDiffusion: A Developer’s Guide

Introduction

DeciDiffusion, developed by Deci, is a robust text-to-image latent diffusion model.

With 1.02 billion parameters, it surpasses the performance of models like Stable Diffusion v1.5, delivering high-quality images efficiently. When integrated with Deci’s Inference SDK, Infery, DeciDiffusion’s speed is optimized, producing results quickly on NVIDIA A10G GPUs. The model’s architecture, driven by Deci’s Neural Architecture Search engine, AutoNAC, is tailored for optimal inference efficiency.

While DeciDiffusion excels in text-to-image generation, there’s a clear opportunity to expand its capabilities to image-to-image translation—a domain with vast applications from design to medical imaging.

Adapting DeciDiffusion can meet the industry’s need for fast and precise image-to-image models.

In this guide, I’ll walk you through the process of adapting DeciDiffusion for image-to-image tasks, highlighting the necessary architectural changes and practical steps.


The Science Behind Image-to-Image Translation

Image-to-image translation is the process of taking an input image and transforming it into a different visual representation.

At its core, it’s mapping the content of one image to another while retaining the essence of the original. This can range from tasks like turning a sketch into colourful artwork, changing day scenes to night, or even converting satellite images into detailed maps.

However, this translation has its challenges.

The model must discern intricate details, understand context, and produce visually appealing and accurate outputs. Minor discrepancies can lead to significant inaccuracies in the translated image.

That’s why a robust model is crucial.


Diving into SDEdit: HuggingFace’s Diffusers Library Technique

SDEdit, or Stochastic Differential Editing, is the image-to-image diffusion technique that HuggingFace’s Diffusers library uses.

SDEdit leverages the power of stochastic differential equations (SDEs) for image synthesis, generating images by iteratively denoising through an SDE.

Here’s how it works: Given an input image with user guidance, such as a sketch or outline, SDEdit introduces noise to this input. It then employs the SDE to denoise the resulting image, enhancing its realism. This iterative denoising mechanism allows SDEdit to balance staying true to the user’s input and producing a realistic output.
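To make this concrete, here is a minimal, framework-agnostic sketch of the SDEdit loop. The model and scheduler arguments are hypothetical stand-ins for a denoising network and a Diffusers-style scheduler, and prompt conditioning is omitted for brevity; in practice, the Diffusers img2img pipeline wires all of this together for you.

import torch

def sdedit(init_latents, model, scheduler, strength=0.6, num_inference_steps=30):
    # Illustrative sketch only: `model(latents, t)` stands in for the UNet's
    # noise prediction, `scheduler` for a Diffusers-style scheduler.
    scheduler.set_timesteps(num_inference_steps)

    # Enter the noise schedule partway through: higher strength means more
    # noise and a result that strays further from the input image.
    start = max(1, int(num_inference_steps * strength))
    timesteps = scheduler.timesteps[-start:]

    # Perturb the input latents to match the chosen starting noise level.
    noise = torch.randn_like(init_latents)
    latents = scheduler.add_noise(init_latents, noise, timesteps[:1])

    # Iteratively denoise back toward a realistic image.
    for t in timesteps:
        noise_pred = model(latents, t)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents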


But why opt for SDEdit over other methods?

Traditional Generative Adversarial Networks (GANs) often necessitate additional training data or specialized loss functions for individual applications.

SDEdit sidesteps these requirements.

It doesn’t demand task-specific training or inversions, making it a versatile tool for various image synthesis tasks.

In practical terms, with SDEdit, developers and artists can achieve a harmonious blend of realism and faithfulness in their generated images without the overhead of extensive training or customization.

%%capture
!pip install diffusers transformers accelerate
import os
import time
from io import BytesIO

import matplotlib.pyplot as plt
import requests
import random
import itertools
import torch
from IPython.display import display, HTML
from PIL import Image
from tqdm.notebook import tqdm
from transformers import CLIPImageProcessor, CLIPTokenizer, CLIPTextModel

from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionPipeline
from diffusers.models import AutoencoderKL, UNet2DConditionModel
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from diffusers.schedulers import DDIMScheduler


Instantiating and Adapting DeciDiffusion for Image-to-Image Tasks


Initial Instantiation of DeciDiffusion:

Before adapting DeciDiffusion for image-to-image tasks, you first instantiate DeciDiffusion using the StableDiffusionPipeline.

In this step:

  • The from_pretrained method is used to load the DeciDiffusion-v1-0 model.
  • The custom_pipeline argument specifies the custom pipeline to be used, which in this case is also set to Deci/DeciDiffusion-v1-0.
  • The data type for PyTorch tensors is explicitly set to torch.float16 for optimized performance.
deci_diffusion_pipeline = StableDiffusionPipeline.from_pretrained('Deci/DeciDiffusion-v1-0',
                                                   custom_pipeline='Deci/DeciDiffusion-v1-0',
                                                   torch_dtype=torch.float16
                                                   )


The Specialized UNet-NAS in DeciDiffusion

U-Net is a popular convolutional neural network architecture primarily used for biomedical image segmentation.

However, DeciDiffusion takes this a step further by utilizing a specialized version called UNet-NAS (Neural Architecture Search). This variant of U-Net is designed to automatically search for the best network architecture, optimizing its performance for specific tasks.


Instantiating UNet-NAS for DeciDiffusion:

To harness the capabilities of UNet-NAS within DeciDiffusion, you instantiate it from the deci_diffusion_pipeline (the code appears after the next section).

In this instantiation:

  • The from_pretrained method is employed to load the UNet-NAS model specific to Deci/DeciDiffusion-v1-0.
  • The subfolder argument points to flexible_unet, indicating the location where the specialized U-Net model is stored.
  • Similar to the earlier instantiation, the data type for PyTorch tensors is set to torch.float16 for optimized computational performance.


Why UNet-NAS?

The adoption of UNet-NAS in DeciDiffusion offers several advantages:

  • Optimized Architecture: UNet-NAS automatically searches for the most efficient network structure, ensuring optimal performance for image-to-image tasks.
  • Flexibility: Given its neural architecture search capability, UNet-NAS can be tailored to various tasks, making it a versatile choice for diverse image processing challenges.
  • Enhanced Performance: By leveraging a network structure that’s specifically optimized for the task at hand, UNet-NAS can deliver superior results compared to standard U-Net architectures.
unet_nas = deci_diffusion_pipeline.unet.from_pretrained('Deci/DeciDiffusion-v1-0',
                                              subfolder='flexible_unet',
                                              torch_dtype=torch.float16)
Loading this subfolder prints a verbose config warning: the standard UNet2DConditionModel attributes (act_fn, block_out_channels, down_block_types, and so on) “were passed to FlexibleUNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.” This is the output produced when loading DeciDiffusion’s flexible UNet.


Preparing DeciDiffusion for Image-to-Image Tasks Using StableDiffusionImg2ImgPipeline


What is the StableDiffusionImg2ImgPipeline?

The StableDiffusionImg2ImgPipeline is a specialized pipeline provided by HuggingFace’s Diffusers library.

It’s designed to apply stable diffusion techniques to images, allowing for sophisticated image transformations based on the principles of stochastic differential editing.
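For reference, the stock pipeline can also be used on its own against the standard Stable Diffusion v1.5 weights. The snippet below is only meant to illustrate the interface we are about to subclass; the input file sketch.png is a placeholder.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Stock img2img pipeline with Stable Diffusion v1.5 weights (illustration only)
vanilla_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("sketch.png").convert("RGB").resize((512, 512))  # placeholder input
result = vanilla_pipe(
    prompt="a watercolor landscape",
    image=source,            # the reference image to transform
    strength=0.6,            # how far to stray from the reference
    guidance_scale=7.5,      # how strongly to follow the prompt
).images[0]
result.save("watercolor.png")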


How to Use StableDiffusionImg2ImgPipeline to Adapt DeciDiffusion for Image-to-Image Tasks

To transform DeciDiffusion into an image-to-image model, you can create a subclass of the StableDiffusionImg2ImgPipeline.

This subclass, DeciDiffusionPipeline_img2img, inherits the foundational capabilities of the parent pipeline but is tailored with custom functionalities specific to DeciDiffusion. Its constructor takes the components described below; the full implementation follows the component overview.


Components of the DeciDiffusionPipeline_img2img

  • vae (AutoencoderKL): A Variational Auto-Encoder (VAE) model that encodes and decodes images to and from latent representations. VAEs are powerful tools for generating new images by sampling from the latent space.
  • text_encoder (CLIPTextModel): A frozen text-encoder designed to convert textual descriptions into embeddings that can be used in conjunction with image data.
  • tokenizer (CLIPTokenizer): The CLIPTokenizer is used to tokenize text, converting human-readable text into a format that the CLIPTextModel can understand and process.
  • unet (UNet2DConditionModel): The UNet2DConditionModel is a neural network architecture designed for image denoising. It’s structured to process the encoded image latents and refine them, enhancing the quality of the generated images.
  • scheduler (SchedulerMixin): This component determines how the unet denoises the encoded image latents. It can be one of several types, including DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
  • safety_checker (StableDiffusionSafetyChecker): A classification module that evaluates generated images to determine if they could be considered offensive or harmful. It’s an essential component to ensure the responsible generation of images. More details about potential harms can be found in the model card.
  • feature_extractor (CLIPImageProcessor): The CLIPImageProcessor extracts features from the generated images. These features are then used as inputs to the safety_checker, ensuring that the generated images adhere to safety standards.
class DeciDiffusionPipeline_img2img(StableDiffusionImg2ImgPipeline):
    deci_default_number_of_iterations = 30
    deci_default_guidance_rescale = 0.7

    def __init__(self,
                 vae: AutoencoderKL,
                 text_encoder: CLIPTextModel,
                 tokenizer: CLIPTokenizer,
                 unet: UNet2DConditionModel,
                 scheduler: DDIMScheduler,
                 safety_checker: StableDiffusionSafetyChecker,
                 feature_extractor: CLIPImageProcessor,
                 requires_safety_checker: bool = True
                 ):

        super().__init__(vae=vae,
                         text_encoder=text_encoder,
                         tokenizer=tokenizer,
                         unet=unet,
                         scheduler=scheduler,
                         safety_checker=safety_checker,
                         feature_extractor=feature_extractor,
                         requires_safety_checker=requires_safety_checker
                         )

        self.register_modules(vae=vae,
                              text_encoder=text_encoder,
                              tokenizer=tokenizer,
                              unet=unet,
                              scheduler=scheduler,
                              safety_checker=safety_checker,
                              feature_extractor=feature_extractor)

    def __call__(self, *args, **kwargs):
        # Fall back to DeciDiffusion's default iteration count (30) when the
        # caller doesn't pass num_inference_steps explicitly.
        if "num_inference_steps" not in kwargs:
            kwargs.update({'num_inference_steps': self.deci_default_number_of_iterations})
        return super().__call__(*args, **kwargs)


Adapting DeciDiffusion for Image-to-Image Tasks

With the deci_diffusion_pipeline instance in place, you then use its components to instantiate DeciDiffusionPipeline_img2img, the subclass of StableDiffusionImg2ImgPipeline defined above.

Here:

  • Components like vae, text_encoder, tokenizer, scheduler, safety_checker, and feature_extractor are directly sourced from the previously instantiated deci_diffusion_pipeline.
  • The unet component is set to a custom unet_nas.
  • The safety checker requirement is turned off by setting requires_safety_checker to False.

This approach ensures that DeciDiffusion is seamlessly adapted to handle image-to-image tasks, leveraging the foundational capabilities of the StableDiffusionImg2ImgPipeline and customizing it to meet specific requirements.

pipe = DeciDiffusionPipeline_img2img(
    vae = deci_diffusion_pipeline.vae,
    text_encoder = deci_diffusion_pipeline.text_encoder,
    tokenizer = deci_diffusion_pipeline.tokenizer,
    unet = unet_nas,
    scheduler = deci_diffusion_pipeline.scheduler,
    safety_checker = deci_diffusion_pipeline.safety_checker,
    feature_extractor = deci_diffusion_pipeline.feature_extractor,
    requires_safety_checker = False
).to('cuda')


Let's see what we can transform the following image into.

# download image
url = "https://www.easypeasyandfun.com/wp-content/uploads/2021/07/Step-8-1.png"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image
prompt = "A cute baby dragon with blue and green scales, artstation painting"
# helper function to plot
def plot_images(images, values, title):
    """
    Plot a series of images side by side with their respective titles.

    Parameters:
    - images (list): A list of images to be plotted.
    - values (list): A list of values corresponding to each image, used for the title.
    - title (str): The main title for the images, used as a prefix for each image's title.

    Returns:
    None. Displays the images and saves the plot as a PNG file.
    """

    # Create a subplot with 1 row and as many columns as there are images/values
    fig, axs = plt.subplots(1, len(values), figsize=(15, 5))

    # Loop through each axis and image to display them
    for i, ax in enumerate(axs):
        ax.imshow(images[i])  # Display the image on the axis
        ax.set_title(f"{title}: {values[i]}")  # Set the title for the image using the provided value
        ax.axis('off')  # Turn off the axis for a cleaner look

    plt.tight_layout()  # Adjust the layout to minimize overlaps
    plt.savefig(f"{title}_experiments.png")  # Save the combined plot as a PNG file
    plt.show()  # Display the plot
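With the prompt, input image, and plotting helper in place, a single baseline call is a useful point of comparison before sweeping individual parameters. This call is not part of the experiments below; it simply uses the pipeline's defaults, including the 30-step default set by the subclass.

# Baseline generation with default settings
# (num_inference_steps falls back to 30 via DeciDiffusionPipeline_img2img)
baseline_image = pipe(prompt=prompt, image=init_image).images[0]
baseline_image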


Understanding generation parameters for Img2Img pipelines

There are a few important parameters you can use to control the result of the image-to-image pipeline:

  • strength
  • guidance_scale
  • num_inference_steps


Assessing the impact of strength on the generated image

In the Image to Image mode of Stable Diffusion, the strength parameter plays a pivotal role in determining the noise level introduced to the starting image during its generation.

  • Near 0: Setting strength close to 0 yields an outcome nearly identical to the original image.
  • At 1: A strength value of 1 introduces the maximum noise, making the image vastly different from the original. At this setting, the reference image is essentially disregarded, and the denoising process runs for the full count of iterations specified in num_inference_steps.

For a harmonious blend of the original image’s elements and new concepts, it’s advised to start with strength values between 0.4 and 0.6 and experiment from there. This range offers a balanced transformation, ensuring both familiarity and novelty in the output.

The strength parameter essentially gauges the transformation intensity of the reference image, with its potential range spanning from 0 to 1.

As the strength value increases, so does the noise added, with the number of denoising steps being contingent on the initial noise level.

all_images_strength = []

strength_values = [0.75 + i*0.05 for i in range(6)]

for strength in strength_values:
    image = pipe(prompt=prompt,
                 image=init_image,
                 strength=strength,
                 ).images[0]
    all_images_strength.append(image)

plot_images(all_images_strength, strength_values, "Strength")


How guidance_scale impacts the image

The guidance scale is a parameter that controls how closely the generated image adheres to the text prompt.

It balances creativity and adherence to the text prompt. The right value often depends on the specific use case and the complexity of the prompt.

A higher guidance scale value ensures greater adherence to the prompt, but it doesn’t always lead to better results. While it helps the generated image follow the input, it can also reduce the diversity and quality of the output.

When using the guidance scale feature, it is vital to understand how it affects the generated image.

If the value is set at 1, the text prompt will have minimal impact on the output.

However, if the value is set at 20, the text prompt will be strictly followed, which can decrease image quality.

Values between 7 and 12 tend to produce the most creative and artistic results, and values up to 15 can still yield quality images with minimal artifacts.

A good starting point is a guidance scale between 7 and 9; consider increasing it if the generated image doesn’t align with the prompt.

Extreme values like 1 and 20 should be avoided.

For prompts that are more complex and have detailed specifications, it might be beneficial to start with a higher guidance scale between 12 and 16. This is because such prompts often require the generated image to capture finer details, which a higher guidance scale can facilitate.
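For intuition, guidance_scale is the classifier-free guidance weight: at each denoising step the pipeline predicts noise twice, once with the prompt embedding and once with an empty prompt, and extrapolates between the two predictions. A sketch of that update (the variable names are illustrative, not the pipeline’s internals):

# Classifier-free guidance: how guidance_scale enters each denoising step.
# noise_uncond / noise_text are the UNet's predictions for an empty prompt
# and the actual prompt, respectively (names are illustrative).
def guided_noise(noise_uncond, noise_text, guidance_scale):
    # guidance_scale = 1.0 reduces to the plain conditional prediction;
    # larger values amplify the prompt direction, trading diversity for adherence.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)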


all_images_guidance_scale = []

guidance_scale_values = [i * 1.5 for i in range(3, 10)]

for guidance_scale in guidance_scale_values:
    image = pipe(prompt=prompt,
                 image=init_image,
                 guidance_scale=guidance_scale).images[0]
    all_images_guidance_scale.append(image)

plot_images(all_images_guidance_scale, guidance_scale_values, "Guidance Scale")


The steps parameter

The steps parameter in diffusion models plays a crucial role in determining the quality of the generated image.

Diffusion models operate iteratively: they start from noise (in the img2img case, a noised version of the input image) and refine it step by step, guided by the text prompt.

With each step, some noise is removed, enhancing the image’s quality.

The process halts once the specified number of steps is reached.

More denoising steps usually lead to a higher quality image at the expense of slower inference.

This parameter is modulated by strength.
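Concretely, the Diffusers img2img pipeline enters the noise schedule partway through, so roughly strength × num_inference_steps denoising steps are actually executed. A quick illustrative helper that mirrors this timestep selection:

# How strength scales the number of denoising steps actually run
# (mirrors the timestep selection in StableDiffusionImg2ImgPipeline)
def effective_steps(num_inference_steps: int, strength: float) -> int:
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(50, 0.3))   # 15 steps: stays close to the input image
print(effective_steps(50, 0.8))   # 40 steps: a much heavier transformation
print(effective_steps(50, 1.0))   # 50 steps: the input is essentially ignored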

# Loop over the num_inference_steps values
num_inference_steps_values = [i for i in range(30, 101, 10)]

all_images_num_inference_steps = []

for num_inference_steps in num_inference_steps_values:
    image = pipe(prompt=prompt,
                 image=init_image,
                 num_inference_steps=num_inference_steps).images[0]
    all_images_num_inference_steps.append(image)

plot_images(all_images_num_inference_steps, num_inference_steps_values, "Steps")


Random combinations of the parameters

And to round out our discussion, we can sample random combinations of strength, guidance_scale, and num_inference_steps and compare the results side by side.

num_inference_steps_values = [i for i in range(30, 101, 10)]
guidance_scale_values = [i * 1.5 for i in range(3, 8)]
strength_values = [0.75 + i*0.05 for i in range(4)]

# Generate all combinations using itertools.product
all_combinations = list(itertools.product(strength_values, guidance_scale_values, num_inference_steps_values))

# Randomly select 9 combinations
selected_combinations = random.sample(all_combinations, 9)

# Generate images for each combination
all_images = []
for strength, guidance_scale, num_inference_steps in selected_combinations:
    image = pipe(prompt=prompt,
                 image=init_image,
                 strength=strength,
                 guidance_scale=guidance_scale,
                 num_inference_steps=num_inference_steps).images[0]
    all_images.append(image)

# Plot the images on a 3x3 grid
fig, axs = plt.subplots(3, 3, figsize=(15, 15))
for i, ax in enumerate(axs.ravel()):
    ax.imshow(all_images[i])
    title = f"Strength: {selected_combinations[i][0]}, Guidance: {selected_combinations[i][1]}, Steps: {selected_combinations[i][2]}"
    ax.set_title(title)
    ax.axis('off')
plt.tight_layout()
plt.show()


Conclusion

Img2img diffusion is a powerful tool in image processing.

With DeciDiffusion, we’ve enhanced its capabilities, making image transformations more efficient. Adapting DeciDiffusion for img2img tasks was a practical step, demonstrating the model’s versatility. It’s essential to grasp the role of parameters like strength and steps as they directly impact the output. These controls give users the ability to adjust results to their needs.

As we move forward, the combination of technology and user feedback will drive further improvements in image generation.
