Introduction
DeciDiffusion, developed by Deci, is a robust text-to-image latent diffusion model.
With 1.02 billion parameters, it surpasses the performance of models like Stable Diffusion v1.5, delivering high-quality images efficiently. When integrated with Deci’s Inference SDK, Infery, DeciDiffusion’s speed is optimized, producing results quickly on NVIDIA A10G GPUs. The model’s architecture, driven by Deci’s Neural Architecture Search engine, AutoNAC, is tailored for optimal inference efficiency.
While DeciDiffusion excels in text-to-image generation, there’s a clear opportunity to expand its capabilities to image-to-image translation—a domain with vast applications from design to medical imaging.
Adapting DeciDiffusion can meet the industry’s need for fast and precise image-to-image models.
In this guide, I’ll walk you through the process of adapting DeciDiffusion for image-to-image tasks, highlighting the necessary architectural changes and practical steps.
The Science Behind Image-to-Image Translation
Image-to-image translation is the process of taking an input image and transforming it into a different visual representation.
At its core, it’s mapping the content of one image to another while retaining the essence of the original. This can range from tasks like turning a sketch into colourful artwork, changing day scenes to night, or even converting satellite images into detailed maps.
However, this translation has its challenges.
The model must discern intricate details, understand context, and produce visually appealing and accurate outputs. Minor discrepancies can lead to significant inaccuracies in the translated image.
That’s why a robust model is crucial.
Diving into SDEdit: HuggingFace’s Diffusers Library Technique
SDEdit, or Stochastic Differential Editing, is the technique for image-to-image diffusion that HuggingFace’s Diffusers library uses.
SDEdit leverages the power of stochastic differential equations (SDEs) for image synthesis.
SDEdit synthesizes images by iterative denoising through an SDE.
Here’s how it works: Given an input image with user guidance, such as a sketch or outline, SDEdit introduces noise to this input. It then employs the SDE to denoise the resulting image, enhancing its realism. This iterative denoising mechanism allows SDEdit to balance staying true to the user’s input and producing a realistic output.
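To make the noise-then-denoise loop concrete, here’s a minimal conceptual sketch written against the Diffusers scheduler interface. It illustrates the SDEdit idea rather than the library’s actual img2img implementation, and `denoiser` is a hypothetical stand-in for a noise-prediction model such as a U-Net.

```python
import torch

def sdedit_sketch(guide_latent, denoiser, scheduler, strength=0.6, num_steps=30):
    """Perturb a guide latent part-way along the diffusion, then denoise it back."""
    scheduler.set_timesteps(num_steps)

    # Higher strength -> start deeper in the noise schedule -> less of the guide survives
    start = max(min(int(num_steps * strength), num_steps), 1)
    timesteps = scheduler.timesteps[-start:]

    # Step 1: perturb the guide image's latent with noise at the chosen starting timestep
    noise = torch.randn_like(guide_latent)
    latent = scheduler.add_noise(guide_latent, noise, timesteps[:1])

    # Step 2: iteratively denoise back toward a clean, realistic sample
    for t in timesteps:
        noise_pred = denoiser(latent, t)  # hypothetical noise predictor (e.g., a U-Net)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent
```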
But why opt for SDEdit over other methods?
Traditional Generative Adversarial Networks (GANs) often necessitate additional training data or specialized loss functions for individual applications.
SDEdit sidesteps these requirements.
It doesn’t demand task-specific training or inversions, making it a versatile tool for various image synthesis tasks.
In practical terms, with SDEdit, developers and artists can achieve a harmonious blend of realism and faithfulness in their generated images without the overhead of extensive training or customization.
```python
%%capture
!pip install diffusers transformers accelerate
```
```python
import os
import time
from io import BytesIO

import itertools
import random

import matplotlib.pyplot as plt
import requests
import torch
from IPython.core.display import display, HTML
from PIL import Image
from tqdm.notebook import tqdm

from transformers import CLIPImageProcessor, CLIPTokenizer, CLIPTextModel
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionPipeline
from diffusers.models import AutoencoderKL, UNet2DConditionModel
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from diffusers.schedulers import DDIMScheduler
```
Instantiating and Adapting DeciDiffusion for Image-to-Image Tasks
Initial Instantiation of DeciDiffusion:
Before adapting DeciDiffusion for image-to-image tasks, you first instantiate DeciDiffusion using the `StableDiffusionPipeline`.
In this step:
- The `from_pretrained` method is used to load the `Deci/DeciDiffusion-v1-0` model.
- The `custom_pipeline` argument specifies the custom pipeline to be used, which in this case is also set to `Deci/DeciDiffusion-v1-0`.
- The data type for PyTorch tensors is explicitly set to `torch.float16` for optimized performance.
```python
deci_diffusion_pipeline = StableDiffusionPipeline.from_pretrained(
    'Deci/DeciDiffusion-v1-0',
    custom_pipeline='Deci/DeciDiffusion-v1-0',
    torch_dtype=torch.float16,
)
```
The Specialized UNet-NAS in DeciDiffusion
U-Net is a popular convolutional neural network architecture primarily used for biomedical image segmentation.
However, DeciDiffusion takes this a step further by utilizing a specialized variant called UNet-NAS, whose architecture was discovered through Neural Architecture Search rather than designed by hand. The search optimizes the network structure for inference performance on the task at hand.
Instantiating UNet-NAS for DeciDiffusion:
To harness the capabilities of UNet-NAS within DeciDiffusion, you instantiate it from the `deci_diffusion_pipeline`.
In this instantiation:
- The `from_pretrained` method is employed to load the UNet-NAS model specific to `Deci/DeciDiffusion-v1-0`.
- The `subfolder` argument points to `flexible_unet`, indicating the location where the specialized U-Net model is stored.
- As in the earlier instantiation, the data type for PyTorch tensors is set to `torch.float16` for optimized computational performance.
Why UNet-NAS?
The adoption of UNet-NAS in DeciDiffusion offers several advantages:
- Optimized Architecture: UNet-NAS automatically searches for the most efficient network structure, ensuring optimal performance for image-to-image tasks.
- Flexibility: Given its neural architecture search capability, UNet-NAS can be tailored to various tasks, making it a versatile choice for diverse image processing challenges.
- Enhanced Performance: By leveraging a network structure that’s specifically optimized for the task at hand, UNet-NAS can deliver superior results compared to standard U-Net architectures.
```python
unet_nas = deci_diffusion_pipeline.unet.from_pretrained(
    'Deci/DeciDiffusion-v1-0',
    subfolder='flexible_unet',
    torch_dtype=torch.float16,
)
```
Loading the flexible U-Net prints a long warning listing config attributes that "were passed to FlexibleUNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file."
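As a quick, optional sanity check (not part of the original workflow), you can count the parameters of the U-Net you just loaded; the exact figure depends on the checkpoint.

```python
# Count the parameters of the NAS-optimized U-Net (illustrative check)
n_params = sum(p.numel() for p in unet_nas.parameters())
print(f"UNet-NAS parameters: {n_params / 1e6:.1f}M")
```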
Preparing DeciDiffusion for Image-to-Image Tasks Using StableDiffusionImg2ImgPipeline
What is the `StableDiffusionImg2ImgPipeline`?
The `StableDiffusionImg2ImgPipeline` is a specialized pipeline provided by HuggingFace’s Diffusers library.
It’s designed to apply stable diffusion techniques to images, allowing for sophisticated image transformations based on the principles of stochastic differential editing.
How to Use `StableDiffusionImg2ImgPipeline` to Adapt DeciDiffusion for Image-to-Image Tasks
To transform DeciDiffusion into an image-to-image model, you can create a subclass of the `StableDiffusionImg2ImgPipeline`. This subclass, `DeciDiffusionPipeline_img2img`, will inherit the foundational capabilities of the parent pipeline but will be tailored with custom functionalities specific to DeciDiffusion.
```python
class DeciDiffusionPipeline_img2img(StableDiffusionImg2ImgPipeline):
    deci_default_number_of_iterations = 30
    deci_default_guidance_rescale = 0.7

    def __init__(self,
                 vae: AutoencoderKL,
                 text_encoder: CLIPTextModel,
                 tokenizer: CLIPTokenizer,
                 unet: UNet2DConditionModel,
                 scheduler: DDIMScheduler,
                 safety_checker: StableDiffusionSafetyChecker,
                 feature_extractor: CLIPImageProcessor,
                 requires_safety_checker: bool = True):
        ...  # full implementation shown below
```
Components of the `DeciDiffusionPipeline_img2img`
- `vae` (`AutoencoderKL`): A Variational Auto-Encoder (VAE) model that encodes and decodes images to and from latent representations. VAEs are powerful tools for generating new images by sampling from the latent space.
- `text_encoder` (`CLIPTextModel`): A frozen text encoder designed to convert textual descriptions into embeddings that can be used in conjunction with image data.
- `tokenizer` (`CLIPTokenizer`): The `CLIPTokenizer` is used to tokenize text, converting human-readable text into a format that the `CLIPTextModel` can understand and process.
- `unet` (`UNet2DConditionModel`): The `UNet2DConditionModel` is a neural network architecture designed for image denoising. It’s structured to process the encoded image latents and refine them, enhancing the quality of the generated images.
- `scheduler` (`SchedulerMixin`): This component determines how the `unet` denoises the encoded image latents. It can be one of several types, including `DDIMScheduler`, `LMSDiscreteScheduler`, or `PNDMScheduler`.
- `safety_checker` (`StableDiffusionSafetyChecker`): A classification module that evaluates generated images to determine if they could be considered offensive or harmful. It’s an essential component to ensure the responsible generation of images. More details about potential harms can be found in the model card.
- `feature_extractor` (`CLIPImageProcessor`): The `CLIPImageProcessor` extracts features from the generated images. These features are then used as inputs to the `safety_checker`, ensuring that the generated images adhere to safety standards.
```python
class DeciDiffusionPipeline_img2img(StableDiffusionImg2ImgPipeline):
    # DeciDiffusion's default generation settings
    deci_default_number_of_iterations = 30
    deci_default_guidance_rescale = 0.7

    def __init__(self,
                 vae: AutoencoderKL,
                 text_encoder: CLIPTextModel,
                 tokenizer: CLIPTokenizer,
                 unet: UNet2DConditionModel,
                 scheduler: DDIMScheduler,
                 safety_checker: StableDiffusionSafetyChecker,
                 feature_extractor: CLIPImageProcessor,
                 requires_safety_checker: bool = True):
        super().__init__(vae=vae,
                         text_encoder=text_encoder,
                         tokenizer=tokenizer,
                         unet=unet,
                         scheduler=scheduler,
                         safety_checker=safety_checker,
                         feature_extractor=feature_extractor,
                         requires_safety_checker=requires_safety_checker)
        self.register_modules(vae=vae,
                              text_encoder=text_encoder,
                              tokenizer=tokenizer,
                              unet=unet,
                              scheduler=scheduler,
                              safety_checker=safety_checker,
                              feature_extractor=feature_extractor)

    def __call__(self, *args, **kwargs):
        # Fall back to DeciDiffusion's default iteration count when not specified
        if "num_inference_steps" not in kwargs:
            kwargs.update({'num_inference_steps': self.deci_default_number_of_iterations})
        return super().__call__(*args, **kwargs)
```
Adapting DeciDiffusion for Image-to-Image Tasks
With the `deci_diffusion_pipeline` instance in place, you then use its components to instantiate the `DeciDiffusionPipeline_img2img`, a subclass of `StableDiffusionImg2ImgPipeline`.
Here:
- Components like `vae`, `text_encoder`, `tokenizer`, `scheduler`, `safety_checker`, and `feature_extractor` are sourced directly from the previously instantiated `deci_diffusion_pipeline`.
- The `unet` component is set to the custom `unet_nas` loaded earlier.
- The safety checker requirement is turned off by setting `requires_safety_checker` to `False`.
This approach ensures that DeciDiffusion is seamlessly adapted to handle image-to-image tasks, leveraging the foundational capabilities of the `StableDiffusionImg2ImgPipeline` and customizing it to meet specific requirements.
```python
pipe = DeciDiffusionPipeline_img2img(
    vae=deci_diffusion_pipeline.vae,
    text_encoder=deci_diffusion_pipeline.text_encoder,
    tokenizer=deci_diffusion_pipeline.tokenizer,
    unet=unet_nas,
    scheduler=deci_diffusion_pipeline.scheduler,
    safety_checker=deci_diffusion_pipeline.safety_checker,
    feature_extractor=deci_diffusion_pipeline.feature_extractor,
    requires_safety_checker=False,
).to('cuda')
```
Let’s see what we can transform the following image into.
```python
# Download the starting image
url = "https://www.easypeasyandfun.com/wp-content/uploads/2021/07/Step-8-1.png"
response = requests.get(url)

init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image
```
```python
prompt = "A cute baby dragon with blue and green scales, artstation painting"
```
```python
# Helper function to plot a series of generated images side by side
def plot_images(images, values, title):
    """
    Plot a series of images side by side with their respective titles.

    Parameters:
    - images (list): A list of images to be plotted.
    - values (list): A list of values corresponding to each image, used for the title.
    - title (str): The main title for the images, used as a prefix for each image's title.

    Returns:
    None. Displays the images and saves the plot as a PNG file.
    """
    # Create a subplot with 1 row and as many columns as there are images/values
    fig, axs = plt.subplots(1, len(values), figsize=(15, 5))

    # Loop through each axis and image to display them
    for i, ax in enumerate(axs):
        ax.imshow(images[i])                   # Display the image on the axis
        ax.set_title(f"{title}: {values[i]}")  # Title the image using the provided value
        ax.axis('off')                         # Turn off the axis for a cleaner look

    plt.tight_layout()                          # Adjust the layout to minimize overlaps
    plt.savefig(f"{title}_experiments.png")     # Save the combined plot as a PNG file
    plt.show()                                  # Display the plot
```
Understanding generation parameters for Img2Img pipelines
There are a few important parameters that you can use to control the result of the image-to-image pipeline:
- `strength`
- `guidance_scale`
- `num_inference_steps`
Assessing the impact of `strength` on the generated image
In the image-to-image mode of Stable Diffusion, the `strength` parameter plays a pivotal role in determining the noise level introduced to the starting image during its generation.
- Near 0: Setting `strength` close to 0 yields an outcome nearly identical to the original image.
- At 1: A `strength` value of 1 introduces the maximum noise, making the image vastly different from the original. At this setting, the reference `image` is essentially disregarded, and the denoising process runs for the full count of iterations specified in `num_inference_steps`.
For a harmonious blend of the original image’s elements and new concepts, it’s advised to start with `strength` values between 0.4 and 0.6 and experiment from there. This range offers a balanced transformation, ensuring both familiarity and novelty in the output.
The strength parameter essentially gauges the transformation intensity of the reference image, with its potential range spanning from 0 to 1.
As the strength value increases, so does the noise added, with the number of denoising steps being contingent on the initial noise level.
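As a rough illustration of that interplay (the exact bookkeeping lives inside the Diffusers img2img pipeline), the number of denoising steps actually executed scales with `strength`:

```python
# Approximate relationship between strength and the denoising work actually done
num_inference_steps = 30
for strength in (0.3, 0.6, 0.9):
    effective_steps = min(int(num_inference_steps * strength), num_inference_steps)
    print(f"strength={strength:.1f} -> roughly {effective_steps} denoising steps run")
```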
```python
all_images_strength = []
strength_values = [0.75 + i * 0.05 for i in range(6)]

for strength in strength_values:
    image = pipe(prompt=prompt, image=init_image, strength=strength).images[0]
    all_images_strength.append(image)

plot_images(all_images_strength, strength_values, "Strength")
```
How `guidance_scale` impacts the image
The guidance scale is a parameter that controls how closely the generated image adheres to the text prompt.
It balances creativity and adherence to the prompt, and the right value often depends on the specific use case and the complexity of the prompt.
A higher guidance scale ensures greater adherence to the prompt, but it doesn’t always lead to better results: while it helps the generated image follow the input, it can also reduce the diversity and quality of the output.
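Under the hood, this knob scales classifier-free guidance: the pipeline predicts the noise twice per step, once with and once without the prompt, and pushes the result toward the prompted direction. Here’s a simplified sketch of that update (the real one happens inside the pipeline’s denoising loop):

```python
import torch

def apply_guidance(noise_pred_uncond: torch.Tensor,
                   noise_pred_text: torch.Tensor,
                   guidance_scale: float) -> torch.Tensor:
    # guidance_scale = 1 applies the prompt without amplification;
    # larger values amplify the prompt-conditioned direction.
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```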
When using the guidance scale feature, it is vital to understand how it affects the generated image.
If the value is set at 1, the text prompt will have minimal impact on the output.
However, if the value is set at 20, the text prompt will be strictly followed, which can decrease image quality.
Values between 7 and 12 generally achieve the most creative and artistic results, and values up to 15 can still produce quality images with minimal artifacts.
A good starting point is a guidance scale between 7 and 9; consider increasing it if the generated image doesn’t align with the prompt.
Extreme values like 1 and 20 should be avoided.
For prompts that are more complex and have detailed specifications, it might be beneficial to start with a higher guidance scale between 12 and 16. This is because such prompts often require the generated image to capture finer details, which a higher guidance scale can facilitate.
```python
all_images_guidance_scale = []
guidance_scale_values = [i * 1.5 for i in range(3, 10)]

for guidance_scale in guidance_scale_values:
    image = pipe(prompt=prompt, image=init_image, guidance_scale=guidance_scale).images[0]
    all_images_guidance_scale.append(image)

plot_images(all_images_guidance_scale, guidance_scale_values, "Guidance Scale")
```
The `num_inference_steps` parameter
The `num_inference_steps` parameter in diffusion models plays a crucial role in determining the quality of the generated image.
Diffusion models operate iteratively: they start from random noise and progressively denoise it, guided by the text input.
With each step, some noise is removed, enhancing the image’s quality.
The process halts once the specified number of steps is reached.
More denoising steps usually lead to a higher quality image at the expense of slower inference.
This parameter is modulated by `strength`: as noted above, higher strength values mean more of the specified steps are actually run.
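If you want to put numbers on the quality/latency trade-off for your own hardware, timing a single call is enough; actual figures will vary by GPU and step count.

```python
import time

# Time one generation to see how the step count affects latency (illustrative)
start = time.perf_counter()
_ = pipe(prompt=prompt, image=init_image, num_inference_steps=50).images[0]
print(f"50 steps took {time.perf_counter() - start:.1f} s")
```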
```python
# Loop over the num_inference_steps values
num_inference_steps_values = [i for i in range(30, 101, 10)]
all_images_num_inference_steps = []

for num_inference_steps in num_inference_steps_values:
    image = pipe(prompt=prompt, image=init_image, num_inference_steps=num_inference_steps).images[0]
    all_images_num_inference_steps.append(image)

plot_images(all_images_num_inference_steps, num_inference_steps_values, "Steps")
```
Random combinations of the parameters
And to round out our discussion, we can sample a few random combinations of `strength`, `guidance_scale`, and `num_inference_steps` and compare the results side by side.
```python
num_inference_steps_values = [i for i in range(30, 101, 10)]
guidance_scale_values = [i * 1.5 for i in range(3, 8)]
strength_values = [0.75 + i * 0.05 for i in range(4)]

# Generate all combinations using itertools.product
all_combinations = list(itertools.product(strength_values, guidance_scale_values, num_inference_steps_values))

# Randomly select 9 combinations
selected_combinations = random.sample(all_combinations, 9)

# Generate images for each combination
all_images = []
for strength, guidance_scale, num_inference_steps in selected_combinations:
    image = pipe(prompt=prompt,
                 image=init_image,
                 strength=strength,
                 guidance_scale=guidance_scale,
                 num_inference_steps=num_inference_steps).images[0]
    all_images.append(image)

# Plot the images on a 3x3 grid
fig, axs = plt.subplots(3, 3, figsize=(15, 15))
for i, ax in enumerate(axs.ravel()):
    ax.imshow(all_images[i])
    title = f"Strength: {selected_combinations[i][0]}, Guidance: {selected_combinations[i][1]}, Steps: {selected_combinations[i][2]}"
    ax.set_title(title)
    ax.axis('off')

plt.tight_layout()
plt.show()
```
Conclusion
Img2img diffusion is a powerful tool in image processing.
With DeciDiffusion, we’ve enhanced its capabilities, making image transformations more efficient. Adapting DeciDiffusion for img2img tasks was a practical step, demonstrating the model’s versatility. It’s essential to grasp the role of parameters like strength and steps as they directly impact the output. These controls give users the ability to adjust results to their needs.
As we move forward, the combination of technology and user feedback will drive further improvements in image generation.