Description
DeciDiffusion 1.0 is an 820 million parameter text-to-image latent diffusion model trained on the LAION-v2 dataset and fine-tuned on the LAION-ART dataset.
Publishers
Deci AI Team
Submitted Version
September 13, 2023
Latest Version
N/A
Size
N/A
DeciDiffusion 1.0 is an 820 million parameter text-to-image latent diffusion model trained on the LAION-v2 dataset and fine-tuned on the LAION-ART dataset. Advanced training techniques were used to speed up training, improve training performance, and achieve better inference quality.
DeciDiffusion 1.0 is a diffusion-based text-to-image generation model. While it maintains foundational architecture elements from Stable Diffusion, such as the Variational Autoencoder (VAE) and CLIP’s pre-trained Text Encoder, DeciDiffusion introduces significant enhancements. The primary innovation is the substitution of U-Net with the more efficient U-Net-NAS, a design pioneered by Deci. This novel component streamlines the model by reducing the number of parameters, leading to superior computational efficiency.
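One quick way to see the effect of the U-Net-NAS substitution is to load the pipeline and compare parameter counts across its components. A minimal sketch, assuming the checkpoint layout shown in the usage example later in this card; the `millions` helper is ours, not a diffusers API:

```python
import torch
from diffusers import StableDiffusionPipeline

checkpoint = 'Deci/DeciDiffusion-v1-0'

# Load the pipeline, then swap in the U-Net-NAS weights from the 'flexible_unet' subfolder.
pipe = StableDiffusionPipeline.from_pretrained(checkpoint, custom_pipeline=checkpoint, torch_dtype=torch.float16)
pipe.unet = pipe.unet.from_pretrained(checkpoint, subfolder='flexible_unet', torch_dtype=torch.float16)

def millions(module: torch.nn.Module) -> float:
    # Total parameter count in millions (hypothetical helper).
    return sum(p.numel() for p in module.parameters()) / 1e6

for name in ('unet', 'vae', 'text_encoder'):
    print(f'{name}: {millions(getattr(pipe, name)):.0f}M parameters')
```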
Misuse, Malicious Use, and Out-of-Scope Use
The model must not be employed to deliberately produce or spread images that foster hostile or unwelcoming settings for individuals. This encompasses generating visuals that might be predictably upsetting, distressing, or inappropriate, as well as content that perpetuates existing or historical biases.
The model isn’t designed to produce accurate or truthful depictions of people or events. Thus, using it for such purposes exceeds its intended capabilities.
Misusing the model to produce content that harms or maligns individuals is strictly discouraged.
The model has certain limitations and may not function optimally in every scenario.
The remarkable abilities of image generation models can unintentionally amplify societal biases. DeciDiffusion was mainly trained on subsets of LAION-v2, focused on English descriptions. Consequently, non-English communities and cultures might be underrepresented, leading to a bias towards white and western norms. Outputs from non-English prompts are notably less accurate. Given these biases, users should approach DeciDiffusion with discretion, regardless of input.
Training Procedure
The model was trained in 4 phases, including a later phase trained from 870k steps at resolution 512×512 on the same dataset to learn more fine-detailed information.

DeciDiffusion 1.0 was also trained to be sample efficient, i.e., to produce high-quality results using fewer diffusion timesteps during inference, and several training techniques were applied to that end.
The following techniques were used to shorten training time:

- Using precomputed VAE and CLIP latents (see the sketch after this list)
- Using EMA only in the last phase of training
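To make the first technique concrete: because the VAE and the text encoder are frozen, their outputs can be computed once over the whole dataset and cached, removing their forward passes from the training loop. A minimal sketch, assuming the checkpoint follows the standard diffusers layout; `images` and `captions` are hypothetical inputs, and this is not Deci's actual training code:

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = 'Deci/DeciDiffusion-v1-0'

# Frozen encoders, loaded from the checkpoint's standard diffusers subfolders.
vae = AutoencoderKL.from_pretrained(checkpoint, subfolder='vae').to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained(checkpoint, subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained(checkpoint, subfolder='text_encoder').to(device).eval()

@torch.no_grad()
def precompute(images, captions):
    # images: (B, 3, H, W) tensors scaled to [-1, 1]; captions: list of strings.
    latents = vae.encode(images.to(device)).latent_dist.sample() * vae.config.scaling_factor
    tokens = tokenizer(captions, padding='max_length', max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors='pt').input_ids.to(device)
    text_embeds = text_encoder(tokens)[0]
    # Cache both to disk; the training loop then reads them instead of re-encoding.
    return latents.cpu(), text_embeds.cpu()
```

The second technique keeps an exponential moving average of the denoiser's weights during the final phase only; a generic EMA update looks like this (again a sketch, not the original training code):

```python
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.9999):
    # After each optimizer step, the shadow weights drift slowly toward the online weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p.to(ema_p.dtype), alpha=1.0 - decay)
```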
On average, images generated by DeciDiffusion after 30 iterations achieve Fréchet Inception Distance (FID) scores comparable to those generated by Stable Diffusion 1.5 after 50 iterations. However, many recent articles question the reliability of FID scores, warning that FID results tend to be fragile, inconsistent with human judgment (for example, on MNIST and in subjective evaluations), statistically biased, and prone to scoring memorization of the dataset more favorably than generalization beyond it.
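For context, FID compares Inception-feature statistics between a set of real images and a set of generated images. A minimal sketch of computing it with the torchmetrics library (an illustration of the metric, not the evaluation code used here; random tensors stand in for real data):

```python
# pip install torchmetrics[image]
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Both sets are uint8 image batches of shape (N, 3, H, W); random data for illustration.
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better
```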
Given this skepticism about FID’s reliability, we chose to assess DeciDiffusion 1.0’s sample efficiency by performing a user study against Stable Diffusion 1.5. Our source for image captions was the PartiPrompts benchmark, which was introduced to compare large text-to-image models on various challenging prompts.
For our study, we chose 10 random prompts and, for each prompt, generated 3 images with Stable Diffusion 1.5 configured to run for 50 iterations and 3 images with DeciDiffusion configured to run for 30 iterations. We then presented the resulting 30 side-by-side comparisons to a group of professionals, who voted based on adherence to the prompt and aesthetic value.
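A sketch of how one such side-by-side pair can be produced, using the iteration counts from the study (the Stable Diffusion checkpoint name and the prompt are illustrative assumptions; DeciDiffusion is loaded as in the usage example later in this card):

```python
import torch
from diffusers import StableDiffusionPipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Baseline: Stable Diffusion v1.5 (assumed checkpoint name).
sd = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16
).to(device)

# DeciDiffusion, loaded the same way as in the usage example below.
ckpt = 'Deci/DeciDiffusion-v1-0'
deci = StableDiffusionPipeline.from_pretrained(ckpt, custom_pipeline=ckpt, torch_dtype=torch.float16)
deci.unet = deci.unet.from_pretrained(ckpt, subfolder='flexible_unet', torch_dtype=torch.float16)
deci = deci.to(device)

prompt = 'a portrait of a corgi wearing a top hat'  # illustrative, not from PartiPrompts
sd_img = sd(prompt, num_inference_steps=50).images[0]      # Stable Diffusion v1.5, 50 iterations
deci_img = deci(prompt, num_inference_steps=30).images[0]  # DeciDiffusion 1.0, 30 iterations
```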
According to the results, DeciDiffusion at 30 iterations exhibits an edge in aesthetics, but when it comes to prompt alignment, it’s on par with Stable Diffusion at 50 iterations.
The following table summarizes our survey results:
| Answer | Better image aesthetics | Better prompt alignment |
|---|---|---|
| DeciDiffusion 1.0, 30 iterations | 41.1% | 20.8% |
| Stable Diffusion v1.5, 50 iterations | 30.5% | 18.8% |
| On par | 26.3% | 39.1% |
| Neither | 2.1% | 11.4% |
The following tables provide an image latency comparison between DeciDiffusion 1.0 and Stable Diffusion v1.5.
DeciDiffusion 1.0 vs. Stable Diffusion v1.5 at FP16 precision
| Inference Tool + Iterations | DeciDiffusion 1.0 on A10 (seconds/image) | Stable Diffusion v1.5 on A10 (seconds/image) |
|---|---|---|
| PyTorch, 50 iterations | 2.11 | 2.95 |
| Infery, 50 iterations | 1.55 | 2.08 |
| PyTorch, 35 iterations | 1.52 | – |
| Infery, 35 iterations | 1.07 | – |
| PyTorch, 30 iterations | 1.29 | – |
| Infery, 30 iterations | 0.98 | – |
You can use the DeciDiffusion model for text-to-image generation. Below, see how easily the model can be loaded and run.
```python
# pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = "Deci/DeciDiffusion-v1-0"

# custom_pipeline pulls DeciDiffusion's pipeline code from the same repo.
pipeline = StableDiffusionPipeline.from_pretrained(checkpoint, custom_pipeline=checkpoint, torch_dtype=torch.float16)

# Swap in the U-Net-NAS denoiser stored in the 'flexible_unet' subfolder.
pipeline.unet = pipeline.unet.from_pretrained(checkpoint, subfolder='flexible_unet', torch_dtype=torch.float16)
pipeline = pipeline.to(device)

img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars']).images[0]
```
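The returned `img` is a PIL image, so `img.save('astronaut.png')` writes it to disk. With the pipeline loaded as above, the PyTorch figures in the latency tables can be approximated by a simple wall-clock measurement (a rough sketch; it assumes a CUDA device such as the A10 used in the tables, and Infery, Deci's proprietary inference SDK, is not shown):

```python
import time
import torch

# Warm-up run so one-time costs (CUDA context, kernel selection) don't skew the timing.
_ = pipeline(prompt=['warm-up'], num_inference_steps=30).images[0]

torch.cuda.synchronize()
start = time.perf_counter()
_ = pipeline(prompt=['A photo of an astronaut riding a horse on Mars'],
             num_inference_steps=30).images[0]
torch.cuda.synchronize()

print(f'{time.perf_counter() - start:.2f} seconds/image at 30 iterations')
```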
We’d love your feedback on the information presented in this card. Please also share any unexpected results.
For a short meeting with the SuperGradients team, use this link and choose your preferred time.