Text To Image Generation With Stable Diffusion XL

Text-to-image creators like DALL-E or Bing Image Creator have gained significant popularity in recent times, undoubtedly transforming the landscape of the Social Web. Numerous online images come with legally binding creative rights, either prohibiting their use entirely or requiring proper attribution. Text-to-image generators have the advantage of providing permissible creative content, freeing us from the constraints of usage restrictions or attribution requirements. AI image generators not only grant us creative ownership of the images but also empower us to craft unique visuals for our exact content needs. Furthermore, these tools play a crucial role in fostering a semantic web: pairing AI-generated images with semantic metadata expands visual semantic understanding and facilitates a more comprehensive, multi-modal understanding of information and context.

However, these services are proprietary and operate exclusively online. In this blog post, we will delve into Stable Diffusion XL, an open-source text-to-image model developed by Stability AI, and explore how to use it on Google Colab and in our local environment.
Text-to-image models achieve optimal performance when running on a GPU, but they often require substantial memory. We will leverage Google Colab and its GPU to address this. To get started, we must set the GPU as our hardware accelerator. First, open the "Runtime" menu in the Google Colab notebook and select "Change Runtime Type" from the dropdown menu.

Then, in the "Hardware Accelerator" section, choose "T4 GPU" and click "Save" to apply the changes.

After selecting our runtime, we must install the packages this creator requires. Execute the following code within a cell in Google Colab; the exclamation mark tells Colab to run the command in the system shell rather than in the Python environment.

!pip install diffusers
!pip install invisible_watermark transformers accelerate safetensors

Then, let's import the necessary modules.

from diffusers import DiffusionPipeline
import torch
import gc

# Keep references at module level so re-running cells doesn't reload the models
base = None
refiner = None

Afterward, let's initialize our base creator model.

if not base:
  base = DiffusionPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-base-1.0",
      torch_dtype=torch.float16,  # half precision to reduce GPU memory usage
      use_safetensors=True,
      variant="fp16"  # download the fp16 weight files
  )
  # base.enable_model_cpu_offload()  # alternative to .to('cuda') when GPU memory is tight
  base.to('cuda')

We also have the option to redirect the output of the base model to a refiner model for more detailed results. This step is optional.

if not refiner:
  refiner = DiffusionPipeline.from_pretrained(
      "stabilityai/stable-diffusion-xl-refiner-1.0",
      text_encoder_2=base.text_encoder_2,  # share the base model's second text encoder
      vae=base.vae,  # share the base model's VAE to save memory
      torch_dtype=torch.float16,
      use_safetensors=True,
      variant="fp16",
  )
  # refiner.enable_model_cpu_offload()  # alternative to .to('cuda') when GPU memory is tight
  refiner.to('cuda')

Now, let's execute the base model. Feel free to input any text you desire into the "prompt" variable; for this tutorial, we will generate an image depicting an orchestra of animals. Upon running this code, we observed a peak GPU memory usage of 12.7 GB.

prompt = "An orchestra of animals"

# This is for garbage collection
image = None
gc.collect()
torch.cuda.empty_cache()

# Fix the seed for reproducible results. Remove for production use.
# torch.manual_seed(123456)

image = base(prompt=prompt).images[0]
image.save("result.png")

If you wish to utilize the refiner as well, execute the following code instead. Upon running this code, we observed a peak GPU memory usage of 14.7 GB.

# Define the total number of steps and the fraction of steps run on each expert (80/20)
n_steps = 40
high_noise_frac = 0.8

prompt = "An orchestra of animals"

# This is for garbage collection
image = None
gc.collect()
torch.cuda.empty_cache()

# Fix the seed for reproducible results. Remove for production use.
# torch.manual_seed(123456)

# run both experts
image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images

# This is for garbage collection
gc.collect()
torch.cuda.empty_cache()

image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]
image.save("refined.png")

As evident from the images below, the refiner produces a noticeably clearer result, albeit at the cost of the higher GPU memory usage noted above.

Image without refiner
Image with refiner
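
To inspect the two outputs side by side, you can stitch the saved files together with Pillow (which is installed alongside diffusers). A minimal sketch, assuming result.png and refined.png are in the working directory:

from PIL import Image

left = Image.open("result.png")    # base-only output
right = Image.open("refined.png")  # base + refiner output

# Place the two images next to each other on a single canvas
canvas = Image.new("RGB", (left.width + right.width, max(left.height, right.height)))
canvas.paste(left, (0, 0))
canvas.paste(right, (left.width, 0))
canvas.save("comparison.png")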

If you intend to execute this code on your local computer and do not have a GPU with 16 GB of memory, you can use the following code after installing Python and the necessary modules with "pip" as described above. Running without GPU utilization, we observed a peak RAM usage of 16.7 GB.

from diffusers import DiffusionPipeline
import torch
import gc

# Load the base text-to-image model. (The refiner checkpoint on its own is an
# image-to-image pipeline that requires an input image, so it cannot perform
# plain text-to-image generation.)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    use_safetensors=True,
    variant="fp16"  # smaller download; weights load as float32 when no torch_dtype is given
)
pipe.to("cpu")

prompt = "An orchestra of animals"

# This is for garbage collection
image = None
gc.collect()

# Fix the seed for reproducible results. Remove for production use.
# torch.manual_seed(123456)

images = pipe(prompt=prompt).images

images[0].save("result.jpg")

The image below was generated after this CPU-only run completed. However, the execution took approximately 30 minutes, in contrast to the GPU code above, which ran within 30-60 seconds once DiffusionPipeline had downloaded and loaded the model.
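
If you would like to time a run on your own hardware, the standard library is enough; a minimal sketch:

import time

start = time.perf_counter()
images = pipe(prompt=prompt).images
print(f"Generation took {time.perf_counter() - start:.0f} seconds")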

In conclusion, leveraging AI-generated images provides significant advantages for content creators. While the most visible tools are proprietary online services, open-source models such as Stable Diffusion XL can also be executed on local computers, provided enough resources are available.
