paella - Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

Open In Colab LAION Blog Post

Paella

Conditional text-to-image generation has seen countless recent improvements in terms of quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model requiring fewer than 10 steps to sample high-fidelity images, using a speed-optimized architecture that allows sampling a single image in less than 500 ms while having 573M parameters. The model operates on a compressed & quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.

[Image: collage]

Update 12.04

Since the paper release, we have worked intensively to bring Paella to a level comparable to other state-of-the-art models, and with this release we come a step closer to that goal. However, our main intention is not (at least for now) to build the greatest text-to-image model out there; it is to make text-to-image models technically approachable for people outside the field. Many models have codebases with many thousands of lines of code, which makes it very hard to dive in and understand them. That is the contribution we are proudest of with Paella: the training and sampling code is minimalistic and can be understood in a few minutes, making further extensions, quick experiments, and idea testing extremely fast. For instance, the entire sampling function can be written in just 12 lines of code; a sketch of that loop follows below.
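To give a flavor of what such a loop looks like, here is a minimal, hedged sketch of discrete denoising sampling on a vector-quantized latent space. The `model(x, clip_embedding, t)` signature, the codebook size, and the renoising schedule are illustrative assumptions, not the exact code from the repository:

```python
import torch

@torch.no_grad()
def sample(model, clip_embedding, steps=10, size=(32, 32), vocab_size=8192, device="cuda"):
    # Start from uniformly random codebook indices at every latent position.
    x = torch.randint(0, vocab_size, (1, *size), device=device)
    for i, t in enumerate(torch.linspace(1.0, 0.0, steps + 1, device=device)[1:]):
        # Assumed signature: logits over codebook entries, shape (1, H, W, vocab_size).
        logits = model(x, clip_embedding, t.expand(1))
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs.view(-1, vocab_size), 1).view(1, *size)
        if i < steps - 1:
            # Renoise: re-randomize a t-fraction of positions so later
            # steps can still correct earlier mistakes.
            mask = torch.rand(x.shape, device=device) < t
            x = torch.where(mask, torch.randint_like(x, vocab_size), sampled)
        else:
            x = sampled
    return x  # decode with the VQGAN decoder to obtain the final image
```

The key idea is that every step resamples all tokens while a shrinking fraction of them is re-randomized, which is why only a handful of steps suffice.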

Please find all details about the model and how it was trained in our preprint on arXiv.


Code

We especially want to highlight how little code is necessary to run & train Paella: the training & sampling code fits in under 140 lines. We hope to make the field of generative AI, and text-to-image in particular, more accessible to more people this way. To understand the basic logic, take a look at the main folder. For a more advanced training script, including mixed precision, distributed training, better logging, and all conditioning models, take a look at the distributed folder.

Models

| Model | Download | Parameters | Conditioning |
| --- | --- | --- | --- |
| Paella v3 | Huggingface | 1B (+1B prior) | ByT5-XL, CLIP-H-Text, CLIP-H-Image |

Sampling

Open In Colab

For sampling, you can just take a look at the sampling.ipynb notebook. :sunglasses:
Note: Since we condition on ByT5-XL, CLIP-H-Text, and CLIP-H-Image, sampling with the model unfortunately takes at least 30 GB of RAM. We are hoping to use smaller conditioning models in the future.
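If you are curious how such conditioning embeddings are produced, here is a hedged sketch of computing a CLIP-H text embedding with open_clip. The model tag, pretrained weights, and pooling are assumptions for illustration; consult the repository for the actual conditioning code:

```python
import torch
import open_clip

# Assumed tags: ViT-H-14 / laion2b_s32b_b79k; the exact checkpoints Paella
# conditions on may differ.
clip_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

with torch.no_grad():
    tokens = tokenizer(["a delicious pan of paella on a wooden table"])
    text_embedding = clip_model.encode_text(tokens)  # shape (1, 1024) for ViT-H-14
```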

Train your own Paella

Depending on how you want to train Paella, we provide code for single-GPU as well as multi-GPU / multi-node training. The main file for training is train.py; you can adjust all hyperparameters to your own needs. The distributed training code includes a webdataset dataloader, whereas in the single-GPU code you have to supply your own dataloader. Make sure it returns a tuple of (images, captions), where images is a torch.Tensor of shape batch_size x channels x height x width and captions is a list of length batch_size (see the sketch below). To start the training, run python3 train.py for the single-GPU case; for the multi-GPU case we provide a Slurm script for launching the training, which you can find here.
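A minimal dataloader satisfying that (images, captions) contract could look like the following sketch; the (path, caption) input format is hypothetical and just for illustration:

```python
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T

class ImageCaptionDataset(Dataset):
    """Yields (image, caption) pairs; the (path, caption) list is a hypothetical input format."""

    def __init__(self, samples, image_size=256):
        self.samples = samples
        self.transform = T.Compose([
            T.Resize(image_size),
            T.CenterCrop(image_size),
            T.ToTensor(),  # -> channels x height x width, float in [0, 1]
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption = self.samples[idx]
        return self.transform(Image.open(path).convert("RGB")), caption

# The default collate stacks images into batch_size x channels x height x width
# and gathers the captions into a list of length batch_size, as train.py expects.
loader = DataLoader(ImageCaptionDataset([("example.jpg", "an example caption")]),
                    batch_size=1, shuffle=True)
```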

License

The model code and weights are released under the MIT license.

