Project description

beprepared is an easy and efficient way to prepare high quality image datasets for diffusion model fine-tuning.

It facilitates both human- and machine-driven data prep work in a non-destructive environment that aggressively avoids duplicated effort, even as your workflow evolves.

It is the most efficient way for one person to prepare image datasets with thousands of images or more.

The Problem

A typical data prep workflow may look like this:

  • Scrape images from the web
  • Filter out images that are too small or low quality
  • Manually select images that are relevant to your task
  • Auto-caption images using GPT-4o, JoyCaption, Llama 3.2, BLIP3, xGen-mm, or another VLM
  • Manually tag images with concepts that are not understood by the VLM
  • Use an LLM prompt to compose tags + captions into a final caption, perhaps introducing additional rules
  • Perform caption augmentation by generating a few caption variations for each image
  • Create a training directory with foo.jpg, foo.txt, bar.jpg, bar.txt, ...
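
The final step above, a directory of paired foo.jpg/foo.txt files, can be sketched in plain Python. This is an illustrative stand-in for the manual approach; the directory names, the captions dict, and the export_training_dir helper are all hypothetical:

```python
import shutil, tempfile
from pathlib import Path

def export_training_dir(captions, src, dst):
    """Copy each captioned image into dst and write a matching .txt file."""
    dst.mkdir(parents=True, exist_ok=True)
    for name, caption in captions.items():
        shutil.copy(src / name, dst / name)
        (dst / Path(name).with_suffix(".txt").name).write_text(caption)

# Demo with placeholder files standing in for real images
base = Path(tempfile.mkdtemp())
src, dst = base / "raw", base / "train"
src.mkdir()
for fname in ("foo.jpg", "bar.jpg"):
    (src / fname).write_bytes(b"\xff\xd8")  # fake JPEG bytes
export_training_dir(
    {"foo.jpg": "a photo of a dog", "bar.jpg": "a photo of a cat"}, src, dst
)
```

Many diffusion fine-tuning trainers consume exactly this image-plus-.txt pairing.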

While it's possible to do this manually with the help of a bunch of Python scripts that work on your .jpg and .txt files, it can be very cumbersome, especially as you iterate on the process or add new images. If you've done this a lot, you probably have a graveyard of one-off scripts and datasets in various states of disarray.

Existing user interfaces for tasks that require a human in the loop, such as filtering scraped images or manually tagging them, are clumsy at best. In many cases they are built on sluggish frameworks like Gradio, lack keyboard shortcuts, are not mobile friendly, and are destructive, with no way to revisit or replay decisions. We don't need beautiful interfaces to work on image datasets, but we do need efficient interfaces to move quickly through the work.

Humans are human. If the tooling is bad, the datasets will be bad. If the datasets are bad, the models trained on them will be bad. If the models are bad, the consumers of those models will make bad images, get frustrated, or give up. beprepared is designed to improve every stage of that chain by making data preparation as efficient as possible.

Command Line Interface

beprepared provides a powerful command-line interface for running workflows and managing your workspace:

# Install from PyPI
$ pip install beprepared

# Run a workflow file
$ beprepared run workflow.py

# Execute a quick operation
$ beprepared exec "Load('images') >> FilterBySize(min_edge=512) >> Info"
$ beprepared exec "Load('images') >> AestheticScore >> Sort >> Take(100) >> Save"

# Manage the workspace database
$ beprepared db list                    # List all cached properties
$ beprepared db list "aesthetic_score*" # List specific properties
$ beprepared db clear "clip_embed*"     # Clear specific cached data

The CLI has three main commands:

beprepared run - Execute Workflow Files

Run complete workflow files that define your data preparation pipeline. For example:

$ beprepared run workflow.py

A typical workflow file looks like this:

# workflow.py
(
    Load("/path/to/dog_photos")
    >> FilterBySize(min_edge=512)    # Remove small images
    >> HumanFilter                   # Opens web UI for manual filtering
    >> ConvertFormat("JPEG")         # Standardize format
    >> JoyCaptionAlphaOne           # Auto-caption images
    >> HumanTag(tags=["labrador", "golden retriever", "poodle"])  # Manual tagging
    >> Save                          # Save to output/
)

When you run this workflow, beprepared will:

  1. Launch a web interface when human input is needed (filtering/tagging)
  2. Cache all operations to avoid repeating work
  3. Save the processed dataset to the output directory

Each step is cached, so if you run the workflow again:

  • Previously filtered/tagged images won't need human review
  • Previously captioned images won't need recaptioning
  • Only new or modified images will be processed

When this workflow is executed, beprepared will first walk the /path/to/dog_photos directory to discover images, then ingest them into the workspace. Next, it will hit the HumanFilter step and launch a web-based UI with a single-task user interface focused solely on filtering images. Once all images have been filtered by a human, it can move on to the next step.

Each step along the way is cached, on an image-by-image basis. So if you run the workflow again, it will not need to re-present the web interface for selection, nor will it need to re-caption images that have already been captioned. During human-in-the-loop steps, changes are committed to the database on every click, so you will never lose work.

If you add new images to the source directory and run the workflow again, only the new images will be processed; cached values cover the rest. If you change the workflow, the system will preserve as much work as possible, avoiding expensive human-in-the-loop operations or ML model invocations that have already been run.
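
The caching behavior described above can be illustrated with a small content-addressed sketch. This is not beprepared's actual schema; the cache dict, the step key, and the caption stand-in are hypothetical, but the idea is the same: results are keyed by step name plus image content hash, so only unseen images trigger new work.

```python
import hashlib

cache = {}   # (step_name, content_hash) -> cached result
calls = 0    # number of expensive "model" invocations

def caption(image_bytes):
    """Return a cached caption, invoking the 'model' only on a cache miss."""
    global calls
    key = ("caption", hashlib.sha256(image_bytes).hexdigest())
    if key not in cache:
        calls += 1                      # stand-in for a slow VLM call
        cache[key] = f"caption#{calls}"
    return cache[key]

first = [caption(img) for img in [b"dog", b"cat"]]            # two model calls
second = [caption(img) for img in [b"dog", b"cat", b"bird"]]  # only one new call
```

Because the key is derived from image content rather than file paths, renaming or re-ingesting the same image never repeats the expensive step.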

Significantly more complex workflows are possible, and beprepared is designed to support them. See the docs for more examples.

This is the main way to use beprepared.

beprepared exec - Quick Operations

Execute one-line operations without creating a workflow file:

$ beprepared exec "Load('raw_images') >> HumanFilter >> Save('filtered')"
$ beprepared exec "Load('dataset') >> JoyCaptionAlphaOne >> Save"
$ beprepared exec "Load('photos') >> NudeNet >> Info"

beprepared db - Manage the Workspace (advanced)

View and manage cached operations in your workspace:

$ beprepared db list                                 # List all properties
$ beprepared db list "llm*"                          # List specific properties
$ beprepared db clear                                # Clear all cached data
$ beprepared db clear "gemini*"                      # Clear specific cached data
$ beprepared db clear -d "mydomain" "humanfilter*"   # Clear specific cached data

Features

  • Flexible workflow definitions using a Python-based DSL
  • Non-destructive workflow execution
  • Caching of intermediate results to avoid duplicate work
  • Automatic Captioning using JoyCaption, Llama 3.2, BLIP3, xGen-mm, GPT-4o, Gemini, Molmo, and Florence2
  • Human-in-the-loop filtering and tagging
  • Nudity detection using NudeNet
  • Improving captions using LLMs
  • Upscaling and downscaling images using PIL
  • Filtering images based on size
  • Computing CLIP embeddings for images
  • JSON sidecar files for each image, so the data can be used in other tools or scripts
  • Precise and Fuzzy (CLIP-based) image deduplication
  • Aesthetic scoring
  • Collection operations like Map, Apply, Filter, Sort, Shuffle, Concat, and Random Sampling
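
As an example of consuming the JSON sidecars downstream, the sketch below fabricates two sidecar files and filters on a score. The field names (caption, aesthetic_score) are illustrative assumptions, not beprepared's documented schema:

```python
import json, tempfile
from pathlib import Path

def load_sidecars(directory):
    """Parse every .json sidecar in a directory into a list of dicts."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]

# Fabricate two sidecars, then keep only high-scoring images
out = Path(tempfile.mkdtemp())
(out / "foo.json").write_text(json.dumps({"caption": "a dog", "aesthetic_score": 6.1}))
(out / "bar.json").write_text(json.dumps({"caption": "a cat", "aesthetic_score": 4.2}))
records = load_sidecars(out)
good = [r for r in records if r["aesthetic_score"] >= 5.0]
```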

Documentation

The full documentation is available at https://blucz.github.io/beprepared

Limitations

This project is used to prepare datasets for fine-tuning diffusion models on a single compute node. Currently it supports only one GPU; multi-GPU support is planned.

It is not a goal of this project to help people prepare pre-training datasets with millions or billions of images. That would require a fundamentally more complex distributed architecture and would make it more difficult for the community to work with and improve this tool.

This project is currently developed on 24GB GPUs, and is not optimized for smaller GPUs. We welcome patches that make this software more friendly to smaller GPUs. Most likely this will involve tuning batch sizes or using quantized models.


Download files

Download the file for your platform.

Source Distribution

beprepared-0.2.2.tar.gz (214.7 kB)

Uploaded Source

Built Distribution

beprepared-0.2.2-py3-none-any.whl (231.0 kB)

Uploaded Python 3

File details

Details for the file beprepared-0.2.2.tar.gz.

File metadata

  • Download URL: beprepared-0.2.2.tar.gz
  • Upload date:
  • Size: 214.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.8.0-51-generic

File hashes

Hashes for beprepared-0.2.2.tar.gz
Algorithm Hash digest
SHA256 9c1fe77514c8c4f8db43639ca85a67e4d9eacc357f9d8007a2505bcf127013d9
MD5 5569364aeebf752925b697a430f6dbea
BLAKE2b-256 1610ea710863d2ecb6bf8d0210c258eaa3b354a8245c32f3b96d0e5d5c5ecf42

File details

Details for the file beprepared-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: beprepared-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 231.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.8.0-51-generic

File hashes

Hashes for beprepared-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 cb750b0cdb7e485bcc524087b890a49df869a6ecca7950667138f10e8db90b1a
MD5 a4e243dd8fb6321de97cf6c47d683128
BLAKE2b-256 7e9d8e02635ec7786bf9c126519e3d5b247315a1a240611245a9524c776582ff
