LLaVA-Caption

By David "Zanshinmu" Van de Ven zanshin.g1@gmail.com

Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions.


Overview

LLaVA-Caption was designed to solve a specific problem in AI training: when training on generated images, the original prompts often describe elements that are not present in the final images. Manually verifying and captioning each image is time-consuming, but inaccurate captions make poor training data. This tool produces higher-quality captions than BLIP, with options ranging from basic processing to near-manual quality.

LLaVA-Caption was built and tested on Apple Silicon. Cross-platform tooling should let it run on PCs, but it has not been tested on Linux or Windows.


Available Models

MLXModel (Recommended, Default)

  • Uses Qwen2-VL-7B-Instruct-8bit with Apple's MLX framework
  • Apple Silicon only
  • Fast processing with 16GB unified memory
  • Accuracy comparable to VisionModel
  • Requirements: Apple Silicon Mac, 16GB+ unified memory

VisionModel

  • Uses Llama 3.2 Vision via Ollama
  • High accuracy with moderate resource requirements
  • Excellent results with secondary caption generation
  • Ideal for training Flux/SD3
  • Requirements: 24GB RAM, GPU recommended

DualModel (Experimental)

  • Combines LLaVA 1.5 and Mixtral
  • Highest potential accuracy but resource-intensive
  • Supports distributed processing across machines
  • Currently experimental: may need optimization
  • Requirements: 64GB RAM, GPU strongly recommended

Additional Models

  • OLModel: Basic Ollama-based processing
  • HFModel: Hugging Face transformers-based processing (Note: MPS not supported on Apple Silicon)
  • LCPModel: Direct LLaMA C++ processing

Installation

Prerequisites

Quick Start

# Clone repository
git clone https://github.com/yourusername/llava-caption.git
cd llava-caption

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Verify installation
llava-caption --help

Usage

Basic Commands

# Basic usage with defaults
llava-caption /path/to/images/

# Specific model selection
llava-caption --model MLXModel /path/to/images/

# Direct captioning (no prompt comparison)
llava-caption --direct-caption /path/to/images/

Command Line Options

llava-caption [OPTIONS] DIRECTORY

Arguments:
  DIRECTORY                     Directory containing images

Model Selection:
  --model MODEL                 Model to use (default: MLXModel)
                                [env: LLAVA_PROCESSOR]

Processing Modes:
  --direct-caption              Enable direct captioning mode
  --secondary-caption           Enable secondary captioning
                                [env: SECONDARY_CAPTION]
  --no-preprocess               Disable text preprocessing
                                [env: PREPROCESSOR]

Model Parameters:
  --temperature FLOAT           Generation temperature (default: 0.0)
                                [env: TEMPERATURE]
  --gpu-layers INT              GPU layers (-1 for all)
                                [env: N_GPU_LAYERS]

Ollama Configuration:
  --ollama-address HOST:PORT    Ollama address (default: 127.0.0.1:11434)
                                [env: OLLAMA_REMOTEHOST]

Logging:
  --logging                     Enable detailed logging
  --sys-logging                 Enable system logging

Example Usage Patterns

# MLXModel with direct captioning
llava-caption --model MLXModel --direct-caption /path/to/images/

# VisionModel with remote Ollama
llava-caption --model VisionModel --ollama-address 192.168.1.110:11434 /path/to/images/

# Secondary captioning with higher temperature
llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/

# Debug mode
llava-caption --logging --sys-logging /path/to/images/
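
The patterns above can also be driven from a script. The following is a minimal Python sketch, not part of LLaVA-Caption itself: `build_command` and `caption_directories` are hypothetical helper names, and the sketch assumes only the flags documented under Command Line Options and that `llava-caption` is on the PATH.

```python
import subprocess


def build_command(directory, model=None, direct_caption=False,
                  secondary_caption=False, temperature=None,
                  ollama_address=None, logging=False):
    """Return an argv list for llava-caption (hypothetical helper)."""
    cmd = ["llava-caption"]
    if model is not None:
        cmd += ["--model", model]
    if direct_caption:
        cmd.append("--direct-caption")
    if secondary_caption:
        cmd.append("--secondary-caption")
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if ollama_address is not None:
        cmd += ["--ollama-address", ollama_address]
    if logging:
        cmd.append("--logging")
    cmd.append(str(directory))  # DIRECTORY is the positional argument
    return cmd


def caption_directories(directories, **options):
    """Run llava-caption once per directory, stopping at the first failure."""
    for directory in directories:
        subprocess.run(build_command(directory, **options), check=True)
```

For example, `caption_directories(["/path/to/set_a", "/path/to/set_b"], model="VisionModel", secondary_caption=True)` would process two directories in sequence. Passing the command as a list rather than a single shell string avoids quoting problems when a path contains spaces.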

Important Notes

PC Users

  • You may need to remove any mlx entries from requirements.txt to install successfully.

Model Downloads

  • Models are automatically downloaded via Hugging Face Hub or Ollama
  • Initial downloads may take time and significant disk space
  • Models are selected for optimal performance and resource usage

Resource Requirements

  • CPU Mode: Significant CPU and RAM usage, especially with HFModel
  • GPU Usage: Set TORCH_DEVICE="cuda:0" for NVIDIA GPU support
  • Distributed Processing: DualModel can run its models across two hosts
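
When the tool is launched from a script, the TORCH_DEVICE setting can be supplied through the child process environment instead of the shell. This is a minimal Python sketch, assuming the tool reads TORCH_DEVICE from its environment as the note above implies; `env_with_device` is a hypothetical helper, not part of the package.

```python
import os


def env_with_device(device="cuda:0", base=None):
    """Return a copy of the environment with TORCH_DEVICE set (hypothetical helper).

    The caller's environment (or `base`, if given) is copied, never modified.
    """
    env = dict(os.environ if base is None else base)
    env["TORCH_DEVICE"] = device
    return env
```

The result can be passed to `subprocess.run(["llava-caption", "/path/to/images/"], env=env_with_device(), check=True)` so that the setting applies to that one run only.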

File Handling

  • Expects matching .png and .txt files in target directory
  • Existing text files will be overwritten with new captions
  • In direct caption mode, creates new .txt files for each image
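
Because existing .txt files are overwritten, it can help to check the directory and copy the text files aside before a run. This is a minimal Python sketch, assuming that "matching" means the same file stem (for example image01.png with image01.txt); `find_pairs` and `backup_texts` are hypothetical helpers, not part of LLaVA-Caption.

```python
import shutil
from pathlib import Path


def find_pairs(directory):
    """Split the .png files in `directory` by whether a same-named .txt exists.

    Returns (paired, unpaired), each a sorted list of image paths.
    """
    directory = Path(directory)
    paired, unpaired = [], []
    for image in sorted(directory.glob("*.png")):
        if image.with_suffix(".txt").is_file():
            paired.append(image)
        else:
            unpaired.append(image)
    return paired, unpaired


def backup_texts(directory, backup_dir):
    """Copy every .txt file in `directory` into `backup_dir`.

    `backup_dir` should live outside the image directory so the copies
    are not touched by a captioning run. Returns the copied file names.
    """
    directory = Path(directory)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for text_file in sorted(directory.glob("*.txt")):
        shutil.copy2(text_file, backup_dir / text_file.name)
        copied.append(text_file.name)
    return copied
```

Images reported as unpaired have no text file to compare against, so they are candidates for direct caption mode.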

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
