LLaVA-Caption

By David "Zanshinmu" Van de Ven <zanshin.g1@gmail.com>

Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions. Optimized for and primarily tested on Apple Silicon Macs, though cross-platform compatibility is available.


Overview

LLaVA-Caption was designed to solve a specific problem in AI training: when using generated images, the original prompts often describe elements that aren't present in the final images. Manual verification and captioning are time-consuming, but inaccurate captions make for bad training data. This tool provides higher-quality captions than BLIP, with options ranging from basic processing to near-manual quality.

LLaVA-Caption was built and tested on Apple Silicon. Its cross-platform dependencies should make it usable on PCs, but it has not been tested on Linux or Windows.


Available Models

MLXModel (Recommended, Default)

  • Uses Qwen2-VL-7B-Instruct-8bit with Apple's MLX framework
  • Apple Silicon only
  • Fast processing with 16GB unified memory
  • Accuracy comparable to VisionModel
  • Requirements: Apple Silicon Mac, 16GB+ unified memory

VisionModel

  • Uses Llama 3.2 Vision via Ollama
  • High accuracy with moderate resource requirements
  • Excellent results with secondary caption generation
  • Ideal for training Flux/SD3
  • Requirements: 24GB RAM, GPU recommended

DualModel (Experimental)

  • Combines LLaVA 1.5 and Mixtral
  • Highest potential accuracy but resource-intensive
  • Supports distributed processing across machines
  • Currently experimental: may need optimization
  • Requirements: 64GB RAM, GPU strongly recommended

Additional Models

  • OLModel: Basic Ollama-based processing
  • HFModel: Hugging Face transformers-based processing (Note: MPS not supported on Apple Silicon)
  • LCPModel: Direct LLaMA C++ processing
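
The stated requirements above can be turned into a quick pre-flight check. The following is a minimal sketch in Python, assuming the memory thresholds are exactly those listed in this section; `models_for_machine` and `is_apple_silicon` are hypothetical helpers for illustration and are not part of llava-caption.

```python
import platform
import sys

# Memory requirements (GB) as stated in the model list above.
# These figures come from this README, not from the package itself.
REQUIRED_RAM_GB = {
    "MLXModel": 16,     # also requires Apple Silicon
    "VisionModel": 24,
    "DualModel": 64,    # experimental
}


def is_apple_silicon() -> bool:
    """Best-effort check: macOS running natively on an arm64 CPU."""
    return sys.platform == "darwin" and platform.machine() == "arm64"


def models_for_machine(ram_gb: float, apple_silicon: bool) -> list[str]:
    """Return the models whose stated requirements this machine meets."""
    candidates = []
    for name, needed in REQUIRED_RAM_GB.items():
        if name == "MLXModel" and not apple_silicon:
            continue  # MLX runs on Apple Silicon only
        if ram_gb >= needed:
            candidates.append(name)
    return candidates


if __name__ == "__main__":
    # ram_gb is supplied by hand here; detecting it portably is out of scope.
    print(models_for_machine(ram_gb=32, apple_silicon=is_apple_silicon()))
```

The sketch only filters by platform and memory. It does not account for GPU availability, which the list above marks as recommended rather than required.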

Installation

Prerequisites

Installation with Poetry (Recommended)

# Clone repository
git clone https://github.com/yourusername/llava-caption.git
cd llava-caption

# Install with Poetry
poetry install

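# Activate virtual environment
# (Poetry 2.x: "poetry shell" requires the poetry-plugin-shell plugin;
#  alternatively, prefix commands with "poetry run")
poetry shell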

# Verify installation
llava-caption --help

Alternative Installation with pip

# Clone repository
git clone https://github.com/yourusername/llava-caption.git
cd llava-caption

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Verify installation
llava-caption --help

Usage

Basic Commands

# Basic usage with defaults
llava-caption /path/to/images/

# Specific model selection
llava-caption --model MLXModel /path/to/images/

# Direct captioning (no prompt comparison)
llava-caption --direct-caption /path/to/images/

Command Line Options

llava-caption [OPTIONS] DIRECTORY

Arguments:
  DIRECTORY                     Directory containing images

Model Selection:
  --model MODEL                 Model to use (default: MLXModel)
                                [env: LLAVA_PROCESSOR]

Processing Modes:
  --direct-caption              Enable direct captioning mode
  --secondary-caption           Enable secondary captioning
                                [env: SECONDARY_CAPTION]
  --no-preprocess               Disable text preprocessing
                                [env: PREPROCESSOR]

Model Parameters:
  --temperature FLOAT           Generation temperature (default: 0.0)
                                [env: TEMPERATURE]
  --gpu-layers INT              GPU layers (-1 for all)
                                [env: N_GPU_LAYERS]

Ollama Configuration:
  --ollama-address HOST:PORT    Ollama address (default: 127.0.0.1:11434)
                                [env: OLLAMA_REMOTEHOST]

Logging:
  --logging                     Enable detailed logging
  --sys-logging                 Enable system logging

Example Usage Patterns

# MLXModel with direct captioning
llava-caption --model MLXModel --direct-caption /path/to/images/

# VisionModel with remote Ollama
llava-caption --model VisionModel --ollama-address 192.168.1.110:11434 /path/to/images/

# Secondary captioning with higher temperature
llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/

# Debug mode
llava-caption --logging --sys-logging /path/to/images/
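
When captioning several directories, the documented options can be assembled programmatically. The following is a minimal sketch in Python that uses only flags listed under Command Line Options; `build_command` and `run_batch` are hypothetical helper names, and the sketch assumes `llava-caption` is on the PATH.

```python
import shlex
import subprocess


def build_command(directory, model=None, direct_caption=False,
                  secondary_caption=False, temperature=None,
                  ollama_address=None, logging=False):
    """Assemble an argv list using only flags documented above."""
    cmd = ["llava-caption"]
    if model is not None:
        cmd += ["--model", model]
    if direct_caption:
        cmd.append("--direct-caption")
    if secondary_caption:
        cmd.append("--secondary-caption")
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if ollama_address is not None:
        cmd += ["--ollama-address", ollama_address]
    if logging:
        cmd.append("--logging")
    cmd.append(str(directory))  # DIRECTORY is the positional argument
    return cmd


def run_batch(directories, **options):
    """Run llava-caption once per directory; stop on the first failure."""
    for directory in directories:
        cmd = build_command(directory, **options)
        print("Running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)  # needs llava-caption on PATH
```

For example, `run_batch(["/path/to/set1/", "/path/to/set2/"], model="VisionModel", secondary_caption=True)` would issue one invocation per directory. Passing an argument list rather than a shell string avoids quoting problems with paths that contain spaces.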

Important Notes

PC Users

  • You may need to remove any mlx entries from requirements.txt to install successfully.

Model Downloads

  • Models are downloaded automatically via Hugging Face Hub or Ollama
  • Initial downloads may take a while and require significant disk space
  • The default models were chosen to balance performance and resource usage

Resource Requirements

  • CPU Mode: Significant CPU and RAM usage, especially with HFModel
  • GPU Usage: Set TORCH_DEVICE="cuda:0" for NVIDIA GPU support
  • Distributed Processing: DualModel can run its models across two hosts
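
A minimal sketch of setting TORCH_DEVICE for a single run from Python, assuming the tool reads it from the process environment (the note above implies this but does not state it explicitly). `env_with_torch_device` is a hypothetical helper name.

```python
import os


def env_with_torch_device(device="cuda:0", base_env=None):
    """Return a copy of the environment with TORCH_DEVICE set.

    The caller's environment mapping is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_DEVICE"] = device
    return env


# Usage (assumes llava-caption is installed and on PATH):
#   import subprocess
#   subprocess.run(["llava-caption", "/path/to/images/"],
#                  env=env_with_torch_device(), check=True)
```

Building a per-process environment keeps the setting local to one invocation instead of changing it for the whole shell session.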

File Handling

  • Expects matching .png and .txt files in the target directory
  • Existing text files will be overwritten with new captions
  • In direct caption mode, creates a new .txt file for each image
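
Because existing text files are overwritten, it can help to check the pairing and keep a copy of the originals before a run. The following is a minimal sketch in Python, assuming "matching" means the .txt file shares the image's base name (for example cat.png and cat.txt); `split_by_pairing`, `backup_text_files`, and the `txt_backup` directory name are hypothetical and not part of llava-caption.

```python
import shutil
from pathlib import Path


def split_by_pairing(directory):
    """Return (paired, unpaired) lists of .png files.

    Assumption: "matching" means the .txt shares the image's base name,
    e.g. cat.png <-> cat.txt.
    """
    paired, unpaired = [], []
    for image in sorted(Path(directory).glob("*.png")):
        target = paired if image.with_suffix(".txt").is_file() else unpaired
        target.append(image)
    return paired, unpaired


def backup_text_files(directory, backup_name="txt_backup"):
    """Copy every top-level .txt file into a subdirectory and return their names."""
    source = Path(directory)
    backup_dir = source / backup_name
    backup_dir.mkdir(exist_ok=True)
    copied = []
    for text_file in sorted(source.glob("*.txt")):
        shutil.copy2(text_file, backup_dir / text_file.name)
        copied.append(text_file.name)
    return copied
```

Images reported as unpaired have no prompt file to compare against, so they are candidates for direct caption mode. The backup location is a design choice; a directory outside the image folder works equally well if the extra subdirectory is unwanted.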

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
