LLaVA-Caption

By David "Zanshinmu" Van de Ven zanshin.g1@gmail.com

Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions.

Overview

LLaVA-Caption was designed to solve a specific problem in AI training: when using generated images, the original prompts often contain elements that aren't present in the final images. Manual verification and captioning are time-consuming, but inaccurate captions make for bad training data. This tool provides higher-quality captions than BLIP, with options ranging from basic processing to near-manual quality.

LLaVA-Caption was built and tested on Apple Silicon. Cross-platform tooling should make it usable on PCs, but it hasn't been tested on Linux or Windows.

Available Models

MLXModel (Recommended, Default)

Uses Qwen2-VL-7B-Instruct-8bit with Apple's MLX framework
Apple Silicon only
Fast processing with 16GB unified memory
Accuracy comparable to VisionModel
Requirements: Apple Silicon Mac, 16GB+ unified memory

VisionModel

Uses Llama 3.2 Vision via Ollama
High accuracy with moderate resource requirements
Excellent results with secondary caption generation
Ideal for training Flux/SD3
Requirements: 24GB RAM, GPU recommended

DualModel (Experimental)

Combines LLaVA 1.5 and Mixtral
Highest potential accuracy but resource-intensive
Supports distributed processing across machines
Currently experimental: may need optimization
Requirements: 64GB RAM, GPU strongly recommended

Additional Models

OLModel: Basic Ollama-based processing
HFModel: Hugging Face transformers-based processing (Note: the MPS backend is not supported, so there is no GPU acceleration on Apple Silicon)
LCPModel: Direct LLaMA C++ processing

Installation

Prerequisites

Python 3.10 (python.org)
Git (git-scm.com)
Ollama (ollama.com/download) - Required for Ollama-based models

Quick Start

# Clone repository
git clone https://github.com/yourusername/llava-caption.git
cd llava-caption

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Verify installation
llava-caption --help

Usage

Basic Commands

# Basic usage with defaults
llava-caption /path/to/images/

# Specific model selection
llava-caption --model MLXModel /path/to/images/

# Direct captioning (no prompt comparison)
llava-caption --direct-caption /path/to/images/

Command Line Options

llava-caption [OPTIONS] DIRECTORY

Arguments:
  DIRECTORY                     Directory containing images

Model Selection:
  --model MODEL                 Model to use (default: MLXModel)
                                [env: LLAVA_PROCESSOR]

Processing Modes:
  --direct-caption              Enable direct captioning mode
  --secondary-caption           Enable secondary captioning
                                [env: SECONDARY_CAPTION]
  --no-preprocess               Disable text preprocessing
                                [env: PREPROCESSOR]

Model Parameters:
  --temperature FLOAT           Generation temperature (default: 0.0)
                                [env: TEMPERATURE]
  --gpu-layers INT              GPU layers (-1 for all)
                                [env: N_GPU_LAYERS]

Ollama Configuration:
  --ollama-address HOST:PORT    Ollama address (default: 127.0.0.1:11434)
                                [env: OLLAMA_REMOTEHOST]

Logging:
  --logging                     Enable detailed logging
  --sys-logging                 Enable system logging
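
Options marked [env: ...] can also be supplied through the environment. The following is a minimal sketch of preparing such an environment from Python. It assumes each variable accepts the same string value as its command-line flag, which this README does not confirm; build_env is a hypothetical helper, not part of llava-caption.

```python
import os


def build_env(model="MLXModel", temperature=0.0,
              ollama_address="127.0.0.1:11434", base=None):
    """Return a copy of the environment with llava-caption settings applied.

    Assumption: each variable takes the same value as its command-line flag.
    """
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base is None else base)
    env["LLAVA_PROCESSOR"] = model
    env["TEMPERATURE"] = str(temperature)
    env["OLLAMA_REMOTEHOST"] = ollama_address
    return env


# Possible use with the CLI (not executed here):
#   import subprocess
#   subprocess.run(["llava-caption", "/path/to/images/"],
#                  env=build_env(model="VisionModel"), check=True)
```

Passing a copy of the environment to the child process keeps the settings local to that one run.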

Example Usage Patterns

# MLXModel with direct captioning
llava-caption --model MLXModel --direct-caption /path/to/images/

# VisionModel with remote Ollama
llava-caption --model VisionModel --ollama-address 192.168.1.110:11434 /path/to/images/

# Secondary captioning with higher temperature
llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/

# Debug mode
llava-caption --logging --sys-logging /path/to/images/
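
When the same patterns are scripted, for example to process several directories in turn, the argument list can be assembled in Python. This is only a sketch: build_command is a hypothetical helper, and it uses only the flags listed under Command Line Options.

```python
def build_command(directory, model=None, direct_caption=False,
                  secondary_caption=False, temperature=None,
                  ollama_address=None):
    """Build an argument list for the llava-caption CLI.

    Flags left at their default values here are omitted, so the tool's
    own defaults apply.
    """
    cmd = ["llava-caption"]
    if model is not None:
        cmd += ["--model", model]
    if direct_caption:
        cmd.append("--direct-caption")
    if secondary_caption:
        cmd.append("--secondary-caption")
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if ollama_address is not None:
        cmd += ["--ollama-address", ollama_address]
    # The image directory is the single positional argument and goes last.
    cmd.append(str(directory))
    return cmd


# Possible use (not executed here):
#   import subprocess
#   for d in ["/path/to/set_a/", "/path/to/set_b/"]:
#       subprocess.run(build_command(d, model="VisionModel"), check=True)
```

Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.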

Important Notes

PC Users

You may need to remove any mlx entries from requirements.txt to install successfully.

Model Downloads

Models are automatically downloaded via Hugging Face Hub or Ollama
Initial downloads may take time and significant disk space
Models are selected for optimal performance and resource usage

Resource Requirements

CPU Mode: Significant CPU and RAM usage, especially with HFModel
GPU Usage: Set TORCH_DEVICE="cuda:0" for NVIDIA GPU support
Distributed Processing: DualModel can run its models across two hosts

File Handling

Expects matching .png and .txt files in the target directory
Existing .txt files are overwritten with the new captions
In direct caption mode, a new .txt file is created for each image
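
Because existing text files are overwritten, it can help to check a directory before a run. The sketch below assumes "matching" means a .png and a .txt that share the same base name (for example cat.png and cat.txt); find_unpaired is a hypothetical helper, not part of llava-caption.

```python
from pathlib import Path


def find_unpaired(directory):
    """Return (images_without_txt, txt_without_image) as sorted file names.

    Assumption: a pair is a .png and a .txt with the same base name.
    """
    directory = Path(directory)
    png_stems = {p.stem for p in directory.glob("*.png")}
    txt_stems = {p.stem for p in directory.glob("*.txt")}
    images_without_txt = sorted(f"{s}.png" for s in png_stems - txt_stems)
    txt_without_image = sorted(f"{s}.txt" for s in txt_stems - png_stems)
    return images_without_txt, txt_without_image
```

Images reported without a .txt file have no original prompt to compare against, so they are candidates for direct caption mode.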

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Release history

0.8.0: Jan 31, 2025 (this version)
0.7.2: Jan 30, 2025
0.7.1.2: Jan 30, 2025
0.7.1.1: Jan 30, 2025
0.7.1: Jan 30, 2025
0.7.0: Jan 30, 2025

Download files

Source Distribution

llava_caption-0.7.2.tar.gz (25.0 kB), uploaded Jan 30, 2025, Source

Built Distribution

llava_caption-0.7.2-py3-none-any.whl (29.3 kB), uploaded Jan 30, 2025, Python 3

File details

Details for the file llava_caption-0.7.2.tar.gz.

File metadata

Download URL: llava_caption-0.7.2.tar.gz
Upload date: Jan 30, 2025
Size: 25.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.12.8 Darwin/24.3.0

File hashes

Hashes for llava_caption-0.7.2.tar.gz
Algorithm	Hash digest
SHA256	`1edddbd0b475f87eee595fd531f92edfe0b7c049cf82736728e8a8a722e56292`
MD5	`ad71f46cedebb1d814f3860875794bc7`
BLAKE2b-256	`25f4412f3518aca167bfb944ca06eb3dffdf61840c0414dd7749aacbe3d88a0b`
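
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; sha256_of and matches_digest are hypothetical helper names, and the file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Possible use (not executed here), with the digest from the table above:
#   matches_digest(
#       "llava_caption-0.7.2.tar.gz",
#       "1edddbd0b475f87eee595fd531f92edfe0b7c049cf82736728e8a8a722e56292",
#   )
```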

File details

Details for the file llava_caption-0.7.2-py3-none-any.whl.

File metadata

Download URL: llava_caption-0.7.2-py3-none-any.whl
Upload date: Jan 30, 2025
Size: 29.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.12.8 Darwin/24.3.0

File hashes

Hashes for llava_caption-0.7.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0a10b1c63abb89dd183023c1c464bf74efe0d8d0e61737180a14643a55d904fa`
MD5	`3b43a416b4ad451cbf9842f264fd6143`
BLAKE2b-256	`4dcfc2eba27aed4c398bae5ce1c58d77e06191a63331056ddf4973825b1ef248`
