LLaVA-Caption
By David "Zanshinmu" Van de Ven zanshin.g1@gmail.com
Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions.
Overview
LLaVA-Caption was designed to solve a specific problem in AI training: when training on generated images, the original prompts often describe elements that are not present in the final images. Manual verification and captioning is time-consuming, but inaccurate captions make for bad training data. This tool provides higher-quality captions than BLIP, with options ranging from basic processing to near-manual quality.
LLaVA-Caption was built and tested on Apple Silicon. Its cross-platform dependencies should allow it to run on PCs, but it has not been tested on Linux or Windows.
Available Models
MLXModel (Recommended, Default)
- Uses Qwen2-VL-7B-Instruct-8bit with Apple's MLX framework
- Apple Silicon only
- Fast processing with 16GB unified memory
- Accuracy comparable to VisionModel
- Requirements: Apple Silicon Mac, 16GB+ unified memory
VisionModel
- Uses Llama 3.2 Vision via Ollama
- High accuracy with moderate resource requirements
- Excellent results with secondary caption generation
- Ideal for training Flux/SD3
- Requirements: 24GB RAM, GPU recommended
DualModel (Experimental)
- Combines LLaVA 1.5 and Mixtral
- Highest potential accuracy but resource-intensive
- Supports distributed processing across machines
- Currently experimental: may need optimization
- Requirements: 64GB RAM, GPU strongly recommended
Additional Models
- OLModel: Basic Ollama-based processing
- HFModel: Hugging Face transformers-based processing (Note: MPS not supported on Apple Silicon)
- LCPModel: Direct LLaMA C++ processing
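The stated requirements for the three main models can be condensed into a small lookup. This is a minimal sketch and not part of llava-caption: the `suggest_models` helper and the `ModelRequirement` type are hypothetical, and the thresholds are the RAM figures quoted in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRequirement:
    name: str
    min_ram_gb: int
    apple_silicon_only: bool


# Figures copied from the "Available Models" list above; nothing here is
# read from llava-caption itself.
REQUIREMENTS = [
    ModelRequirement("MLXModel", 16, True),
    ModelRequirement("VisionModel", 24, False),
    ModelRequirement("DualModel", 64, False),
]


def suggest_models(ram_gb: int, apple_silicon: bool) -> list[str]:
    """Return the names of models whose stated requirements are met."""
    return [
        req.name
        for req in REQUIREMENTS
        if ram_gb >= req.min_ram_gb
        and (apple_silicon or not req.apple_silicon_only)
    ]


print(suggest_models(ram_gb=16, apple_silicon=True))   # ['MLXModel']
print(suggest_models(ram_gb=32, apple_silicon=False))  # ['VisionModel']
```

The result can be passed to `--model`. The sketch ignores GPU availability, which the list above only recommends rather than requires.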
Installation
Prerequisites
- Python 3.10 (python.org)
- Git (git-scm.com)
- Ollama (ollama.com/download) - Required for Ollama-based models
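Ollama-based models need a running Ollama server. As a quick check before captioning, the sketch below queries Ollama's `/api/tags` endpoint, which lists locally installed models. The helper names are hypothetical and not part of llava-caption; the default address is the one documented under Command Line Options.

```python
import json
import urllib.error
import urllib.request


def ollama_tags_url(address: str = "127.0.0.1:11434") -> str:
    """Build the URL of Ollama's model-listing endpoint for HOST:PORT."""
    return f"http://{address}/api/tags"


def list_ollama_models(address: str = "127.0.0.1:11434", timeout: float = 3.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(ollama_tags_url(address), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None
    return [model.get("name") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    print("Ollama not reachable" if models is None else models)
```

For a remote server, pass the same HOST:PORT value you would give to `--ollama-address`.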
Quick Start
# Clone repository
git clone https://github.com/yourusername/llava-caption.git
cd llava-caption
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
pip install -e .
# Verify installation
llava-caption --help
Usage
Basic Commands
# Basic usage with defaults
llava-caption /path/to/images/
# Specific model selection
llava-caption --model MLXModel /path/to/images/
# Direct captioning (no prompt comparison)
llava-caption --direct-caption /path/to/images/
Command Line Options
llava-caption [OPTIONS] DIRECTORY
Arguments:
  DIRECTORY                     Directory containing images
Model Selection:
  --model MODEL                 Model to use (default: MLXModel)
                                [env: LLAVA_PROCESSOR]
Processing Modes:
  --direct-caption              Enable direct captioning mode
  --secondary-caption           Enable secondary captioning
                                [env: SECONDARY_CAPTION]
  --no-preprocess               Disable text preprocessing
                                [env: PREPROCESSOR]
Model Parameters:
  --temperature FLOAT           Generation temperature (default: 0.0)
                                [env: TEMPERATURE]
  --gpu-layers INT              GPU layers (-1 for all)
                                [env: N_GPU_LAYERS]
Ollama Configuration:
  --ollama-address HOST:PORT    Ollama address (default: 127.0.0.1:11434)
                                [env: OLLAMA_REMOTEHOST]
Logging:
  --logging                     Enable detailed logging
  --sys-logging                 Enable system logging
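Several options above can also be set through environment variables. The sketch below shows one common way such a fallback is wired up with `argparse`. It is an illustration only, not the tool's actual parser: the precedence (explicit flag, then environment variable, then built-in default) is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse
import os


def build_parser(env=None) -> argparse.ArgumentParser:
    """Parser whose defaults fall back to environment variables.

    Assumption: an explicit flag beats the environment variable, which beats
    the built-in default. The real tool may resolve these differently.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="llava-caption-sketch")
    parser.add_argument("directory")
    parser.add_argument("--model", default=env.get("LLAVA_PROCESSOR", "MLXModel"))
    parser.add_argument(
        "--temperature", type=float, default=float(env.get("TEMPERATURE", "0.0"))
    )
    parser.add_argument(
        "--ollama-address", default=env.get("OLLAMA_REMOTEHOST", "127.0.0.1:11434")
    )
    return parser


# The environment supplies the model, the flag supplies the temperature,
# and the Ollama address falls through to its default.
args = build_parser(env={"LLAVA_PROCESSOR": "VisionModel"}).parse_args(
    ["--temperature", "0.7", "images/"]
)
print(args.model, args.temperature, args.ollama_address)
# VisionModel 0.7 127.0.0.1:11434
```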
Example Usage Patterns
# MLXModel with direct captioning
llava-caption --model MLXModel --direct-caption /path/to/images/
# VisionModel with remote Ollama
llava-caption --model VisionModel --ollama-address 192.168.1.110:11434 /path/to/images/
# Secondary captioning with higher temperature
llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/
# Debug mode
llava-caption --logging --sys-logging /path/to/images/
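When captioning is one step in a larger pipeline, the same commands can be assembled from Python. This is a minimal sketch: `build_command` is a hypothetical helper, and it only uses flags listed under Command Line Options.

```python
import shlex


def build_command(directory, model=None, direct_caption=False,
                  secondary_caption=False, temperature=None,
                  ollama_address=None):
    """Assemble an argv list for the llava-caption CLI."""
    argv = ["llava-caption"]
    if model is not None:
        argv += ["--model", model]
    if direct_caption:
        argv.append("--direct-caption")
    if secondary_caption:
        argv.append("--secondary-caption")
    if temperature is not None:
        argv += ["--temperature", str(temperature)]
    if ollama_address is not None:
        argv += ["--ollama-address", ollama_address]
    argv.append(directory)  # DIRECTORY is the single positional argument
    return argv


cmd = build_command("/path/to/images/", model="VisionModel",
                    secondary_caption=True, temperature=0.7)
print(shlex.join(cmd))
# llava-caption --model VisionModel --secondary-caption --temperature 0.7 /path/to/images/
```

To run the command, pass the list to `subprocess.run(cmd, check=True)`. Keeping it as a list avoids shell quoting problems with directory names that contain spaces.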
Important Notes
PC Users
- You may need to remove any mlx entries from requirements.txt to install successfully.
Model Downloads
- Models are automatically downloaded via Hugging Face Hub or Ollama
- Initial downloads may take time and significant disk space
- Models are selected for optimal performance and resource usage
Resource Requirements
- CPU Mode: Significant CPU and RAM usage, especially with HFModel
- GPU Usage: Set TORCH_DEVICE="cuda:0" for Nvidia GPU support
- Distributed Processing: Possible to run models across 2 hosts using DualModel
File Handling
- Expects matching .png and .txt files in target directory
- Existing text files will be overwritten with new captions
- In direct caption mode, creates new .txt files for each image
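Because existing .txt files are overwritten, it can be worth checking the directory and backing up captions before a run. The sketch below is an illustration only: all three helpers are hypothetical and not part of llava-caption, and the `.bak` suffix is an arbitrary choice.

```python
import shutil
import tempfile
from pathlib import Path


def pair_images_with_captions(directory):
    """Yield (image_path, caption_path) for every .png file in directory."""
    for image in sorted(Path(directory).glob("*.png")):
        yield image, image.with_suffix(".txt")


def missing_captions(directory):
    """Names of images that have no matching .txt file yet."""
    return [img.name for img, txt in pair_images_with_captions(directory)
            if not txt.exists()]


def backup_captions(directory, suffix=".bak"):
    """Copy each existing caption to <name>.txt.bak before it is overwritten."""
    backups = []
    for _, txt in pair_images_with_captions(directory):
        if txt.exists():
            target = txt.with_name(txt.name + suffix)
            shutil.copy2(txt, target)
            backups.append(target.name)
    return backups


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.png").touch()
    (root / "a.txt").write_text("original prompt for a")
    (root / "b.png").touch()
    print(missing_captions(root))  # ['b.png']
    print(backup_captions(root))   # ['a.txt.bak']
```

Images reported by `missing_captions` have no prompt to compare against, so they are candidates for `--direct-caption`.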
License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
File details
Details for the file llava_caption-0.7.2.tar.gz.
File metadata
- Download URL: llava_caption-0.7.2.tar.gz
- Upload date:
- Size: 25.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.1 CPython/3.12.8 Darwin/24.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1edddbd0b475f87eee595fd531f92edfe0b7c049cf82736728e8a8a722e56292 |
| MD5 | ad71f46cedebb1d814f3860875794bc7 |
| BLAKE2b-256 | 25f4412f3518aca167bfb944ca06eb3dffdf61840c0414dd7749aacbe3d88a0b |
File details
Details for the file llava_caption-0.7.2-py3-none-any.whl.
File metadata
- Download URL: llava_caption-0.7.2-py3-none-any.whl
- Upload date:
- Size: 29.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.1 CPython/3.12.8 Darwin/24.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a10b1c63abb89dd183023c1c464bf74efe0d8d0e61737180a14643a55d904fa |
| MD5 | 3b43a416b4ad451cbf9842f264fd6143 |
| BLAKE2b-256 | 4dcfc2eba27aed4c398bae5ce1c58d77e06191a63331056ddf4973825b1ef248 |