A Python module to capture knowledge from documents using Vision Language Models (VLMs)

AI Vision Capture

A Python library for extracting and analyzing content from PDF, image, and video files using Vision Language Models (VLMs). Supported providers include OpenAI, Anthropic Claude, Google Gemini, Azure OpenAI, and AWS Bedrock.

Package Name: aicapture
Python Version: >=3.10
Build System: hatchling via uv

Features

Multi-Provider Support: OpenAI, Claude, Gemini, Azure OpenAI, AWS Bedrock
Document Processing: PDFs and images (JPG, PNG, TIFF, WebP, BMP)
Video Processing: Frame extraction and analysis from MP4, AVI, MOV, MKV
Audio Transcription: Whisper-powered speech-to-text with timestamps (LRC format)
Async Processing: Configurable concurrency for batch operations
Two-Layer Caching: Local file system + optional S3 cloud cache
Structured Output: Template-based data extraction to JSON
Pluggable Architecture: Provider abstraction with auto-detection

Installation

pip install aicapture

# With video transcription support (adds moviepy for audio extraction)
pip install aicapture[video]

Environment Setup

Set your provider API key (auto-detection picks the first available):

export OPENAI_API_KEY=your_key      # OpenAI
# or
export GEMINI_API_KEY=your_key      # Gemini
# or
export ANTHROPIC_API_KEY=your_key   # Anthropic

Optional settings:

export USE_VISION=openai            # Force specific provider (openai|claude|gemini|azure-openai|anthropic_bedrock)
export MAX_CONCURRENT_TASKS=5       # Concurrent API requests (default: 20)
export VISION_PARSER_DPI=333        # PDF rendering quality

See .env.template for full reference of available configuration options.
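
The selection rule described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `pick_provider` is a hypothetical helper, the `AZURE_OPENAI_API_KEY` variable name is an assumption (only the OpenAI, Gemini, and Anthropic key names appear in this README), and the probe order is the one listed under Key Design Patterns (Gemini, OpenAI, Azure, Anthropic).

```python
from typing import Mapping, Optional

# (provider name, env var holding its key), in the probe order the README states.
# AZURE_OPENAI_API_KEY is an assumed name; the README does not spell it out.
PROBE_ORDER = [
    ("gemini", "GEMINI_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("azure-openai", "AZURE_OPENAI_API_KEY"),
    ("claude", "ANTHROPIC_API_KEY"),
]


def pick_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the forced provider if USE_VISION is set, else the first one with a key."""
    forced = env.get("USE_VISION", "").strip()
    if forced:
        return forced
    for name, key_var in PROBE_ORDER:
        if env.get(key_var):
            return name
    return None


# Both keys present: the earlier entry in the probe order wins.
print(pick_provider({"OPENAI_API_KEY": "k", "GEMINI_API_KEY": "k"}))
# USE_VISION overrides auto-detection.
print(pick_provider({"USE_VISION": "openai", "GEMINI_API_KEY": "k"}))
```

Passing the environment in as a mapping keeps the sketch easy to test; in a real process you would pass `os.environ`.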

Usage

1. Document Parsing

import asyncio

from aicapture import VisionParser

parser = VisionParser()

# Process PDF or image
result = parser.process_pdf("document.pdf")
result = parser.process_image("photo.jpg")

# Batch processing: process_folder_async is a coroutine, so run it in an event loop
async def batch():
    return await parser.process_folder_async("path/to/folder")

results = asyncio.run(batch())

2. Structured Data Capture

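import asyncio

from aicapture import VisionCapture, OpenAIVisionModel

model = OpenAIVisionModel(model="gpt-4.1", api_key="your_key")
capture = VisionCapture(vision_model=model)

# Template describing the fields to extract
template = """
alarm:
  description: string
  tag: string
  ref_logica: integer
"""

# capture() is awaited, so it must be called from inside an async function
async def main():
    return await capture.capture(file_path="diagram.png", template=template)

result = asyncio.run(main())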
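
To make the template idea concrete, here is a small stdlib-only sketch that reads a two-level template like the one above and checks a result dictionary against it. How VisionCapture interprets templates internally is not documented here, so treat this purely as an illustration: `parse_template` and `matches_template` are hypothetical helpers, and the sample values are made up.

```python
from typing import Any, Dict

# Map the type names used in the template to Python types.
TYPE_NAMES = {"string": str, "integer": int}


def parse_template(template: str) -> Dict[str, Dict[str, type]]:
    """Parse a two-level 'section:' / '  field: type' template into nested dicts."""
    schema: Dict[str, Dict[str, type]] = {}
    section = None
    for raw in template.splitlines():
        if not raw.strip():
            continue
        key, _, value = raw.strip().partition(":")
        if not raw.startswith(" "):       # top-level line opens a section
            section = key
            schema[section] = {}
        elif section is not None:         # indented line declares a field
            schema[section][key] = TYPE_NAMES[value.strip()]
    return schema


def matches_template(result: Dict[str, Any], schema: Dict[str, Dict[str, type]]) -> bool:
    """True if every declared field is present with the declared type."""
    return all(
        isinstance(result.get(section, {}).get(field), expected)
        for section, fields in schema.items()
        for field, expected in fields.items()
    )


template = """
alarm:
  description: string
  tag: string
  ref_logica: integer
"""
schema = parse_template(template)
sample = {"alarm": {"description": "High pressure", "tag": "PAH-101", "ref_logica": 12}}
print(matches_template(sample, schema))
```

A check like this can be useful after extraction, since a VLM may return a field with the wrong type (for example, a number as a string).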

3. Video Processing

from aicapture import VidCapture, VideoConfig

config = VideoConfig(
    frame_rate=2,                # 2 frames per second
    max_duration_seconds=30,     # Max video length to process
    target_frame_size=(768, 768),
)

vid = VidCapture(config=config)
result = vid.process_video("video.mp4", "Describe what happens in this video.")
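
The interaction between frame_rate and max_duration_seconds comes down to simple arithmetic, sketched below. This is not the library's extraction code: `sampled_frame_indices` is a hypothetical helper, and it assumes that a video longer than the limit is truncated (the README does not say whether longer videos are truncated or rejected).

```python
from typing import List


def sampled_frame_indices(
    source_fps: float, total_frames: int, frame_rate: float, max_duration_seconds: float
) -> List[int]:
    """Indices of source frames to keep when sampling `frame_rate` frames per second."""
    duration = min(total_frames / source_fps, max_duration_seconds)
    count = int(duration * frame_rate)            # number of samples to take
    step = source_fps / frame_rate                # source frames between samples
    return [min(int(i * step), total_frames - 1) for i in range(count)]


# A 60-second clip at 30 fps, sampled at 2 fps and capped at 30 seconds.
indices = sampled_frame_indices(
    source_fps=30.0, total_frames=1800, frame_rate=2, max_duration_seconds=30
)
print(len(indices), indices[:4])
```

With these settings the sketch keeps 60 frames, one every 15 source frames, which gives a rough idea of how many images a single request may carry.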

4. Video Processing with Audio Transcription

Extract speech from video using OpenAI Whisper and include it as context for richer analysis:

from aicapture import VidCapture, VideoConfig

config = VideoConfig(
    frame_rate=2,                    # Extract 2 frames per second
    enable_transcription=True,       # Enable Whisper transcription
    transcription_model="whisper-1", # OpenAI Whisper model
    max_duration_seconds=600,        # Process up to 10 minutes
    transcription_language="en",     # Optional language hint (None for auto-detect)
    # Transcriptions are cached in: tmp/.vid_capture_cache/transcriptions/
    cache_dir="tmp/.vid_capture_cache",
)

vid = VidCapture(config=config)
result = vid.process_video("lecture.mp4", "Summarize the key points discussed.")

The transcription is extracted with segment-level timestamps (LRC format) and automatically appended to the VLM prompt:

[00:00.00] Welcome to this tutorial on machine learning.
[00:05.32] Today we will cover three main topics.
[00:09.15] First, let's discuss data preprocessing.
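
For readers unfamiliar with LRC, the sketch below shows how segment start times map to the `[mm:ss.xx]` prefix seen above. It is a standalone illustration: `format_lrc_line` and `to_lrc` here are hypothetical free functions written for this example, not the library's implementation, and segments are modeled as plain `(start_seconds, text)` tuples.

```python
from typing import Iterable, Tuple


def format_lrc_line(start_seconds: float, text: str) -> str:
    """Render one segment as '[mm:ss.xx] text' (minutes, seconds, hundredths)."""
    centis = round(start_seconds * 100)          # work in whole hundredths of a second
    minutes, rest = divmod(centis, 6000)
    seconds, hundredths = divmod(rest, 100)
    return f"[{minutes:02d}:{seconds:02d}.{hundredths:02d}] {text}"


def to_lrc(segments: Iterable[Tuple[float, str]]) -> str:
    """Join formatted segments, one per line."""
    return "\n".join(format_lrc_line(start, text) for start, text in segments)


segments = [
    (0.0, "Welcome to this tutorial on machine learning."),
    (5.32, "Today we will cover three main topics."),
    (9.15, "First, let's discuss data preprocessing."),
]
print(to_lrc(segments))
```

Converting to integer hundredths before splitting avoids floating-point artifacts in the seconds field.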

You can also use the transcriber directly:

from aicapture import OpenAIAudioTranscriber

transcriber = OpenAIAudioTranscriber(model="whisper-1")
transcription = transcriber.transcribe_video("video.mp4")

print(transcription.to_lrc())        # LRC-formatted output
print(transcription.full_text)       # Plain text
print(transcription.segments)        # List of TimestampedSegments

Advanced Configuration

from aicapture import VisionParser, GeminiVisionModel

model = GeminiVisionModel(model="gemini-2.5-flash", api_key="your_key")

parser = VisionParser(
    vision_model=model,
    dpi=400,
    prompt="Extract equipment specs, operating parameters, and safety protocols.",
)

result = parser.process_pdf("technical_manual.pdf")
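
The dpi setting determines how many pixels each rendered page has, which in turn affects request size and legibility of small text. The arithmetic is sketched below using the standard 72 points per inch of PDF page sizes; `rendered_size` is a hypothetical helper, and the library may apply further resizing that this sketch does not model.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rasterized at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, dpi=400))  # (3400, 4400)
```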

Architecture

Core Components

VisionParser (aicapture/vision_parser.py)
- Main entry point for document processing (PDFs, images)
- Handles PDF-to-image conversion using PyMuPDF (fitz)
- Manages concurrent processing with semaphore-based throttling
- Supports batch processing via process_folder_async()
- Default DPI: 333 (configurable via VISION_PARSER_DPI env var)
Vision Models (aicapture/vision_models.py)
- Abstract VisionModel base class defines the interface
- Concrete implementations for each provider:
  - OpenAIVisionModel - OpenAI GPT-4 Vision
  - AnthropicVisionModel - Claude models
  - GeminiVisionModel - Google Gemini
  - AzureOpenAIVisionModel - Azure OpenAI
  - AnthropicAWSBedrockVisionModel - Claude via AWS Bedrock
- AutoDetectVisionModel() - Auto-selects first available provider based on API keys
- create_default_vision_model() - Factory function respecting USE_VISION env var
VisionCapture (aicapture/vision_capture.py)
- Structured data extraction using customizable templates
- Processes images/PDFs and returns structured JSON based on template schema
- Used for extracting specific fields from documents (e.g., forms, technical diagrams)
VidCapture (aicapture/vid_capture.py)
- Video frame extraction and analysis using OpenCV
- Extracts frames at configurable rate (default: 2 fps)
- Supports MP4, AVI, MOV, MKV formats
- Resizes frames to target size (default: 768x768)
- Analyzes frames with VLM using custom prompts
Cache System (aicapture/cache.py)
- Two-layer caching: local file system + optional S3 cloud storage
- FileCache - Local JSON-based cache (default: tmp/.vision_parser_cache/)
- S3Cache - AWS S3-backed cache for distributed systems
- TwoLayerCache - Combines both with automatic fallback
- ImageCache - Specialized cache for image preprocessing results
- Cache keys use SHA-256 hashing of inputs (file content + prompt)
Settings (aicapture/settings.py)
- Centralized configuration using environment variables
- Auto-loads from .env file via python-dotenv
- Provider selection via USE_VISION (openai|claude|gemini|azure-openai|anthropic_bedrock)
- Image quality settings: ImageQuality.LOW_RES (512x512) or HIGH_RES (768x2000)
- Concurrency control: MAX_CONCURRENT_TASKS (default: 20)
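
The cache design listed above (SHA-256 keys over file content plus prompt, local layer first with a cloud fallback) can be sketched as follows. Everything here is illustrative: `cache_key` and `TwoLayerLookup` are hypothetical names, plain dictionaries stand in for the file system and S3, and both the exact hash input layout and the back-fill on a remote hit are assumptions rather than documented behavior.

```python
import hashlib
from typing import Dict, Optional


def cache_key(file_content: bytes, prompt: str) -> str:
    """SHA-256 hex digest over the file bytes and the prompt."""
    digest = hashlib.sha256()
    digest.update(file_content)
    digest.update(b"\x00")                 # separator between content and prompt
    digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()


class TwoLayerLookup:
    """Check a local layer first, then a remote one, back-filling local on a remote hit."""

    def __init__(self, local: Dict[str, str], remote: Optional[Dict[str, str]] = None) -> None:
        self.local = local
        self.remote = remote

    def get(self, key: str) -> Optional[str]:
        if key in self.local:
            return self.local[key]
        if self.remote is not None and key in self.remote:
            self.local[key] = self.remote[key]   # keep a local copy for next time
            return self.local[key]
        return None

    def set(self, key: str, value: str) -> None:
        self.local[key] = value
        if self.remote is not None:
            self.remote[key] = value


key = cache_key(b"%PDF-1.7 example bytes", "Extract all text.")
cache = TwoLayerLookup(local={}, remote={key: "cached result"})
print(cache.get(key))
```

Because the prompt is part of the key, changing the prompt for the same file yields a different key and therefore a fresh request.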

Key Design Patterns

Provider Abstraction: All VLM providers implement the same VisionModel interface, making them interchangeable
Auto-Detection: If USE_VISION is not set, the library auto-detects available providers by checking for API keys (order: Gemini → OpenAI → Azure → Anthropic)
Async-First: Most operations support async/await for efficient concurrent processing
Caching Strategy: Two-layer cache (local + cloud) reduces API calls and costs
Template-Based Extraction: VisionCapture uses YAML-like templates to define expected data structure

Development

Setup

# Install dependencies (requires uv: https://github.com/astral-sh/uv)
uv sync --all-extras

Code Quality

# Format code with ruff
make format

# Run linters (ruff + mypy) - matches CI exactly
make lint

# Run all checks (format + lint + test)
make all

Testing

# Run all tests with coverage
make test

# Run tests with pytest directly (more control)
uv run pytest -v --cov=aicapture --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_vision_parser_extended.py -v

# Run specific test function
uv run pytest tests/test_vision_parser_extended.py::test_specific_function -v

Build & Publish

# Build package for distribution
make build
# Or: uv build

# Publish to PyPI
make publish
# Or: uv publish

Code Style

Linting: Ruff (replaces black, isort, flake8)
- Line length: 120 characters
- Python 3.10+ syntax
- See pyproject.toml [tool.ruff] for rules
Type Checking: MyPy with strict mode
- All functions in aicapture/ must have type annotations
- Tests are exempt from strict typing (disallow_untyped_defs = false)
Formatting: Ruff format (auto-fixes most issues)

Testing Requirements

All tests use pytest with async support (pytest-asyncio)
Tests requiring API keys are controlled via environment variables
CI runs ruff check, mypy, and pytest on every PR
Coverage target: Tests must cover new functionality
Test files follow pattern: test_*.py in tests/ directory

Important Development Notes

PyMuPDF Import: Use import fitz (not pymupdf) for PDF processing
Circular Import Handling: vid_capture.py imports directly from vision_models to avoid circular dependencies
Cache Invalidation: Pass invalidate_cache=True to constructors to force fresh results (bypasses cache reads)
Image Quality: Use ImageQuality.HIGH_RES for best OCR results, LOW_RES for faster processing
Video Duration Limit: Default max 300 seconds (5 min) per video (configurable via VideoConfig.max_duration_seconds)
Supported Image Formats: JPG, JPEG, PNG, TIFF, WebP, BMP

Release Process

⚠️ IMPORTANT: Before committing version changes, remember to bump the Python version in pyproject.toml

When preparing a new release:

Update version in pyproject.toml (both package version and Python version if applicable)
Run all tests: make all
Build the package: make build
Create commit with version bump
Tag the release
Publish to PyPI: make publish

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/your-feature)
Run make all to ensure code quality
Commit your changes (remember to update version in pyproject.toml if needed)
Push and open a Pull Request

License

Release history

Version	Release date
0.6.4	Mar 5, 2026
0.6.3	Mar 2, 2026
0.6.2	Feb 23, 2026
0.6.1	Feb 23, 2026
0.6.0	Feb 23, 2026
0.5.4	Feb 15, 2026
0.5.3	Feb 13, 2026
0.5.1	Jan 27, 2026
0.5.0	Jan 24, 2026 (this version)
0.4.0	Jan 24, 2026
0.3.7	Jan 5, 2026
0.3.6	Nov 27, 2025
0.3.5	Oct 3, 2025
0.3.4	Oct 3, 2025
0.3.3	Aug 14, 2025
0.3.2	Aug 7, 2025
0.3.1	Jul 5, 2025
0.3.0	Jun 3, 2025
0.2.11	May 2, 2025
0.2.10	Apr 29, 2025
0.2.9	Apr 20, 2025
0.2.8	Apr 15, 2025
0.2.7	Apr 12, 2025
0.2.6	Apr 11, 2025
0.2.5	Apr 11, 2025
0.2.4	Apr 3, 2025
0.2.3	Apr 3, 2025
0.2.2	Apr 3, 2025
0.2.1	Apr 3, 2025
0.2.0	Apr 2, 2025
0.1.10	Apr 2, 2025
0.1.9	Apr 2, 2025
0.1.8	Mar 19, 2025
0.1.7	Mar 15, 2025
0.1.5	Mar 15, 2025
0.1.4	Mar 10, 2025
0.1.3	Mar 6, 2025
0.1.2	Mar 5, 2025
0.1.1	Mar 4, 2025
0.1.0	Mar 4, 2025

Download files

Source Distribution

aicapture-0.5.0.tar.gz (58.6 MB)

Uploaded Jan 24, 2026 Source

Built Distribution

aicapture-0.5.0-py3-none-any.whl (44.5 kB)

Uploaded Jan 24, 2026 Python 3

File details

Details for the file aicapture-0.5.0.tar.gz.

File metadata

Download URL: aicapture-0.5.0.tar.gz
Upload date: Jan 24, 2026
Size: 58.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.17

File hashes

Hashes for aicapture-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`61eaccfcb97055678da1230845258015adb3db22cedfe5e3c8bbcb5dcf04cf51`
MD5	`90497b7802b31a718d1cc5792f225a57`
BLAKE2b-256	`0b4a3383be992ab58d7de449a1031b86f0cf6d7bb75d0a5bf6ef7ca00ea68045`

File details

Details for the file aicapture-0.5.0-py3-none-any.whl.

File metadata

Download URL: aicapture-0.5.0-py3-none-any.whl
Upload date: Jan 24, 2026
Size: 44.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.17

File hashes

Hashes for aicapture-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a81868b5d18fbb5d97344aa7db0ac335ce1600bc4a80c13f33469cbf7e22d46`
MD5	`1769893ef7aa514eb83238ea21c2548a`
BLAKE2b-256	`872923d1f09b7707028aa24710641ceaf5c606c2581f510d47caf7079482b9df`
