
A Python module to capture knowledge from documents using Vision Language Models (VLMs)


AI Vision Capture

A powerful Python library for extracting and analyzing content from PDF, image, and video files using Vision Language Models (VLMs). It provides a flexible, efficient way to process documents, with support for multiple VLM providers including OpenAI, Anthropic Claude, Google Gemini, and Azure OpenAI.

Features

  • 🔍 Multi-Provider Support: Compatible with major VLM providers (OpenAI, Claude, Gemini, Azure, open-source models)
  • 📄 Document Processing: Process PDFs and images (JPG, PNG, TIFF, WebP, BMP)
  • 🎥 Video Processing: Extract and analyze frames from video files (MP4, AVI, MOV, MKV)
  • 🚀 Async Processing: Asynchronous processing with configurable concurrency
  • 💾 Two-Layer Caching: Local file system and cloud caching for improved performance
  • 🔄 Batch Processing: Process multiple documents in parallel
  • 📝 Text Extraction: Enhanced accuracy through combined OCR and VLM processing
  • 🎨 Image Quality Control: Configurable image quality settings
  • 📊 Structured Output: Well-organized JSON and Markdown output

Installation

pip install aicapture

Environment Setup

  1. Set your chosen provider and API key:
# For OpenAI
export USE_VISION=openai
export OPENAI_API_KEY=your_openai_key

# For Anthropic
export USE_VISION=anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# For Gemini
export USE_VISION=gemini
export GEMINI_API_KEY=your_google_key
  2. Optional performance settings:
export MAX_CONCURRENT_TASKS=5      # Number of concurrent processing tasks
export VISION_PARSER_DPI=333      # Image DPI for PDF processing
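
These are ordinary environment variables, so it is easy to check what the process will actually see. A small illustrative sketch (the `read_settings` helper and its fallback defaults are assumptions for demonstration, not part of aicapture):

```python
import os

def read_settings(env: dict) -> tuple:
    """Read the tuning variables, falling back to illustrative defaults."""
    max_tasks = int(env.get("MAX_CONCURRENT_TASKS", "5"))
    dpi = int(env.get("VISION_PARSER_DPI", "333"))
    return max_tasks, dpi

print(read_settings(dict(os.environ)))
```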

Core Capabilities

1. Document Parsing

The VisionParser provides general document processing capabilities for extracting unstructured content from documents.

import asyncio

from aicapture import VisionParser

# Initialize parser
parser = VisionParser()

# Process a single PDF
result = parser.process_pdf("path/to/your/document.pdf")

# Process a single image
result = parser.process_image("path/to/your/image.jpg")

# Process multiple documents asynchronously
async def process_folder():
    return await parser.process_folder_async("path/to/folder")

folder_results = asyncio.run(process_folder())
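
The MAX_CONCURRENT_TASKS setting caps how many documents are in flight at once. As a rough illustration of that pattern in plain asyncio (independent of aicapture; `process_one` is a stand-in for real per-file work):

```python
import asyncio

async def process_one(path: str, sem: asyncio.Semaphore) -> str:
    async with sem:              # at most max_tasks coroutines enter here
        await asyncio.sleep(0)   # stand-in for a real VLM call
        return f"processed {path}"

async def process_all(paths, max_tasks: int = 5):
    sem = asyncio.Semaphore(max_tasks)
    return await asyncio.gather(*(process_one(p, sem) for p in paths))

results = asyncio.run(process_all(["a.pdf", "b.pdf", "c.pdf"]))
print(results)
```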

Parser Output Format

{
  "file_object": {
    "file_name": "example.pdf",
    "file_hash": "sha256_hash",
    "total_pages": 10,
    "total_words": 5000,
    "pages": [
      {
        "page_number": 1,
        "page_content": "extracted content",
        "page_hash": "sha256_hash"
      }
    ]
  }
}
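
The result is a plain dict in the shape above, so downstream code can walk it directly. For example, joining all page contents into a single string (the sample dict below is illustrative, not real parser output):

```python
# Illustrative result in the schema shown above
result = {
    "file_object": {
        "file_name": "example.pdf",
        "total_pages": 2,
        "pages": [
            {"page_number": 1, "page_content": "Intro text"},
            {"page_number": 2, "page_content": "Spec table"},
        ],
    }
}

# Join page contents into a single document string
pages = result["file_object"]["pages"]
full_text = "\n\n".join(p["page_content"] for p in pages)
print(full_text)
```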

2. Structured Data Capture

The VisionCapture component enables extraction of structured data from images using customizable templates.

  1. Define your data template:
# Example template for technical alarm logic
ALARM_TEMPLATE = """
alarm:
  description: string  # Main alarm description
  destination: string # Destination system
  tag: string        # Alarm tag
  ref_logica: integer # Logic reference number

dependencies:
  type: array
  items:
    - signal_name: string  # Name of the dependency signal
      source: string      # Source system/component
      tag: string        # Signal tag
      ref_logica: integer|null  # Logic reference (can be null)
"""
  2. Use with OpenAI Vision:
import asyncio

from aicapture import OpenAIVisionModel, VisionCapture

vision_model = OpenAIVisionModel(
    model="gpt-4.1",
    max_tokens=4096,
    api_key="your_openai_key"
)

capture = VisionCapture(vision_model=vision_model)
result = asyncio.run(
    capture.capture(
        file_path="path/to/image.png",
        template=ALARM_TEMPLATE
    )
)
  3. Or use with Anthropic Claude:
import asyncio

from aicapture import AnthropicVisionModel, VisionCapture

vision_model = AnthropicVisionModel(
    model="claude-3-5-sonnet-20240620",
    max_tokens=4096,
    api_key="your_anthropic_key"
)

capture = VisionCapture(vision_model=vision_model)
result = asyncio.run(
    capture.capture(
        file_path="path/to/example.pdf",
        template=ALARM_TEMPLATE
    )
)
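
Since the template is YAML-flavoured, the captured result typically arrives as nested dicts and lists in the same shape. A lightweight sanity check on the alarm fields might look like this (the `captured` sample is made up for illustration):

```python
# Hypothetical captured result in the shape of ALARM_TEMPLATE
captured = {
    "alarm": {
        "description": "High pressure trip",
        "destination": "DCS",
        "tag": "PAH-101",
        "ref_logica": 12,
    },
    "dependencies": [
        {"signal_name": "PT-101", "source": "Field", "tag": "PT-101",
         "ref_logica": None},
    ],
}

required = {"description", "destination", "tag", "ref_logica"}
missing = required - captured["alarm"].keys()
print("missing alarm fields:", sorted(missing))
```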

3. Video Processing

The VidCapture component enables extraction of knowledge from video files by extracting frames and analyzing them with VLMs.

from aicapture import VidCapture, VideoConfig

# Configure video capture with custom settings
config = VideoConfig(
    frame_rate=2,                         # Extract 2 frames per second
    max_duration_seconds=30,              # Process up to 30 seconds of video
    target_frame_size=(768, 768),         # Resize frames for optimal processing
    supported_formats=(".mp4", ".avi", ".mov", ".mkv")
)

# Initialize video capture
video_capture = VidCapture(config)

# Process a video file with a custom prompt
result = video_capture.process_video(
    video_path="path/to/your/video.mp4",
    prompt="Describe what is happening in this video."
)

# Or extract frames for custom processing
frames, interval = video_capture.extract_frames("path/to/your/video.mp4")
print(f"Extracted {len(frames)} frames at {interval:.2f}s intervals")

# Analyze the extracted frames with a custom prompt
result = video_capture.capture(
    prompt="Analyze these video frames and describe key objects and actions.",
    images=frames
)
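
With the configuration above, the frame budget is easy to reason about: 2 frames per second over at most 30 seconds means at most 60 frames are sent to the model. A quick sanity check of that arithmetic:

```python
def max_frames(frame_rate: float, max_duration_seconds: float) -> int:
    """Upper bound on extracted frames for the settings above."""
    return int(frame_rate * max_duration_seconds)

print(max_frames(2, 30))
```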

Advanced Usage

Custom Vision Model Configuration

from aicapture import VisionParser, GeminiVisionModel

# Configure Gemini vision model with custom settings
vision_model = GeminiVisionModel(
    model="gemini-2.5-flash-preview-04-17",
    api_key="your_gemini_api_key"
)

# Initialize parser with custom configuration
parser = VisionParser(
    vision_model=vision_model,
    dpi=400,
    prompt="""
    Please analyze this technical document and extract:
    1. Equipment specifications and model numbers
    2. Operating parameters and limits
    3. Maintenance requirements
    4. Safety protocols
    5. Quality control metrics
    """
)

# Process PDF with custom settings
result = parser.process_pdf(
    pdf_path="path/to/document.pdf",
)

Development Setup

For local development:

  1. Clone the repository
  2. Copy .env.template to .env
  3. Edit .env with your settings
  4. Install development dependencies: pip install -e ".[dev]"

See .env.template for all available configuration options.

Documentation

For detailed configuration options and examples, see .env.template and the examples in the project repository.

Coming Soon

  • 🔗 Cross-Document Knowledge Capture: Capture structured knowledge across multiple documents

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/tiny-but-mighty)
  3. Commit your changes (git commit -m 'feat: add small but delightful improvement')
  4. Push to the branch (git push origin feature/tiny-but-mighty)
  5. Open a Pull Request

For detailed guidelines, see our Contributing Guide.

License

Copyright 2024 Aitomatic, Inc.

Licensed under the Apache License, Version 2.0. See LICENSE for details.
