Skip to main content

A tool for analyzing videos using Vision models

Project description

Video Analysis using vision models like Llama3.2 Vision and OpenAI's Whisper Models

A video analysis tool that combines vision models like Llama's 11B vision model and Whisper to create a description by taking key frames, feeding them to the vision model to get details. It uses the details from each frame and the transcript, if available, to describe what's happening in the video.

Table of Contents

Features

  • 💻 Can run completely locally - no cloud services or API keys needed
  • ☁️ Or, leverage any OpenAI API compatible LLM service (openrouter, openai, etc) for speed and scale
  • 🎬 Intelligent key frame extraction from videos
  • 🔊 High-quality audio transcription using OpenAI's Whisper
  • 👁️ Frame analysis using Ollama and Llama3.2 11B Vision Model
  • 📝 Natural language descriptions of video content
  • 🔄 Automatic handling of poor quality audio
  • 📊 Detailed JSON output of analysis results
  • ⚙️ Highly configurable through command line arguments or config file

Design

The system operates in three stages:

  1. Frame Extraction & Audio Processing

    • Uses OpenCV to extract key frames
    • Processes audio using Whisper for transcription
    • Handles poor quality audio with confidence checks
  2. Frame Analysis

    • Analyzes each frame using vision LLM
    • Each analysis includes context from previous frames
    • Maintains chronological progression
    • Uses frame_analysis.txt prompt template
  3. Video Reconstruction

    • Combines frame analyses chronologically
    • Integrates audio transcript
    • Uses first frame to set the scene
    • Creates comprehensive video description

Design

Requirements

System Requirements

  • Python 3.11 or higher
  • FFmpeg (required for audio processing)
  • When running LLMs locally (not necessary when using openrouter)
    • At least 16GB RAM (32GB recommended)
    • GPU at least 12GB of VRAM or Apple M Series with at least 32GB

Installation

  1. Clone the repository:
git clone https://github.com/byjlw/video-analyzer.git
cd video-analyzer
  1. Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  1. Install the package:
pip install .  # For regular installation
# OR
pip install -e .  # For development installation
  1. Install FFmpeg:
  • Ubuntu/Debian:
    sudo apt-get update && sudo apt-get install -y ffmpeg
    
  • macOS:
    brew install ffmpeg
    
  • Windows:
    choco install ffmpeg
    

Ollama Setup

  1. Install Ollama following the instructions at ollama.ai

  2. Pull the default vision model:

ollama pull llama3.2-vision
  1. Start the Ollama service:
ollama serve

OpenAI-compatible API Setup (Optional)

If you want to use OpenAI-compatible APIs (like OpenRouter or OpenAI) instead of Ollama:

  1. Get an API key from your provider:

  2. Configure via command line:

    # For OpenRouter
    video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://openrouter.ai/api/v1 --model gpt-4o
    
    # For OpenAI
    video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://api.openai.com/v1 --model gpt-4o
    

    Or add to config/config.json:

    {
      "clients": {
        "default": "openai_api",
        "openai_api": {
          "api_key": "your-api-key",
          "api_url": "https://openrouter.ai/api/v1"  # or https://api.openai.com/v1
        }
      }
    }
    

Note: With OpenRouter, you can use llama 3.2 11b vision for free by adding :free to the model name

Design

For detailed information about the project's design and implementation, including how to make changes, see docs/DESIGN.md.

Usage

For detailed usage instructions and all available options, see docs/USAGES.md.

Quick Start

# Local analysis with Ollama (default)
video-analyzer video.mp4

# Cloud analysis with OpenRouter
video-analyzer video.mp4 \
    --client openai_api \
    --api-key your-key \
    --api-url https://openrouter.ai/api/v1 \
    --model meta-llama/llama-3.2-11b-vision-instruct:free

# Analysis with custom prompt
video-analyzer video.mp4 \
    --prompt "What activities are happening in this video?" \
    --whisper-model large

Output

The tool generates a JSON file (output\analysis.json) containing:

  • Metadata about the analysis
  • Audio transcript (if available)
  • Frame-by-frame analysis
  • Final video description

Sample Output

The video begins with a person with long blonde hair, wearing a pink t-shirt and yellow shorts, standing in front of a black plastic tub or container on wheels. The ground appears to be covered in wood chips.\n\nAs the video progresses, the person remains facing away from the camera, looking down at something inside the tub. ........

full sample output in docs/sample_analysis.json

Configuration

The tool uses a cascading configuration system with command line arguments taking highest priority, followed by user config (config/config.json), and finally the default config. See docs/USAGES.md for detailed configuration options.

Uninstallation

To uninstall the package:

pip uninstall video-analyzer

License

Apache License

Contributing

We welcome contributions! Please see docs/CONTRIBUTING.md for detailed guidelines on how to:

  • Review the project design
  • Propose changes through GitHub Discussions
  • Submit pull requests

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

video_analyzer-0.1.1.tar.gz (24.8 kB view details)

Uploaded Source

Built Distribution

video_analyzer-0.1.1-py3-none-any.whl (26.8 kB view details)

Uploaded Python 3

File details

Details for the file video_analyzer-0.1.1.tar.gz.

File metadata

  • Download URL: video_analyzer-0.1.1.tar.gz
  • Upload date:
  • Size: 24.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.8

File hashes

Hashes for video_analyzer-0.1.1.tar.gz
Algorithm Hash digest
SHA256 304da0812b49cd761383478527bfbcf9e27f63400e95b909becbda209e2b2391
MD5 e740a7d52ccbdb2fbdcc5ee5e7d6d0d1
BLAKE2b-256 9b95643a6721a6088f3ccd8a5612edb70fe7fcb6f0549cc29a947bbca18848a1

See more details on using hashes here.

File details

Details for the file video_analyzer-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for video_analyzer-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 954a11504608fe5f5594d46f9221bdceff664bc759a9469b64a37d5b8c4f7c86
MD5 354e9f3b055f1b914b544d9ed562268f
BLAKE2b-256 f3ba343702b730dcc1efb7ea576b1f6c26066653bc8e8f57d50c1bf725ab74c9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page