RAG over audio files with a provider-agnostic pipeline

AudioRAG

Provider-agnostic RAG pipeline for audio content. Transcribe, chunk, embed, and search audio from local files.

Features

  • Multi-provider support: OpenAI, Deepgram, AssemblyAI, Groq (STT); OpenAI, Voyage, Cohere (embeddings); OpenAI, Anthropic, Gemini (generation); ChromaDB, Pinecone, Weaviate, Supabase (vector stores)
  • Local file support: Process audio files and directories from local storage
  • Batch indexing: Index multiple files and directories in one command
  • Source discovery: Recursively scan directories for audio files
  • Resumable processing: SQLite state tracking with hash-based IDs
  • Provider-aware vector IDs: Canonical SHA-256 chunk IDs with optional UUID5 conversion per vector store
  • Proactive budget governor: Optional fail-fast limits for RPM, TPM, and audio-seconds/hour
  • Atomic vector verification: Optional post-write verification with strict or best-effort modes
  • Automatic chunking: Time-based segmentation with configurable duration
  • Audio splitting: Handles large files by splitting before transcription
  • Structured logging: Context-aware logging with operation timing
  • Graceful exit: Automatic resource cleanup on completion, failure, or interruption
  • Type-safe: Python 3.12+ with full type annotations
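
The "provider-aware vector IDs" feature can be pictured as two deterministic steps: hash the chunk into a canonical SHA-256 ID, then derive a UUID5 from that ID for stores that expect UUIDs. The sketch below uses only the standard library. The helper names and the fields that go into the hash (source path, start, end) are assumptions for illustration, not AudioRAG's internal API.

```python
import hashlib
import uuid


def canonical_chunk_id(source: str, start: float, end: float) -> str:
    """Return a SHA-256 hex digest that is stable for the same chunk.

    Hypothetical helper: the fields AudioRAG actually hashes may differ.
    """
    key = f"{source}|{start:.3f}|{end:.3f}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def to_uuid5(chunk_id: str, namespace: uuid.UUID = uuid.NAMESPACE_DNS) -> str:
    """Map a canonical ID to a deterministic UUID5 string."""
    return str(uuid.uuid5(namespace, chunk_id))


chunk_id = canonical_chunk_id("./podcast_episode.mp3", 0.0, 30.0)
vector_id = to_uuid5(chunk_id)
print(len(chunk_id))  # 64 hex characters
print(vector_id)
```

Because both steps are pure functions of their input, re-indexing the same chunk yields the same ID in either format. The default namespace used here, `uuid.NAMESPACE_DNS`, is the standard value `6ba7b810-9dad-11d1-80b4-00c04fd430c8`.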

Quick Start

import asyncio
from audiorag import AudioRAGPipeline, AudioRAGConfig

async def main():
    # Configure with your chosen providers
    config = AudioRAGConfig(
        stt_provider="openai",
        stt_model="whisper-1",
        embedding_provider="openai",
        embedding_model="text-embedding-3-small",
        vector_store_provider="chromadb",
        generation_provider="openai",
        generation_model="gpt-4o-mini",
        # API keys can also be set via environment variables
        openai_api_key="sk-...",
    )
    
    # Use async context manager for automatic resource cleanup
    async with AudioRAGPipeline(config) as pipeline:
        # Index audio from local file
        await pipeline.index("./podcast_episode.mp3")

        # Batch indexing with partial-failure reporting
        batch_result = await pipeline.index_many(
            [
                "./podcasts/",
                "./interview.wav",
                "./lecture.mp3",
            ],
            raise_on_error=False,
        )
        print(
            f"Indexed={len(batch_result.indexed_sources)} "
            f"Skipped={len(batch_result.skipped_sources)} "
            f"Failed={len(batch_result.failures)}"
        )

        # Query the indexed content
        result = await pipeline.query("What are the main points discussed?")
        print(result.answer)

        # Access sources with timestamps
        for source in result.sources:
            print(f"{source.title} at {source.start_time}s")
            print(f"URL: {source.source_url}")

asyncio.run(main())

Installation

# Install with uv (recommended)
uv add audiorag

# Or with pip
pip install audiorag

Optional Dependencies

# All providers and utilities
uv add audiorag[all]  # or: pip install audiorag[all]

# Specific providers only
uv add audiorag[openai,chromadb,cohere]

Command Line Interface

AudioRAG includes a CLI for setup, indexing, and querying.

Setup

Configure your providers and API keys interactively:

audiorag setup

This will guide you through selecting providers for STT, embeddings, vector stores, and generation, saving them to a .env file.

Indexing

Index audio from local files and directories:

# Single audio file
audiorag index "./podcast_episode.mp3"

# Directory (recursively finds all audio files)
audiorag index "./podcasts/"

# Multiple files at once
audiorag index "./interview.wav" "./lecture.mp3"

# Multiple directories
audiorag index "./podcasts/" "./lectures/"

Note: Always wrap paths containing spaces in quotes.

Options:

  • --force: Re-process and re-index even if the file has been processed before.
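
The skip-unless-forced behaviour behind --force can be pictured as a lookup in a state table keyed by a hash. This is a minimal sketch, not AudioRAG's actual schema: the table name `sources`, the `status` column, and hashing the path string are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def source_id(source: str) -> str:
    """Hash-based ID for a source (assumption: derived from the path string)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def should_process(conn: sqlite3.Connection, source: str, force: bool = False) -> bool:
    """Return True when the source is new, or when force overrides the record."""
    row = conn.execute(
        "SELECT 1 FROM sources WHERE id = ? AND status = 'completed'",
        (source_id(source),),
    ).fetchone()
    return force or row is None


def mark_completed(conn: sqlite3.Connection, source: str) -> None:
    """Record that a source finished indexing."""
    conn.execute(
        "INSERT OR REPLACE INTO sources (id, status) VALUES (?, 'completed')",
        (source_id(source),),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a real pipeline would use a file on disk
conn.execute("CREATE TABLE sources (id TEXT PRIMARY KEY, status TEXT NOT NULL)")

print(should_process(conn, "./lecture.mp3"))              # True: never seen
mark_completed(conn, "./lecture.mp3")
print(should_process(conn, "./lecture.mp3"))              # False: already indexed
print(should_process(conn, "./lecture.mp3", force=True))  # True: forced
```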

The CLI automatically:

  • Recursively discovers audio files in directories
  • Shows aggregate batch results (indexed/skipped/failed) with per-source failures
  • Handles errors per source without stopping the entire batch
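
Recursive discovery of the kind listed above can be sketched with `pathlib`. The extension set and the function name `discover_audio` are assumptions for illustration. The formats AudioRAG actually accepts may differ.

```python
from pathlib import Path

# Assumed extension set, not taken from AudioRAG's source.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def discover_audio(inputs: list[str]) -> list[Path]:
    """Expand files and directories into a sorted, de-duplicated list of audio files."""
    found: set[Path] = set()
    for raw in inputs:
        path = Path(raw)
        # Directories are walked recursively; anything else is checked as a single file.
        candidates = path.rglob("*") if path.is_dir() else [path]
        for candidate in candidates:
            if candidate.is_file() and candidate.suffix.lower() in AUDIO_EXTENSIONS:
                found.add(candidate)
    return sorted(found)


# Paths that do not exist are simply ignored in this sketch.
print(discover_audio(["./podcasts/", "./interview.wav"]))
```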

Querying

Ask questions about your indexed audio content:

audiorag query "What are the main points discussed in the audio?"

Configuration

AudioRAG uses pydantic-settings with environment variable support. All settings use the AUDIORAG_ prefix.

# Example: Using OpenAI for STT, embeddings, and generation
export AUDIORAG_OPENAI_API_KEY="sk-..."
export AUDIORAG_STT_PROVIDER="openai"
export AUDIORAG_EMBEDDING_PROVIDER="openai"
export AUDIORAG_VECTOR_STORE_PROVIDER="chromadb"
export AUDIORAG_GENERATION_PROVIDER="openai"

# Example: Using different providers
export AUDIORAG_DEEPGRAM_API_KEY="..."
export AUDIORAG_STT_PROVIDER="deepgram"
export AUDIORAG_VOYAGE_API_KEY="..."
export AUDIORAG_EMBEDDING_PROVIDER="voyage"

# Processing settings
export AUDIORAG_CHUNK_DURATION_SECONDS="30"
export AUDIORAG_RETRIEVAL_TOP_K="10"
export AUDIORAG_RERANK_TOP_N="3"

# Optional budget governor
export AUDIORAG_BUDGET_ENABLED="true"
export AUDIORAG_BUDGET_RPM="60"
export AUDIORAG_BUDGET_TPM="120000"
export AUDIORAG_BUDGET_AUDIO_SECONDS_PER_HOUR="7200"

# Optional vector write verification
export AUDIORAG_VECTOR_STORE_VERIFY_MODE="best_effort"  # off | best_effort | strict
export AUDIORAG_VECTOR_STORE_VERIFY_MAX_ATTEMPTS="5"
export AUDIORAG_VECTOR_STORE_VERIFY_WAIT_SECONDS="0.5"

# Optional vector ID strategy
export AUDIORAG_VECTOR_ID_FORMAT="auto"  # auto | sha256 | uuid5
export AUDIORAG_VECTOR_ID_UUID5_NAMESPACE="6ba7b810-9dad-11d1-80b4-00c04fd430c8"  # optional

See Configuration Guide for all options.

Development

# Clone and setup
git clone <repository-url>
cd audiorag
uv sync

# Run tests
uv run pytest

# Run checks
uv run ruff check . --fix
uv run ty check

# Install pre-commit hooks
uv run prek install

Pipeline Stages

  1. Ingest: Load audio from local files
  2. Split: Divide large files into processable chunks
  3. Transcribe: Convert audio to text using STT provider
  4. Chunk: Group transcription into time-based segments
  5. Embed: Generate vector embeddings for each chunk
  6. Store: Persist embeddings in vector database
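
The Chunk stage can be illustrated with a small time-window grouping routine. This is a sketch under assumptions: the `Segment` type and the rule "close a chunk once adding the next segment would exceed the window" are invented here and may not match AudioRAG's chunker.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def _merge(group: list[Segment]) -> Segment:
    """Collapse consecutive segments into one chunk spanning all of them."""
    return Segment(group[0].start, group[-1].end, " ".join(s.text for s in group))


def chunk_by_duration(segments: list[Segment], chunk_seconds: float = 30.0) -> list[Segment]:
    """Group transcript segments into chunks of at most chunk_seconds each.

    A single segment longer than the window becomes its own chunk.
    """
    chunks: list[Segment] = []
    current: list[Segment] = []
    for seg in segments:
        # Flush when this segment would push the chunk past the window.
        if current and seg.end - current[0].start > chunk_seconds:
            chunks.append(_merge(current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_merge(current))
    return chunks


transcript = [
    Segment(0.0, 12.0, "a"),
    Segment(12.0, 25.0, "b"),
    Segment(25.0, 41.0, "c"),
    Segment(41.0, 50.0, "d"),
]
for chunk in chunk_by_duration(transcript, chunk_seconds=30.0):
    print(chunk.start, chunk.end, chunk.text)
```

Keeping the start and end time on every chunk is what lets query results point back to a timestamp in the original audio.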

Reliability Controls

  • Budget governor (AUDIORAG_BUDGET_ENABLED=true): reserves budget before expensive calls and fails fast with BudgetExceededError when limits would be exceeded.
  • Duration reconciliation: actual audio duration is compared to estimated duration after processing, with automatic budget adjustment.
  • Preflight transcription reservation: when audio duration is known, indexing reserves full audio-seconds budget before STT starts.
  • Persistent budget accounting: budget usage is persisted in SQLite for cross-process and restart safety.
  • Vector write verification: after add(), providers that support verify(ids) are checked.
  • Verification modes: off disables checks, best_effort warns on failure, strict fails indexing when verification fails.
  • Provider-aware vector IDs: state IDs stay SHA-256; vector-store IDs can be auto-resolved to UUID5 for UUID-oriented providers.
  • Safe strategy changes: if vector ID strategy changes for an existing source, reindex with force=True to avoid mixed IDs.
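
The reserve-then-reconcile pattern behind the budget governor can be sketched as a sliding-window counter. This is an in-memory illustration only: the class `SlidingWindowBudget` and its methods are hypothetical, the exception is a local stand-in for the `BudgetExceededError` named above, and the SQLite persistence described above is left out.

```python
import time
from collections import deque
from collections.abc import Callable


class BudgetExceededError(RuntimeError):
    """Local stand-in for the error named above."""


class SlidingWindowBudget:
    """Tracks usage over a rolling window and fails fast when a reservation will not fit."""

    def __init__(self, limit: float, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._events: deque[tuple[float, float]] = deque()  # (timestamp, amount)
        self._used = 0.0

    def _evict(self, now: float) -> None:
        # Drop entries that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, amount = self._events.popleft()
            self._used -= amount

    def reserve(self, amount: float) -> None:
        """Claim budget before an expensive call, or raise without claiming anything."""
        now = self._clock()
        self._evict(now)
        if self._used + amount > self.limit:
            raise BudgetExceededError(
                f"need {amount}, only {self.limit - self._used} left in window"
            )
        self._events.append((now, amount))
        self._used += amount

    def reconcile(self, reserved: float, actual: float) -> None:
        """Correct usage once the real cost (e.g. actual audio duration) is known."""
        delta = actual - reserved
        self._events.append((self._clock(), delta))
        self._used += delta


# Audio-seconds per hour, mirroring AUDIORAG_BUDGET_AUDIO_SECONDS_PER_HOUR="7200".
audio_budget = SlidingWindowBudget(limit=7200, window_seconds=3600)
audio_budget.reserve(3600)
audio_budget.reserve(3000)
try:
    audio_budget.reserve(1000)
except BudgetExceededError as exc:
    print(f"rejected before calling the STT provider: {exc}")
```

Reserving before the call means a request that would exceed the limit costs nothing, and reconciling afterwards keeps the running total honest when the estimate was off.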

License

MIT License
