🔄 RAGVersion

Async-first version tracking system for RAG applications

RAGVersion is a plug-and-play module that tracks document changes and integrates seamlessly with LangChain, LlamaIndex, and other RAG frameworks. It provides automatic version control, change detection, and content diffing for your document pipelines.

✨ Features

Core Capabilities

  • 🚀 Async-first architecture - Built for modern Python async/await patterns
  • 📦 Plug-and-play - Works with any RAG system
  • 🔄 Batch processing - Efficiently process large document collections
  • 👀 Real-time file watching - Automatic tracking with daemon mode
  • 🛡️ Resilient - Continue-on-error design for production systems

Integrations & Storage

  • 💾 Zero-config SQLite - Default local storage, no setup required
  • ☁️ Supabase option - Cloud storage with PostgreSQL
  • 🌐 REST API - FastAPI-based HTTP API with automatic OpenAPI docs
  • 🔗 Framework integrations - LangChain & LlamaIndex ready
  • 📝 Document parsing - PDF, DOCX, TXT, Markdown support
  • 🔍 Change detection - Automatic tracking with content hashing
  • ⚡ GitHub Actions - Automatic tracking in CI/CD pipelines
  • 🔔 Smart notifications - Slack, Discord, Email, and webhook alerts
  • ⚡ Query optimization - 100-1000x faster queries with comprehensive indexing

🎯 Why RAGVersion?

Problem: RAG applications need to track when documents change to keep vector databases in sync, but most solutions require manual tracking or complex pipelines.

Solution: RAGVersion automatically detects document changes and provides version history, making it easy to maintain up-to-date RAG systems.
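To make the idea of hash-based change detection concrete, here is a minimal sketch using only the standard library. It is not RAGVersion's implementation; the function names `content_hash` and `detect_change` and the in-memory `seen` store are hypothetical and exist only for illustration.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def detect_change(previous_hash: Optional[str], data: bytes) -> str:
    """Classify a document as created, modified, or unchanged."""
    if previous_hash is None:
        return "created"
    return "unchanged" if content_hash(data) == previous_hash else "modified"


# Illustrative in-memory store mapping a path to the last hash seen for it.
seen = {}
for path, data in [("a.md", b"v1"), ("a.md", b"v1"), ("a.md", b"v2")]:
    status = detect_change(seen.get(path), data)
    seen[path] = content_hash(data)
    print(path, status)
```

Comparing digests rather than modification times means a file that is re-saved without edits is not reported as changed.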

Perfect for:

  • 📚 Documentation sites that need to track content updates
  • 🤖 AI chatbots that need fresh knowledge bases
  • 📊 Data pipelines processing evolving documents
  • 🔄 Systems requiring audit trails of document changes

📦 Installation

# Basic installation
pip install ragversion

# With all parsers (quote the extras so shells such as zsh do not expand the brackets)
pip install "ragversion[parsers]"

# With REST API support
pip install "ragversion[api]"

# With LangChain integration
pip install "ragversion[langchain]"

# With LlamaIndex integration
pip install "ragversion[llamaindex]"

# Everything (recommended)
pip install "ragversion[all]"

System Requirements:

  • Python 3.9+
  • (Optional) Supabase account for cloud storage

📋 Optional Dependencies

  • parsers - PDF, DOCX, and other document parsers
  • api - REST API support
  • langchain - LangChain framework integration
  • llamaindex - LlamaIndex framework integration
  • all - All optional dependencies

🚀 Quick Start

Zero-Config Setup (SQLite - Recommended for Getting Started)

# 1. Install RAGVersion
pip install "ragversion[all]"

# 2. Start tracking immediately - no configuration needed!
ragversion track ./documents

# That's it! RAGVersion uses SQLite by default (ragversion.db)

Basic Usage (Python)

import asyncio
from ragversion import AsyncVersionTracker
from ragversion.storage import SQLiteStorage

async def main():
    # Initialize tracker with SQLite (zero configuration)
    tracker = AsyncVersionTracker(
        storage=SQLiteStorage()  # Creates ragversion.db automatically
    )

    # Track a single file
    change = await tracker.track("document.pdf")
    if change:
        print(f"Document changed: {change.change_type}")

    # Track a directory (batch processing)
    result = await tracker.track_directory(
        "./documents",
        patterns=["*.pdf", "*.docx"],
        recursive=True
    )

    print(f"✅ Processed: {len(result.successful)} files")
    print(f"❌ Failed: {len(result.failed)} files")

asyncio.run(main())
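The `patterns` and `recursive` arguments above suggest glob-style file selection. The sketch below shows one way such selection can work with the standard library; `select_files` is a hypothetical helper, not part of RAGVersion, and the library's actual matching rules may differ.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def select_files(root, patterns, recursive=True):
    """Return sorted files under root whose names match any glob pattern."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and any(fnmatch(p.name, pat) for pat in patterns)
    )


# Build a throwaway directory tree to demonstrate the selection.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ["a.pdf", "b.txt", "sub/c.docx", "sub/d.md"]:
        (base / name).write_bytes(b"")
    wanted = ["*.pdf", "*.docx"]
    selected = [p.relative_to(base).as_posix() for p in select_files(base, wanted)]
    top_only = [
        p.relative_to(base).as_posix()
        for p in select_files(base, wanted, recursive=False)
    ]

print(selected)
print(top_only)
```

With `recursive=False` only the top-level directory is scanned, so files in subdirectories are skipped.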

☁️ Cloud Setup (Supabase - For Production/Collaboration)

# 1. Install RAGVersion
pip install "ragversion[all]"

# 2. Set environment variables
export SUPABASE_URL="https://your-project.supabase.co"
export SUPABASE_SERVICE_KEY="your-service-key"

# 3. Configure backend
echo "storage:
  backend: supabase
  supabase:
    url: \${SUPABASE_URL}
    key: \${SUPABASE_SERVICE_KEY}" > ragversion.yaml

# 4. Initialize database
ragversion migrate

# 5. Start tracking!
ragversion track ./documents

Python usage with Supabase:

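import asyncio
from ragversion import AsyncVersionTracker
from ragversion.storage import SupabaseStorage

async def main():
    # Credentials come from the environment variables set above
    tracker = AsyncVersionTracker(
        storage=SupabaseStorage.from_env()
    )
    # ... rest of your code

asyncio.run(main())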

⚡ Framework Integration (LangChain/LlamaIndex)

NEW in v0.11.0: One-line setup for LangChain and LlamaIndex!

LangChain (3 lines!)

from ragversion.integrations.langchain import quick_start

# That's it! 🚀 (await must run inside an async function or an async REPL)
sync = await quick_start("./documents")

# Ready to query
results = await sync.vectorstore.asimilarity_search("query")

LlamaIndex (3 lines!)

from ragversion.integrations.llamaindex import quick_start

# That's it! 🚀 (await must run inside an async function or an async REPL)
sync = await quick_start("./documents")

# Ready to query
query_engine = sync.index.as_query_engine()
response = query_engine.query("query")

What quick_start() does automatically:

  • ✅ Creates and initializes RAGVersion tracker
  • ✅ Sets up vector store (FAISS/Chroma for LangChain)
  • ✅ Configures embeddings (OpenAI by default)
  • ✅ Creates text splitter with optimal defaults
  • ✅ Syncs your documents directory
  • ✅ Enables smart chunk-level tracking (80-95% cost savings!)

Before vs After:

# BEFORE: 35+ lines of boilerplate 😰
storage = SupabaseStorage.from_env()
tracker = AsyncVersionTracker(storage=storage)
await tracker.initialize()
text_splitter = RecursiveCharacterTextSplitter(...)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(...)
sync = LangChainSync(tracker, text_splitter, embeddings, vectorstore)
await sync.sync_directory("./documents")

# AFTER: 3 lines 🎉
from ragversion.integrations.langchain import quick_start
sync = await quick_start("./documents")
# Done!

Customization options:

# LangChain with custom settings
sync = await quick_start(
    directory="./documents",
    vectorstore_type="faiss",        # or "chroma"
    vectorstore_path="./vectorstore", # persistent storage
    storage_backend="sqlite",         # or "supabase", "auto"
    chunk_size=500,                  # custom chunk size
    enable_chunk_tracking=True,      # smart updates (default)
)

# LlamaIndex with custom settings
sync = await quick_start(
    directory="./documents",
    storage_backend="supabase",      # cloud storage
    chunk_size=2048,                 # larger chunks
    enable_chunk_tracking=False,     # disable for full re-embedding
)

👉 See complete quick start examples


🎓 Complete Integration Guide

Want to integrate RAGVersion with LangChain or LlamaIndex?

👉 Read the complete How-to Guide - Comprehensive guide with 10+ practical examples:

  • ✅ LangChain integration (basic and chunk-level)
  • ✅ LlamaIndex integration (basic and chunk-level)
  • ✅ Real-time file watching
  • ✅ Cost optimization with chunk-level versioning (80-95% savings!)
  • ✅ 4 common use cases (docs, support KB, research, multi-tenant)
  • ✅ Best practices and troubleshooting
  • ✅ Production-ready complete example

Quick Example - LangChain with Chunk Tracking:

from ragversion import AsyncVersionTracker
from ragversion.models import ChunkingConfig
from ragversion.integrations.langchain import LangChainSync

# Assumes storage, embeddings, and vectorstore were created earlier.
# Enable chunk tracking for 80-95% cost savings!
chunk_config = ChunkingConfig(enabled=True, chunk_size=500)
tracker = AsyncVersionTracker(
    storage=storage,
    chunk_tracking_enabled=True,
    chunk_config=chunk_config
)

# Auto-sync with LangChain - only changed chunks re-embedded!
sync = LangChainSync(
    tracker=tracker,
    embeddings=embeddings,
    vectorstore=vectorstore,
    enable_chunk_tracking=True
)
await sync.sync_directory("./docs")
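The savings from chunk-level tracking come from re-embedding only the chunks whose content changed. The following is a minimal sketch of that idea, assuming fixed-size character chunks and SHA-256 chunk hashes; the helper names are hypothetical and RAGVersion's own chunking and comparison logic may differ.

```python
import hashlib


def split_chunks(text, chunk_size=500):
    """Split text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_hashes(text, chunk_size=500):
    """Hash every chunk so chunks can be compared without storing their text."""
    return [
        hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        for chunk in split_chunks(text, chunk_size)
    ]


def chunks_to_reembed(old_text, new_text, chunk_size=500):
    """Return indexes of chunks in new_text whose hash was not seen in old_text."""
    known = set(chunk_hashes(old_text, chunk_size))
    return [
        i for i, digest in enumerate(chunk_hashes(new_text, chunk_size))
        if digest not in known
    ]


old = "A" * 500 + "B" * 500 + "C" * 500
new = "A" * 500 + "X" * 500 + "C" * 500
changed = chunks_to_reembed(old, new)
print(f"Re-embed {len(changed)} of {len(split_chunks(new))} chunks: {changed}")
```

Fixed-size chunking is the simplest case: an insertion near the start of a document shifts every later chunk boundary, so real splitters that follow paragraph or sentence boundaries tend to keep more chunks stable.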

⚙️ Configuration

Default (SQLite) - No Configuration Required

RAGVersion works out of the box with SQLite. No setup needed!

# Just start tracking - uses ragversion.db by default
ragversion track ./documents

Custom Configuration File (Optional)

Create a ragversion.yaml file for advanced settings:

storage:
  backend: sqlite  # or "supabase" for cloud storage
  sqlite:
    db_path: ragversion.db
    content_compression: true

tracking:
  store_content: true
  max_file_size_mb: 50
  batch:
    max_workers: 4
    on_error: continue

content:
  compression: gzip
  ttl_days: 365
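The `compression: gzip` setting trades a little CPU time for smaller stored content. As a rough illustration of that trade-off, here is a standard-library sketch of a compress/decompress round trip; `compress_content` and `decompress_content` are hypothetical names, and level 6 is used only because it matches the `compression_level` shown in the full configuration example.

```python
import gzip


def compress_content(text: str, level: int = 6) -> bytes:
    """Compress document text with gzip at the given level (1-9)."""
    return gzip.compress(text.encode("utf-8"), compresslevel=level)


def decompress_content(blob: bytes) -> str:
    """Restore the original text from a gzip blob."""
    return gzip.decompress(blob).decode("utf-8")


original = "RAG pipelines re-index documents when they change.\n" * 200
blob = compress_content(original)
restored = decompress_content(blob)
print(f"{len(original.encode('utf-8'))} bytes -> {len(blob)} bytes")
```

Repetitive text such as this sample compresses far better than typical documents, so actual savings depend on the content.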

Switching to Supabase (Cloud Storage)

For production or team collaboration:

storage:
  backend: supabase
  supabase:
    url: ${SUPABASE_URL}
    key: ${SUPABASE_SERVICE_KEY}

Or use environment variables:

export RAGVERSION_STORAGE_BACKEND=supabase
export SUPABASE_URL="https://your-project.supabase.co"
export SUPABASE_SERVICE_KEY="your-service-key"
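The `${SUPABASE_URL}` placeholders in the YAML are filled from the environment when the configuration is loaded. How RAGVersion does this internally is not shown here; the sketch below is one plausible way to expand such placeholders, with `expand_placeholders` as a hypothetical helper that fails loudly when a variable is missing.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text, env=None):
    """Replace ${NAME} with env[NAME]; raise KeyError if NAME is not set."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, text)


config_text = "url: ${SUPABASE_URL}\nkey: ${SUPABASE_SERVICE_KEY}"
example_env = {
    "SUPABASE_URL": "https://your-project.supabase.co",
    "SUPABASE_SERVICE_KEY": "your-service-key",
}
expanded = expand_placeholders(config_text, example_env)
print(expanded)
```

Raising on a missing variable surfaces a misconfigured deployment at startup rather than at the first storage call.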

🔧 Advanced Configuration Options

# Full configuration example with all options
storage:
  backend: supabase
  supabase:
    url: ${SUPABASE_URL}
    key: ${SUPABASE_SERVICE_KEY}
    connection_timeout: 30
    retry_attempts: 3

tracking:
  store_content: true
  max_file_size_mb: 50
  hash_algorithm: sha256
  batch:
    max_workers: 4
    on_error: continue
    timeout_seconds: 300

content:
  compression: gzip
  compression_level: 6
  ttl_days: 365

notifications:
  enabled: true
  notifiers:
    - type: slack
      name: team-slack
      enabled: true
      webhook_url: ${SLACK_WEBHOOK_URL}
    - type: discord
      name: dev-discord
      enabled: true
      webhook_url: ${DISCORD_WEBHOOK_URL}
    - type: email
      name: admin-email
      enabled: true
      smtp_host: smtp.gmail.com
      smtp_port: 587
      smtp_username: ${EMAIL_USERNAME}
      smtp_password: ${EMAIL_PASSWORD}
      from_address: ragversion@company.com
      to_addresses:
        - admin@company.com

events:
  enabled: true
  handlers:
    - type: webhook
      url: https://your-webhook-url.com

⚡ GitHub Actions Integration

Automatically track documentation changes in your CI/CD pipeline:

# .github/workflows/track-docs.yml
name: Track Documentation

on:
  push:
    branches: [main]
    paths: ['docs/**', '*.md']

jobs:
  track:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Track documentation with RAGVersion
        uses: sourangshupal/ragversion/.github/actions/ragversion-track@v0.4.0
        with:
          paths: 'docs/ README.md'
          storage-backend: 'sqlite'
          file-patterns: '*.md *.txt *.pdf'

Benefits:

  • ✅ Automatic tracking on every commit
  • ✅ PR documentation validation
  • ✅ Scheduled tracking jobs
  • ✅ Zero manual intervention
  • ✅ Archive tracking history as artifacts

Common Use Cases:

PR Checks

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  check-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sourangshupal/ragversion/.github/actions/ragversion-track@v0.4.0
        with:
          paths: 'docs/'
          fail-on-error: true

Scheduled Tracking

on:
  schedule:
    - cron: '0 0 * * *'  # Daily

jobs:
  track:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sourangshupal/ragversion/.github/actions/ragversion-track@v0.4.0
        with:
          paths: 'docs/ examples/'
          max-workers: 8

📖 Full documentation: docs/GITHUB_ACTIONS.md


👀 Real-Time File Watching

Automatically track document changes without manual intervention:

# Start watching a directory
ragversion watch ./docs

# Watch only Markdown files
ragversion watch ./docs --pattern "*.md"

# Watch multiple directories
ragversion watch ./docs ./guides --pattern "*.md" --pattern "*.txt"

Features:

  • ✅ Real-time change detection (create, modify, delete)
  • ✅ Pattern matching for specific file types
  • ✅ Recursive directory watching
  • ✅ Automatic debouncing
  • ✅ Custom change callbacks
  • ✅ Daemon mode for 24/7 monitoring

Python API:

import asyncio
from ragversion import AsyncVersionTracker, watch_directory
from ragversion.storage import SQLiteStorage

async def on_change(change):
    print(f"📄 {change.change_type.value}: {change.file_name}")

async def main():
    storage = SQLiteStorage()  # or any configured storage backend
    async with AsyncVersionTracker(storage=storage) as tracker:
        await watch_directory(
            tracker,
            "./docs",
            patterns=["*.md", "*.txt"],
            on_change=on_change
        )

asyncio.run(main())
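Editors often write a file several times in quick succession, which is why the watcher lists automatic debouncing among its features. The sketch below illustrates the general technique with asyncio: each new event for a path restarts a short timer, and the callback runs once the path has been quiet. The `Debouncer` class is hypothetical and is not RAGVersion's watcher code.

```python
import asyncio


class Debouncer:
    """Collapse bursts of events for the same key into one callback call."""

    def __init__(self, delay, callback):
        self.delay = delay
        self.callback = callback
        self._pending = {}

    def notify(self, key):
        # Cancel the timer already running for this key, then restart it.
        task = self._pending.get(key)
        if task is not None:
            task.cancel()
        self._pending[key] = asyncio.create_task(self._fire(key))

    async def _fire(self, key):
        await asyncio.sleep(self.delay)
        self._pending.pop(key, None)
        await self.callback(key)


async def demo():
    handled = []

    async def on_change(path):
        handled.append(path)

    debouncer = Debouncer(delay=0.05, callback=on_change)
    for _ in range(5):  # five rapid saves of the same file
        debouncer.notify("docs/guide.md")
        await asyncio.sleep(0.01)
    debouncer.notify("docs/faq.md")
    await asyncio.sleep(0.2)  # give both timers time to expire
    return handled


handled = asyncio.run(demo())
print(handled)
```

Five rapid saves of the same file produce a single callback, while a different file gets its own independent timer.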

Use Cases:

  • 🔄 Development environment (auto-track while editing)
  • 🚀 Production monitoring (24/7 daemon mode)
  • 🔔 Custom notifications (Slack, email, webhooks)
  • 🤖 RAG integration (auto-update vector stores)

📖 Full documentation: docs/FILE_WATCHING.md


🔔 Notifications

Get real-time alerts when documents change via Slack, Discord, Email, or custom webhooks.

# ragversion.yaml
notifications:
  enabled: true
  notifiers:
    - type: slack
      name: team-slack
      enabled: true
      webhook_url: ${SLACK_WEBHOOK_URL}
      mention_on_types: ["deleted"]  # Mention users for deletions

    - type: discord
      name: dev-discord
      enabled: true
      webhook_url: ${DISCORD_WEBHOOK_URL}

    - type: email
      name: admin-email
      enabled: true
      smtp_host: smtp.gmail.com
      smtp_port: 587
      smtp_username: ${EMAIL_USERNAME}
      smtp_password: ${EMAIL_PASSWORD}
      from_address: ragversion@company.com
      to_addresses:
        - admin@company.com

Supported Providers:

  • 💬 Slack - Rich formatted messages with user mentions
  • 🎮 Discord - Embed-based notifications with role mentions
  • 📧 Email - HTML/plain text via SMTP
  • 🔗 Webhook - Custom HTTP endpoints for any integration

Features:

  • ✅ Multiple providers simultaneously
  • ✅ Parallel or sequential delivery
  • ✅ Conditional notifications (e.g., only for deletions)
  • ✅ User/role mentions
  • ✅ Custom metadata in messages
  • ✅ Automatic retry and error handling

CLI Usage:

# Notifications are sent automatically with file watching
ragversion watch ./documents --config ragversion.yaml

Python API:

import asyncio
from ragversion import AsyncVersionTracker
from ragversion.storage import SQLiteStorage
from ragversion.notifications import create_notification_manager
from ragversion.config import RAGVersionConfig

async def main():
    # Load config with notifications
    config = RAGVersionConfig.load("ragversion.yaml")
    notification_manager = create_notification_manager(
        config.notifications.notifiers
    )

    # Create tracker with notifications
    tracker = AsyncVersionTracker(
        storage=SQLiteStorage(),
        notification_manager=notification_manager
    )

    async with tracker:
        await tracker.track("./documents/report.pdf")
        # Notifications sent automatically

asyncio.run(main())
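To show what a conditional notification such as `mention_on_types: ["deleted"]` could translate to, here is a sketch that builds a Slack incoming-webhook request without sending it. Slack incoming webhooks accept a JSON body with a `text` field; everything else (the `build_slack_request` helper, the message wording, the placeholder webhook URL, and the choice of `<!channel>` as the mention) is an assumption for illustration, not RAGVersion's actual message format.

```python
import json
import urllib.request


def build_slack_request(webhook_url, change_type, file_name,
                        mention_on_types=("deleted",)):
    """Build (but do not send) a Slack incoming-webhook request for a change."""
    text = f"Document {change_type}: {file_name}"
    if change_type in mention_on_types:
        # Only the configured change types notify the whole channel.
        text = "<!channel> " + text
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL comes from your Slack workspace.
URL = "https://hooks.slack.com/services/T000/B000/XXXX"
deleted = json.loads(build_slack_request(URL, "deleted", "report.pdf").data)
modified = json.loads(build_slack_request(URL, "modified", "report.pdf").data)
print(deleted["text"])
print(modified["text"])
```

Passing the request to `urllib.request.urlopen` would deliver it; keeping construction separate from sending makes the formatting easy to test.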

📖 Full documentation: docs/NOTIFICATIONS.md
📁 Examples: examples/notifications/


🔗 Integrations

RAGVersion seamlessly integrates with popular RAG frameworks:

🦜 LangChain

from ragversion.integrations.langchain import LangChainSync
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

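# Assumes `tracker` is an initialized AsyncVersionTracker and
# `qdrant_store` is an existing LangChain Qdrant vector store instance.
sync = LangChainSync(
    tracker=tracker,
    text_splitter=RecursiveCharacterTextSplitter(
        chunk_size=1000
    ),
    embeddings=OpenAIEmbeddings(),
    vectorstore=qdrant_store
)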

# Automatically sync only changed documents
await sync.sync_directory("./documents")

Features:

  • ✅ Automatic change detection
  • ✅ Incremental vector store updates
  • ✅ Custom text splitters
  • ✅ Batch processing

🦙 LlamaIndex

from ragversion.integrations.llamaindex import LlamaIndexSync
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

sync = LlamaIndexSync(
    tracker=tracker,
    node_parser=SentenceSplitter(
        chunk_size=1024
    ),
    index=vector_index
)

# Keep your index in sync effortlessly
await sync.sync_directory("./documents")

Features:

  • ✅ Native LlamaIndex integration
  • ✅ Node-level tracking
  • ✅ Custom node parsers
  • ✅ Async operations

🎯 Custom Integrations

RAGVersion's modular design makes it easy to integrate with any RAG framework:

from ragversion import AsyncVersionTracker

async def custom_sync(tracker, documents_path):
    result = await tracker.track_directory(documents_path)

    for change in result.successful:
        if change.change_type in ["created", "modified"]:
            # Your custom processing logic
            await process_document(change.document)
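A custom integration usually has to map each change type to an action on the downstream index. The sketch below shows that dispatch with a plain dict standing in for a vector store; `apply_changes` and the `(change_type, doc_id, text)` tuples are hypothetical simplifications of whatever change objects your integration receives.

```python
def apply_changes(index, changes):
    """Apply (change_type, doc_id, text) tuples to a dict standing in for a vector store."""
    stats = {"upserted": 0, "removed": 0, "skipped": 0}
    for change_type, doc_id, text in changes:
        if change_type in ("created", "modified"):
            index[doc_id] = text          # insert or overwrite the entry
            stats["upserted"] += 1
        elif change_type == "deleted":
            index.pop(doc_id, None)       # drop stale entries
            stats["removed"] += 1
        else:
            stats["skipped"] += 1         # e.g. "unchanged"
    return stats


index = {"old.md": "stale text", "guide.md": "version 1"}
stats = apply_changes(index, [
    ("created", "faq.md", "new page"),
    ("modified", "guide.md", "version 2"),
    ("deleted", "old.md", ""),
    ("unchanged", "intro.md", ""),
])
print(stats)
print(sorted(index))
```

Handling deletions explicitly matters for RAG systems, since stale chunks left in the index can still be retrieved and cited.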

🖥️ CLI

RAGVersion includes a powerful command-line interface for managing document versions:

📋 Basic Commands

# Initialize a new project
ragversion init

# Track files or directories
ragversion track ./documents

# List tracked documents
ragversion list

# Run database migrations
ragversion migrate

🔍 Version Management

# View document history
ragversion history <document-id>

# Get document diff between versions
ragversion diff <document-id> --versions 1 2

# Show version details
ragversion show <version-id>
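For a sense of what a diff between two stored versions contains, the standard library's `difflib` can produce a unified diff of two text snapshots. This is only an illustration of the concept; the output format of `ragversion diff` is not documented here and may differ, and `version_diff` is a hypothetical helper.

```python
import difflib


def version_diff(old_text, new_text, old_label="v1", new_label="v2"):
    """Return a unified diff between two versions of a document's text."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    ))


v1 = "Install with pip.\nRequires Python 3.8.\n"
v2 = "Install with pip.\nRequires Python 3.9.\n"
diff = version_diff(v1, v2)
print(diff)
```

Lines prefixed with `-` exist only in the older version and lines prefixed with `+` only in the newer one; identical versions produce an empty diff.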

💡 CLI Examples

# Track all PDFs in a directory recursively
ragversion track ./documents --pattern "*.pdf" --recursive

# List recently changed documents
ragversion list --recent 10

# Export version history
ragversion export --format json --output history.json

# Show configuration
ragversion config show

📖 See all CLI commands

ragversion --help

Commands:
  init        Initialize RAGVersion in the current directory
  track       Track files or directories for changes
  list        List tracked documents
  history     Show version history for a document
  diff        Show differences between versions
  show        Show detailed version information
  migrate     Run database migrations
  config      Manage configuration
  export      Export version history
  import      Import version history
  status      Show tracking status
  validate    Validate configuration

🖥️ Web Interface

RAGVersion includes a simple, clean web interface perfect for content teams and non-technical users:

# Start the server (includes web UI + REST API)
ragversion serve

# Access the web interface
# Dashboard: http://localhost:6699/
# Documents: http://localhost:6699/documents

Web UI Features:

  • 📊 Dashboard - Statistics overview, top documents, file type distribution
  • 📄 Document Browser - Search, filter, and browse all tracked documents
  • 📈 Version History - View complete version timeline for each document
  • 🔍 Visual Diff Viewer - Compare versions with color-coded changes
  • 🎨 Clean Design - Modern, responsive interface with intuitive navigation
  • 🚀 Fast & Lightweight - Server-side rendering, no heavy JavaScript frameworks

Perfect for:

  • Content managers who need to track document changes visually
  • Non-technical stakeholders who want quick insights
  • Teams that prefer web interfaces over command-line tools
  • Quick browsing and searching through document history

Screenshots:

Dashboard View:

  • Total documents, versions, storage used
  • Recent activity metrics
  • Top documents by version count
  • File type distribution chart

Document Detail:

  • Complete version history
  • Change statistics and frequency
  • Visual badges for change types
  • Version comparison links

🌐 REST API

RAGVersion also provides a comprehensive REST API for programmatic access from any language or platform:

# Start the API server (same command as web UI)
ragversion serve

# Custom host and port
ragversion serve --host localhost --port 5000

# Development mode with auto-reload
ragversion serve --reload

API Features:

  • 🚀 FastAPI-based - Modern async web framework
  • 📖 Auto documentation - Swagger UI at /api/docs, ReDoc at /api/redoc
  • 🔐 Optional auth - API key authentication via X-API-Key header
  • 🌐 CORS support - Configurable cross-origin requests
  • ⚡ Async operations - Non-blocking request handling
  • ✅ Type validation - Automatic request/response validation with Pydantic

Quick API Examples

Python:

import requests

BASE_URL = "http://localhost:6699/api"

# Track a file
response = requests.post(
    f"{BASE_URL}/track/file",
    json={"file_path": "/path/to/doc.pdf"}
)
event = response.json()

# List documents
docs = requests.get(
    f"{BASE_URL}/documents?limit=10"
).json()

# Get statistics
stats = requests.get(
    f"{BASE_URL}/statistics"
).json()

JavaScript:

const BASE_URL = "http://localhost:6699/api";

// Track a file
const response = await fetch(
  `${BASE_URL}/track/file`,
  {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({
      file_path: "/path/to/doc.pdf"
    })
  }
);
const event = await response.json();

// Get version history
const versions = await fetch(
  `${BASE_URL}/versions/document/${docId}`
).then(r => r.json());

cURL Examples:

# Track directory
curl -X POST http://localhost:6699/api/track/directory \
  -H "Content-Type: application/json" \
  -d '{"dir_path": "/docs", "patterns": ["*.md"]}'

# Get diff between versions
curl "http://localhost:6699/api/versions/document/<doc-id>/diff/1/3"

# Health check
curl http://localhost:6699/api/health

API Endpoints:

  • /api/documents - Document management (list, get, search, delete)
  • /api/versions - Version management (list, get, content, diff, restore)
  • /api/track - Tracking operations (file, directory)
  • /api/statistics - Analytics and statistics
  • /api/health - Server health check

See the API Guide for complete documentation.
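When API key authentication is enabled, each request needs the `X-API-Key` header mentioned in the feature list. The sketch below builds such a request with the standard library and does not send it; the `/track/file` path and JSON body follow the examples above, while `build_track_request` and the sample key are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:6699/api"


def build_track_request(file_path, api_key=None, base_url=BASE_URL):
    """Build a POST request for /track/file, adding X-API-Key when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/track/file",
        data=json.dumps({"file_path": file_path}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_track_request("/path/to/doc.pdf", api_key="example-key")
print(req.get_method(), req.full_url)
# To send it against a running server: urllib.request.urlopen(req)
```

Reading the key from an environment variable rather than hard-coding it keeps credentials out of source control.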


⏰ Batch Processing & Automation

Cron Job Example

Create a scheduled sync script:

#!/usr/bin/env python3
"""sync_documents.py - Cron job to sync documents"""

import asyncio
import logging
from ragversion import AsyncVersionTracker
from ragversion.storage import SupabaseStorage

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def sync_documents():
    """Sync all documents in the directory"""
    tracker = AsyncVersionTracker(
        storage=SupabaseStorage.from_env()
    )

    result = await tracker.track_directory(
        "./documents",
        patterns=["*.pdf", "*.docx"],
        recursive=True
    )

    logger.info(f"✅ Synced {len(result.successful)} documents")

    if result.failed:
        logger.error(f"❌ Failed to process {len(result.failed)} documents")
        for error in result.failed:
            logger.error(f"  - {error.file_path}: {error.error}")

if __name__ == "__main__":
    asyncio.run(sync_documents())

Schedule with Crontab

# Edit crontab
crontab -e

# Add this line to sync every hour
0 * * * * /path/to/venv/bin/python /path/to/sync_documents.py >> /var/log/ragversion.log 2>&1

# Or sync every 15 minutes
*/15 * * * * /path/to/venv/bin/python /path/to/sync_documents.py >> /var/log/ragversion.log 2>&1

Use with GitHub Actions

name: Sync Documents

on:
  schedule:
    - cron: '0 * * * *'  # Every hour
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ragversion[all]
      - name: Sync documents
        env:
          SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
          SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
        run: ragversion track ./documents

🏗️ Architecture

RAGVersion follows a modular, async-first architecture designed for production systems:

┌─────────────────────────────────────────────────────────────┐
│                    AsyncVersionTracker                      │
│                    (Core Tracking Engine)                   │
└─────────────────┬───────────────────────────┬───────────────┘
                  │                           │
    ┌─────────────▼──────────┐   ┌────────────▼───────────┐
    │   Storage Backends     │   │   Document Parsers     │
    │  - SQLite (default)    │   │  - PDF, DOCX, TXT      │
    │  - Supabase (cloud)    │   │  - Markdown, CSV       │
    │  - PostgreSQL (future) │   │  - Pluggable system    │
    └─────────────┬──────────┘   └────────────────────────┘
                  │
    ┌─────────────▼──────────────────────────┐
    │         Core Components                │
    │  • Change Detector (hashing & diffs)   │
    │  • Event System (async callbacks)      │
    │  • Batch Processor (error handling)    │
    │  • Compression & Storage optimization  │
    └────────────────────────────────────────┘

Key Components

| Component           | Description                                                 | Status    |
|---------------------|-------------------------------------------------------------|-----------|
| AsyncVersionTracker | Core tracking engine with async/await support               | ✅ Stable |
| Storage Backends    | Abstract storage interface (SQLite and Supabase implemented) | ✅ Stable |
| Document Parsers    | Pluggable parsers for various file formats                  | ✅ Stable |
| Change Detector     | Content hashing and intelligent diff generation             | ✅ Stable |
| Event System        | Async callbacks for change notifications                    | ✅ Stable |
| Batch Processor     | Resilient batch processing with error recovery              | ✅ Stable |

🛡️ Error Handling

RAGVersion uses a continue-on-error approach designed for production resilience:

result = await tracker.track_directory("./documents")

# Detailed error reporting
print(f"✅ Successful: {len(result.successful)}")
print(f"❌ Failed: {len(result.failed)}")

# Handle failures gracefully
if result.failed:
    for error in result.failed:
        print(f"Failed: {error.file_path}")
        print(f"Reason: {error.error}")
        print(f"Type: {error.error_type}")  # "parsing" | "storage" | "validation" | "unknown"

        # Retry logic for specific error types
        if error.error_type == "parsing":
            # Handle parsing errors
            pass
        elif error.error_type == "storage":
            # Handle storage errors
            pass

Error Types

| Error Type | Description                      | Recommended Action                |
|------------|----------------------------------|-----------------------------------|
| parsing    | Failed to parse document content | Check file format, update parsers |
| storage    | Failed to save to database       | Check connection, retry           |
| validation | Invalid configuration or input   | Fix configuration                 |
| unknown    | Unexpected error                 | Review logs, report issue         |
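The continue-on-error behaviour can be pictured as a batch runner that records failures instead of aborting on the first exception. The sketch below uses `asyncio.gather(..., return_exceptions=True)` for that; `BatchResult`, `process_all`, and `fake_track` are hypothetical stand-ins and do not mirror RAGVersion's internal classes.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    successful: list = field(default_factory=list)
    failed: list = field(default_factory=list)


async def process_all(items, worker):
    """Run worker on every item; record failures instead of aborting the batch."""
    outcomes = await asyncio.gather(
        *(worker(item) for item in items), return_exceptions=True
    )
    result = BatchResult()
    for item, outcome in zip(items, outcomes):
        if isinstance(outcome, Exception):
            # Keep the item, the error class name, and the message for reporting.
            result.failed.append((item, type(outcome).__name__, str(outcome)))
        else:
            result.successful.append(outcome)
    return result


async def fake_track(path):
    """Stand-in for tracking a file; paths ending in .bad fail to parse."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return path


result = asyncio.run(process_all(["a.pdf", "b.bad", "c.docx"], fake_track))
print(result.successful)
print(result.failed)
```

One unreadable file therefore costs a single entry in `failed` rather than the whole run.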

🧪 Testing

RAGVersion includes testing utilities for integration tests:

from ragversion import AsyncVersionTracker
from ragversion.testing import MockStorage, create_sample_documents

async def test_integration():
    # Use in-memory mock storage for testing
    tracker = AsyncVersionTracker(storage=MockStorage())

    # Generate sample test documents
    docs = create_sample_documents(count=10, file_type="pdf")

    # Test your integration
    results = []
    for doc in docs:
        result = await tracker.track(doc.path)
        results.append(result)

    # Assertions
    assert len(results) == 10
    assert all(r.change_type == "created" for r in results)

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=ragversion --cov-report=html

# Run specific test file
pytest tests/test_tracker.py

📚 Documentation

📖 Complete Guide

DOCUMENTATION.md - Comprehensive documentation covering:

  • ✅ Complete feature walkthrough
  • ✅ Integration guides (LangChain, LlamaIndex)
  • ✅ API reference
  • ✅ Advanced use cases
  • ✅ Best practices
  • ✅ Troubleshooting
  • ✅ Architecture deep dive

🚀 Roadmap

future-enhancements.md - What's coming next:

  • 🔮 New framework integrations
  • 🔮 Storage backend expansions
  • 🔮 Advanced document parsers
  • 🔮 Enterprise features
  • 🔮 Performance optimizations
  • 🔮 Security enhancements

💼 Use Cases

📚 Documentation Versioning

Track changes to documentation sites and keep chatbots up-to-date:

# Monitor docs directory and update vector store
async def monitor_docs():
    tracker = AsyncVersionTracker(storage=SupabaseStorage.from_env())
    sync = LangChainSync(tracker=tracker, vectorstore=qdrant)

    while True:
        result = await sync.sync_directory("./docs")
        print(f"Updated {len(result.successful)} documents")
        await asyncio.sleep(300)  # Check every 5 minutes

🤖 AI Chatbot Knowledge Base

Maintain fresh knowledge bases for AI assistants:

# Sync changed documents to chatbot's knowledge base
async def update_chatbot_kb():
    tracker = AsyncVersionTracker(storage=SupabaseStorage.from_env())
    result = await tracker.track_directory("./knowledge-base")

    for change in result.successful:
        if change.change_type in ["created", "modified"]:
            await chatbot.update_knowledge(change.document)

📊 Data Pipeline Monitoring

Track document changes in data processing pipelines:

# Monitor source documents and trigger pipeline
async def pipeline_monitor():
    tracker = AsyncVersionTracker(storage=SupabaseStorage.from_env())

    result = await tracker.track_directory("./data/input")

    # Trigger processing only for changed files
    for change in result.successful:
        if change.change_type != "unchanged":
            await trigger_pipeline(change.document)

🔐 Compliance & Audit Trails

Maintain complete audit trails of document changes:

# Track all changes with full history
async def audit_documents():
    tracker = AsyncVersionTracker(storage=SupabaseStorage.from_env())

    # Get complete version history
    history = await tracker.get_history(document_id)

    for version in history:
        print(f"{version.timestamp}: {version.change_type}")
        print(f"Content hash: {version.content_hash}")

⚡ Performance

RAGVersion is built for production scale:

| Metric           | Performance               |
|------------------|---------------------------|
| Batch Processing | 100+ docs/second          |
| Memory Footprint | < 50MB base               |
| Storage Overhead | ~10% (with compression)   |
| Async Operations | Non-blocking I/O          |
| Scalability      | Horizontal scaling ready  |

Optimization Tips

# Use batch processing for large directories
result = await tracker.track_directory(
    "./documents",
    batch_size=50,  # Process 50 files at a time
    max_workers=4   # Use 4 parallel workers
)

# Enable compression to reduce storage
tracker = AsyncVersionTracker(
    storage=SupabaseStorage.from_env(),
    compression="gzip"  # or "zstd" for better compression
)
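A `max_workers` limit is commonly implemented by bounding how many tasks run at once. The sketch below shows the general pattern with `asyncio.Semaphore`; `run_bounded` is a hypothetical helper, and RAGVersion's batch processor may be built differently.

```python
import asyncio


async def run_bounded(items, worker, max_workers=4):
    """Run worker over items with at most max_workers running concurrently."""
    semaphore = asyncio.Semaphore(max_workers)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    active = 0
    peak = 0

    async def worker(n):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)   # track the highest concurrency observed
        await asyncio.sleep(0.01)  # simulate I/O such as parsing or a DB write
        active -= 1
        return n * 2

    results = await run_bounded(range(10), worker, max_workers=4)
    return results, peak


results, peak = asyncio.run(demo())
print(results, "peak concurrency:", peak)
```

Raising the limit helps when work is I/O-bound; for CPU-heavy parsing, a larger value mostly adds contention.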

📋 Requirements

  • Python: 3.9+
  • Database: none required (SQLite is the default); a Supabase account is optional for cloud storage (free tier available at supabase.com)
  • Optional: Redis for caching (future feature)

📜 License

RAGVersion is released under the MIT License. See LICENSE file for details.

MIT License - Free for personal and commercial use
✅ Private use   ✅ Commercial use   ✅ Modification   ✅ Distribution

🤝 Contributing

We welcome contributions! Here's how you can help:

🐛 Report Bugs

Found a bug? Open an issue

✨ Request Features

Have an idea? Start a discussion

🔧 Submit PRs

Want to contribute code? Read guidelines

🌟 Show Your Support

If you find RAGVersion helpful, please consider:

  • ⭐ Starring this repository
  • 🐦 Sharing on social media
  • 📝 Writing a blog post about your experience
  • 💬 Contributing to discussions
  • 🐛 Reporting bugs or suggesting features

📞 Support & Community

  • 📖 Documentation - Read Docs
  • 🐛 Issues - Report Bug
  • 💬 Discussions - Join Discussion
  • 📦 PyPI - View Package


🗺️ Roadmap

Check out our detailed roadmap to see what's coming next!

High Priority Features:

  • 🔄 Real-time file watching
  • 💾 SQLite & PostgreSQL backends
  • 🔗 Haystack & Weaviate integrations
  • 🌐 REST API server
  • 🖥️ Web UI dashboard
  • 🔒 Enterprise security features

Made with ❤️ by the RAGVersion Team
