OCR Vector Database - Document parsing, semantic segmentation, and vector search

OCR Vector DB

A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.

Features

  • Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
  • Semantic segmentation: Intelligent grouping of text, code, and images
  • Multi-view embeddings: Separate embeddings for text, code, images, tables
  • Parent-child hierarchy: Context-aware retrieval with parent documents
  • RAG support: LLM-powered question answering over your documents
  • PostgreSQL + pgvector: Scalable vector storage with HNSW indexing

Installation

From PyPI

pip install ocr-vector-db

Using uv (recommended)

uv add ocr-vector-db

Prerequisites

  • Python 3.12+
  • PostgreSQL with pgvector extension
  • Google API key (for Gemini embeddings/LLM) or Voyage API key

Database Setup

Start PostgreSQL with pgvector using Docker:

docker run -d \
  --name pgvector \
  -e POSTGRES_USER=langchain \
  -e POSTGRES_PASSWORD=langchain \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Or use docker-compose (if provided in the repository).

Quick Start

1. Configure Environment

Create a .env file:

# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
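
Before ingesting, it can help to sanity-check the .env file. The following is a minimal sketch, not part of the package: parse_env and check_env are hypothetical helpers, and the pairing of each provider with its API key variable is inferred from the settings table rather than taken from the project's code.

```python
from urllib.parse import urlsplit

# Variables marked "(required)" in the settings table.
REQUIRED = ("PG_CONN", "COLLECTION_NAME")

# Assumed pairing of provider name and API key variable.
PROVIDER_KEYS = {"gemini": "GOOGLE_API_KEY", "voyage": "VOYAGE_API_KEY"}


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means nothing was found."""
    problems = [f"missing {name}" for name in REQUIRED if not values.get(name)]
    provider = values.get("EMBEDDING_PROVIDER", "voyage")  # documented default
    key_name = PROVIDER_KEYS.get(provider)
    if key_name is None:
        problems.append(f"unknown EMBEDDING_PROVIDER {provider!r}")
    elif not values.get(key_name):
        problems.append(f"{provider} provider needs {key_name}")
    if values.get("PG_CONN"):
        url = urlsplit(values["PG_CONN"])
        if not url.hostname or not url.path.strip("/"):
            problems.append("PG_CONN lacks a host or database name")
    return problems


SAMPLE = """\
# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
"""

if __name__ == "__main__":
    print(check_env(parse_env(SAMPLE)) or "configuration looks complete")
```

The check only looks at the shape of the values; it does not contact PostgreSQL or validate the API key.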

2. Ingest Documents

# Ingest PDF files
myrag ingest documents/*.pdf

# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run

3. Search

# Direct search
myrag search "vector database optimization"

# Search with filters
myrag search "async function" --view code --language javascript

# JSON output
myrag search "machine learning" --json

4. RAG (Question Answering)

# Ask a question
myrag rag "What is the main topic of this document?"

# With sources
myrag rag "How does the authentication work?" --sources

5. Interactive REPL

# Start search REPL
myrag

# Start RAG REPL
myrag --rag

CLI Commands

myrag (default)

Start the interactive REPL for search or RAG queries.

myrag              # Search mode
myrag --rag        # RAG mode (LLM-powered)
myrag --view code  # Search mode with a default view filter (code)

myrag search

Run a single search query.

myrag search "query" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of results (default: 5)
  --no-context          Disable parent context expansion
  --json                Output JSON format
  --verbose, -v         Enable verbose logging
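
The search options above can also be driven from a script. This is a rough sketch under two assumptions: myrag is on PATH, and --json writes nothing but a JSON document to stdout. build_search_cmd and run_search are hypothetical names, and the structure of the returned JSON is deliberately not assumed.

```python
import json
import subprocess

# View names accepted by --view, as listed in the options above.
VIEWS = {"text", "code", "image", "caption", "table", "figure"}


def build_search_cmd(query, view=None, language=None, top_k=None, context=True):
    """Assemble an argv list for `myrag search` from the documented flags."""
    cmd = ["myrag", "search", query, "--json"]
    if view is not None:
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view}")
        cmd += ["--view", view]
    if language is not None:
        cmd += ["--language", language]
    if top_k is not None:
        cmd += ["--top-k", str(top_k)]
    if not context:
        cmd.append("--no-context")
    return cmd


def run_search(query, **options):
    """Run the CLI and decode its output (assumes stdout is pure JSON)."""
    result = subprocess.run(
        build_search_cmd(query, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Only prints the command; call run_search() once a database is populated.
    print(build_search_cmd("async function", view="code", language="javascript", top_k=3))
```

Passing the arguments as a list avoids shell quoting problems when the query contains spaces or quotes.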

myrag ingest

Ingest documents into the vector database.

myrag ingest FILE [FILE ...] [options]

Options:
  --dry-run      Parse only, no database writes
  --no-cache     Disable OCR cache (re-process all pages)

myrag rag

Ask a question using RAG (Retrieval-Augmented Generation).

myrag rag "question" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of context documents (default: 5)
  --sources             Show source documents in response
  --verbose, -v         Enable verbose logging

myrag quality

Inspect data quality and statistics.

myrag quality

Output includes:

  • Document/concept/fragment/embedding counts
  • View distribution
  • Orphan entity check
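
To illustrate what an orphan check and a view distribution might compute, here is a small in-memory sketch. The row layout (id, document_id, concept_id, fragment_id, view) is an assumption made for the example; the actual database schema and the output of myrag quality may differ.

```python
from collections import Counter


def find_orphans(documents, concepts, fragments, embeddings):
    """Return ids of rows whose parent id does not exist (hypothetical schema)."""
    doc_ids = {d["id"] for d in documents}
    concept_ids = {c["id"] for c in concepts}
    fragment_ids = {f["id"] for f in fragments}
    return {
        "concepts": [c["id"] for c in concepts if c["document_id"] not in doc_ids],
        "fragments": [f["id"] for f in fragments if f["concept_id"] not in concept_ids],
        "embeddings": [e["id"] for e in embeddings if e["fragment_id"] not in fragment_ids],
    }


def view_distribution(fragments):
    """Count fragments per view (text, code, image, ...)."""
    return Counter(f["view"] for f in fragments)


# Tiny made-up dataset: fragment f3 points at a concept that does not exist.
DOCUMENTS = [{"id": "d1"}]
CONCEPTS = [{"id": "c1", "document_id": "d1"}]
FRAGMENTS = [
    {"id": "f1", "concept_id": "c1", "view": "text"},
    {"id": "f2", "concept_id": "c1", "view": "code"},
    {"id": "f3", "concept_id": "c9", "view": "text"},
]
EMBEDDINGS = [{"id": "e1", "fragment_id": "f1"}, {"id": "e2", "fragment_id": "f2"}]

if __name__ == "__main__":
    print(find_orphans(DOCUMENTS, CONCEPTS, FRAGMENTS, EMBEDDINGS))
    print(view_distribution(FRAGMENTS))
```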

Configuration

All settings are configured via environment variables. See .env.example for the complete list.

Key Settings

Variable            Description                                        Default
PG_CONN             PostgreSQL connection string                       (required)
COLLECTION_NAME     Vector store collection name                       (required)
EMBEDDING_PROVIDER  gemini or voyage                                   voyage
GOOGLE_API_KEY      Google API key for Gemini                          -
VOYAGE_API_KEY      Voyage AI API key                                  -
EMBEDDING_DIM       Embedding dimension                                768
PARENT_MODE         Grouping mode: unit, page, section, page_section   unit
ENABLE_IMAGE_OCR    Enable Gemini Vision OCR for PDFs                  true
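
The defaults in the table can be mirrored in a typed settings object. The Settings class below is a hypothetical illustration and not the package's own configuration loader; the accepted spellings for boolean values are an assumption.

```python
import os
from dataclasses import dataclass

PARENT_MODES = ("unit", "page", "section", "page_section")


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str = "voyage"
    embedding_dim: int = 768
    parent_mode: str = "unit"
    enable_image_ocr: bool = True

    @classmethod
    def from_env(cls, env=os.environ) -> "Settings":
        """Build settings from a mapping, applying the documented defaults."""
        settings = cls(
            pg_conn=env["PG_CONN"],  # KeyError if a required variable is missing
            collection_name=env["COLLECTION_NAME"],
            embedding_provider=env.get("EMBEDDING_PROVIDER", "voyage"),
            embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
            parent_mode=env.get("PARENT_MODE", "unit"),
            enable_image_ocr=_as_bool(env.get("ENABLE_IMAGE_OCR", "true")),
        )
        if settings.parent_mode not in PARENT_MODES:
            raise ValueError(f"PARENT_MODE must be one of {PARENT_MODES}")
        return settings


if __name__ == "__main__":
    demo = {"PG_CONN": "postgresql+psycopg://localhost/vectordb", "COLLECTION_NAME": "docs"}
    print(Settings.from_env(demo))
```

Accepting any mapping instead of reading os.environ directly keeps the loader easy to test.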

Architecture

Document -> Concept -> Fragment -> Embedding
  • Document: Source file (PDF, Markdown, text)
  • Concept: Semantic unit (related paragraphs, code blocks)
  • Fragment: Individual piece of content (text paragraph, code block, image)
  • Embedding: Vector representation for similarity search

Multi-View Strategy

Documents are segmented into distinct views:

  • text: Natural language paragraphs
  • code: Code blocks with language detection
  • image: Image references with alt text
  • caption: Figure/table captions
  • table, figure: Structured content

Each view can be filtered during search for targeted retrieval.
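
View filtering amounts to narrowing the candidate set before ranking by similarity. The sketch below does this in memory with cosine similarity as a stand-in; in the real system the ranking is done by PostgreSQL with pgvector, and the distance metric used there is not stated here, so treat cosine as an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, view=None, top_k=5):
    """Filter rows by view, then return the top_k most similar rows."""
    candidates = [row for row in index if view is None or row["view"] == view]
    ranked = sorted(
        candidates,
        key=lambda row: cosine(query_vec, row["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up two-dimensional vectors, purely for illustration.
INDEX = [
    {"id": "p1", "view": "text", "vector": [1.0, 0.0]},
    {"id": "c1", "view": "code", "vector": [0.9, 0.1]},
    {"id": "c2", "view": "code", "vector": [0.0, 1.0]},
]

if __name__ == "__main__":
    print([row["id"] for row in search(INDEX, [1.0, 0.0])])
    print([row["id"] for row in search(INDEX, [1.0, 0.0], view="code", top_k=1)])
```

Without a filter the text paragraph ranks first; restricting the view to code removes it from the candidates, the same effect --view code has on the command line.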

Development

Install from Source (uv)

git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db

# Install with uv (recommended)
uv sync --all-extras

# Or with pip
pip install -e ".[dev,test]"

Build Package

# Using uv
uv build

# Or using the build package (pip install build)
python -m build

Run CLI (development)

# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf

# Or after pip install -e .
myrag --help

Run Tests

uv run pytest
# or
pytest

Sync Lock File

uv lock

License

MIT License - see LICENSE for details.
