OCR Vector Database - Document parsing, semantic segmentation, and vector search

OCR Vector DB

A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.

Features

  • Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
  • Semantic segmentation: Intelligent grouping of text, code, and images
  • Multi-view embeddings: Separate embeddings for text, code, images, tables
  • Parent-child hierarchy: Context-aware retrieval with parent documents
  • RAG support: LLM-powered question answering over your documents
  • PostgreSQL + pgvector: Scalable vector storage with HNSW indexing

Installation

From PyPI

pip install ocr-vector-db

Using uv (recommended)

uv add ocr-vector-db

Prerequisites

  • Python 3.12+
  • PostgreSQL with pgvector extension
  • Google API key (for Gemini embeddings/LLM) or Voyage API key

Database Setup

Start PostgreSQL with pgvector using Docker:

docker run -d \
  --name pgvector \
  -e POSTGRES_USER=langchain \
  -e POSTGRES_PASSWORD=langchain \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Alternatively, use docker-compose if the repository provides a compose file.
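The credentials, database name, and port passed to the container are the same values that end up in the PG_CONN connection string. As a minimal sketch of that mapping (the helper name `build_pg_conn` is hypothetical and not part of this package), the URL can be assembled like this:

```python
from urllib.parse import quote


def build_pg_conn(user: str, password: str, database: str,
                  host: str = "localhost", port: int = 5432) -> str:
    """Assemble a SQLAlchemy-style URL that selects the psycopg driver."""
    # Percent-encode credentials so characters such as '@' or ':' stay unambiguous.
    return (
        f"postgresql+psycopg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Values taken from the docker run command (POSTGRES_USER, POSTGRES_PASSWORD,
# POSTGRES_DB, and the published port 5432).
pg_conn = build_pg_conn("langchain", "langchain", "vectordb")
print(pg_conn)
```

If you change any of the container's environment variables or the published port, the connection string has to change with them.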

Quick Start

1. Configure Environment

Create a .env file:

# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
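To catch a missing variable before running the CLI, a small check along these lines may help. It is only a sketch: `parse_env` and `missing_required` are hypothetical helpers, the parser handles plain KEY=VALUE lines only, and the package itself may load its configuration differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# PG_CONN and COLLECTION_NAME are the two variables marked as required.
REQUIRED = ("PG_CONN", "COLLECTION_NAME")


def missing_required(values: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


sample = """
# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
"""
print(missing_required(parse_env(sample)))
```

The sample deliberately omits COLLECTION_NAME, so that name is reported as missing.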

2. Ingest Documents

# Ingest PDF files
myrag ingest documents/*.pdf

# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run

3. Search

# Direct search
myrag search "vector database optimization"

# Search with filters
myrag search "async function" --view code --language javascript

# JSON output
myrag search "machine learning" --json

4. RAG (Question Answering)

# Ask a question
myrag rag "What is the main topic of this document?"

# With sources
myrag rag "How does the authentication work?" --sources

5. Interactive REPL

# Start search REPL
myrag

# Start RAG REPL
myrag --rag

CLI Commands

myrag (default)

Start the interactive REPL for search or RAG queries.

myrag              # Search mode
myrag --rag        # RAG mode (LLM-powered)
myrag --view code  # Search mode with a default view filter

myrag search

Run a single search query.

myrag search "query" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of results (default: 5)
  --no-context          Disable parent context expansion
  --json                Output JSON format
  --verbose, -v         Enable verbose logging
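When calling the CLI from a script, the options above map onto an argument list. The sketch below builds that list from the documented flags only; `build_search_argv` is a hypothetical helper, not something the package provides.

```python
import shlex

# The choices accepted by --view, as listed in the options above.
VIEWS = {"text", "code", "image", "caption", "table", "figure"}


def build_search_argv(query, view=None, language=None, top_k=None,
                      no_context=False, as_json=False):
    """Build the argument list for a single `myrag search` call."""
    if view is not None and view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    argv = ["myrag", "search", query]
    if view is not None:
        argv += ["--view", view]
    if language is not None:
        argv += ["--language", language]
    if top_k is not None:
        argv += ["--top-k", str(top_k)]
    if no_context:
        argv.append("--no-context")
    if as_json:
        argv.append("--json")
    return argv


argv = build_search_argv("async function", view="code",
                         language="javascript", as_json=True)
print(shlex.join(argv))
```

The list can be handed to `subprocess.run(argv, capture_output=True, text=True)`. The structure of the `--json` output is not described here, so inspect it before relying on specific fields.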

myrag ingest

Ingest documents into the vector database.

myrag ingest FILE [FILE ...] [options]

Options:
  --dry-run      Parse only, no database writes
  --no-cache     Disable OCR cache (re-process all pages)

myrag rag

Ask a question using RAG (Retrieval-Augmented Generation).

myrag rag "question" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of context documents (default: 5)
  --sources             Show source documents in response
  --verbose, -v         Enable verbose logging

myrag quality

Inspect data quality and statistics.

myrag quality

Output includes:

  • Document/concept/fragment/embedding counts
  • View distribution
  • Orphan entity check
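To illustrate what such a report involves, here is a rough in-memory sketch. The `FragmentRow` type and `quality_report` function are assumptions made for this example; the actual command queries the database, and its schema is not shown here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FragmentRow:
    """Hypothetical stand-in for one stored fragment."""
    id: str
    concept_id: str
    view: str


def quality_report(concept_ids: set[str], fragments: list[FragmentRow]) -> dict:
    """Summarize counts, view distribution, and fragments without a parent concept."""
    views = Counter(f.view for f in fragments)
    # A fragment is treated as an orphan when its parent concept does not exist.
    orphans = [f.id for f in fragments if f.concept_id not in concept_ids]
    return {
        "concepts": len(concept_ids),
        "fragments": len(fragments),
        "views": dict(views),
        "orphan_fragments": orphans,
    }


report = quality_report(
    {"c1"},
    [
        FragmentRow("f1", "c1", "text"),
        FragmentRow("f2", "c1", "code"),
        FragmentRow("f3", "c9", "text"),  # parent concept "c9" is missing
    ],
)
print(report)
```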

Configuration

All settings are configured via environment variables. See .env.example for the complete list.

Key Settings

Variable            Description                                        Default
PG_CONN             PostgreSQL connection string                       (required)
COLLECTION_NAME     Vector store collection name                       (required)
EMBEDDING_PROVIDER  gemini or voyage                                   voyage
GOOGLE_API_KEY      Google API key for Gemini                          -
VOYAGE_API_KEY      Voyage AI API key                                  -
EMBEDDING_DIM       Embedding dimension                                768
PARENT_MODE         Grouping mode: unit, page, section, page_section   unit
ENABLE_IMAGE_OCR    Enable Gemini Vision OCR for PDFs                  true
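The defaults above can be mirrored in a typed settings object, which is handy in tests and scripts. This is a sketch under stated assumptions: `Settings` and `load_settings` are hypothetical names, and the rule that the selected provider's API key must be set is inferred from the prerequisites rather than confirmed by the package.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumption: each provider needs its matching API key to be present.
PROVIDER_KEYS = {"gemini": "GOOGLE_API_KEY", "voyage": "VOYAGE_API_KEY"}
PARENT_MODES = {"unit", "page", "section", "page_section"}


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str = "voyage"
    embedding_dim: int = 768
    parent_mode: str = "unit"
    enable_image_ocr: bool = True


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate an environment-like mapping and apply the documented defaults."""
    for name in ("PG_CONN", "COLLECTION_NAME"):
        if not env.get(name):
            raise ValueError(f"{name} is required")
    provider = env.get("EMBEDDING_PROVIDER", "voyage")
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"EMBEDDING_PROVIDER must be one of {sorted(PROVIDER_KEYS)}")
    if not env.get(PROVIDER_KEYS[provider]):
        raise ValueError(f"{PROVIDER_KEYS[provider]} is required for provider {provider}")
    parent_mode = env.get("PARENT_MODE", "unit")
    if parent_mode not in PARENT_MODES:
        raise ValueError(f"PARENT_MODE must be one of {sorted(PARENT_MODES)}")
    return Settings(
        pg_conn=env["PG_CONN"],
        collection_name=env["COLLECTION_NAME"],
        embedding_provider=provider,
        embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
        parent_mode=parent_mode,
        enable_image_ocr=env.get("ENABLE_IMAGE_OCR", "true").lower() == "true",
    )
```

Passing `os.environ` as `env` validates the current process environment. Note that EMBEDDING_DIM presumably has to match the output size of the chosen embedding model.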

Architecture

The data model is a four-level hierarchy:

Document -> Concept -> Fragment -> Embedding

  • Document: Source file (PDF, Markdown, text)
  • Concept: Semantic unit (related paragraphs, code blocks)
  • Fragment: Individual piece of content (text paragraph, code block, image)
  • Embedding: Vector representation for similarity search
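The hierarchy can be pictured with a few dataclasses. The class and field names below are illustrative assumptions, not the package's actual models; `parent_context` shows one way a matched fragment could be expanded to its parent concept for context-aware retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    id: str
    view: str                      # e.g. "text", "code", "image"
    content: str
    embedding: list[float] = field(default_factory=list)


@dataclass
class Concept:
    id: str
    fragments: list[Fragment] = field(default_factory=list)


@dataclass
class Document:
    path: str
    concepts: list[Concept] = field(default_factory=list)


def parent_context(doc: Document, fragment_id: str) -> str:
    """Return the joined content of the concept that owns the given fragment."""
    for concept in doc.concepts:
        if any(f.id == fragment_id for f in concept.fragments):
            return "\n".join(f.content for f in concept.fragments)
    raise KeyError(fragment_id)


doc = Document(
    path="report.pdf",
    concepts=[
        Concept("c1", [
            Fragment("f1", "text", "The retry helper backs off exponentially."),
            Fragment("f2", "code", "delay = base * 2 ** attempt"),
        ]),
    ],
)
print(parent_context(doc, "f2"))
```

A search hit on the code fragment `f2` then returns the explanatory paragraph alongside it.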

Multi-View Strategy

Documents are segmented into distinct views:

  • text: Natural language paragraphs
  • code: Code blocks with language detection
  • image: Image references with alt text
  • caption: Figure/table captions
  • table, figure: Structured content

Each view can be filtered during search for targeted retrieval.
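View filtering amounts to narrowing the candidate set before ranking by similarity. The sketch below uses a brute-force cosine ranking over toy in-memory vectors as a stand-in for a pgvector query; it is not the package's implementation, and the item layout is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], items: list[dict], view=None, top_k: int = 5) -> list[dict]:
    """Keep only items in the requested view, then rank by similarity to the query."""
    candidates = [it for it in items if view is None or it["view"] == view]
    ranked = sorted(candidates,
                    key=lambda it: cosine(query_vec, it["embedding"]),
                    reverse=True)
    return ranked[:top_k]


items = [
    {"id": "t1", "view": "text", "embedding": [1.0, 0.0]},
    {"id": "c1", "view": "code", "embedding": [0.9, 0.1]},
    {"id": "c2", "view": "code", "embedding": [0.0, 1.0]},
]
print([it["id"] for it in search([1.0, 0.0], items, view="code")])
```

Without a view filter the text fragment `t1` ranks first for this query; with `view="code"` it is excluded before ranking.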

Development

Install from Source (uv)

git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db

# Install with uv (recommended)
uv sync --all-extras

# Or with pip
pip install -e ".[dev,test]"

Build Package

# Using uv
uv build

# Or using the build package (pip install build)
python -m build

Run CLI (development)

# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf

# Or after pip install -e .
myrag --help

Run Tests

uv run pytest
# or
pytest

Sync Lock File

uv lock

License

MIT License - see LICENSE for details.
