OCR Vector Database - Document parsing, semantic segmentation, and vector search

OCR Vector DB

A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.

Features

Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
Semantic segmentation: Groups related text, code, and images into semantic units (concepts)
Multi-view embeddings: Separate embeddings for text, code, images, tables
Parent-child hierarchy: Context-aware retrieval with parent documents
RAG support: LLM-powered question answering over your documents
PostgreSQL + pgvector: Scalable vector storage with HNSW indexing

Installation

From PyPI

pip install jw-my-rag

Using uv (recommended)

uv add jw-my-rag

Prerequisites

Python 3.12+
PostgreSQL with pgvector extension
Google API key (for Gemini embeddings/LLM) or Voyage API key

Database Setup

Start PostgreSQL with pgvector using Docker:

docker run -d \
  --name pgvector \
  -e POSTGRES_USER=langchain \
  -e POSTGRES_PASSWORD=langchain \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Or use docker-compose (if provided in the repository).

Quick Start

1. Configure Environment

Create a .env file:

# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini

2. Ingest Documents

# Ingest PDF files
myrag ingest documents/*.pdf

# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run

3. Search

# Direct search
myrag search "vector database optimization"

# Search with filters
myrag search "async function" --view code --language javascript

# JSON output
myrag search "machine learning" --json

4. RAG (Question Answering)

# Ask a question
myrag rag "What is the main topic of this document?"

# With sources
myrag rag "How does the authentication work?" --sources

5. Interactive REPL

# Start search REPL
myrag

# Start RAG REPL
myrag --rag

CLI Commands

`myrag` (default)

Start the interactive REPL for search or RAG queries.

myrag              # Search mode
myrag --rag        # RAG mode (LLM-powered)
myrag --view code  # Search mode with a default view filter

`myrag search`

Run a single search query.

myrag search "query" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of results (default: 5)
  --no-context          Disable parent context expansion
  --json                Output JSON format
  --verbose, -v         Enable verbose logging
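
When searches are scripted from Python, the options above can be assembled into an argument list. The following is a minimal sketch, assuming `myrag` is installed and on the PATH. `build_search_command` and `run_search` are hypothetical helpers, not part of the package, and the `--json` output is returned unparsed because its schema is not described here.

```python
import subprocess

# Views listed under the --view option above.
VALID_VIEWS = {"text", "code", "image", "caption", "table", "figure"}


def build_search_command(query, view=None, language=None, top_k=5,
                         context=True, as_json=False):
    """Build the argv list for `myrag search` (hypothetical helper)."""
    if view is not None and view not in VALID_VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    cmd = ["myrag", "search", query, "--top-k", str(top_k)]
    if view is not None:
        cmd += ["--view", view]
    if language is not None:
        cmd += ["--language", language]
    if not context:
        cmd.append("--no-context")
    if as_json:
        cmd.append("--json")
    return cmd


def run_search(query, **options):
    """Run the CLI and return its raw stdout; requires myrag to be installed."""
    result = subprocess.run(build_search_command(query, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the query as a single list element avoids shell quoting problems for queries that contain spaces or quotes.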

`myrag ingest`

Ingest documents into the vector database.

myrag ingest FILE [FILE ...] [options]

Options:
  --dry-run      Parse only, no database writes
  --no-cache     Disable OCR cache (re-process all pages)

`myrag rag`

Ask a question using RAG (Retrieval-Augmented Generation).

myrag rag "question" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of context documents (default: 5)
  --sources             Show source documents in response
  --verbose, -v         Enable verbose logging
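
Conceptually, RAG retrieves the top-k matching passages and hands them to the LLM together with the question. The sketch below illustrates only that prompt-assembly step. `build_rag_prompt`, the prompt wording, and the `(source, text)` pair format are assumptions made for illustration, not the package's actual implementation.

```python
def build_rag_prompt(question, passages, top_k=5):
    """Assemble an LLM prompt from retrieved passages (illustrative only).

    `passages` is a list of (source, text) pairs, best match first.
    """
    selected = passages[:top_k]
    context_lines = [
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(selected, start=1)
    ]
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
    # The source list is what an option like --sources could display.
    sources = [source for source, _ in selected]
    return prompt, sources
```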

`myrag quality`

Inspect data quality and statistics.

myrag quality

Output includes:

Document/concept/fragment/embedding counts
View distribution
Orphan entity check
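
An orphan check of this kind looks for records whose parent is missing. A minimal in-memory sketch follows. The record shapes (`id`, `concept_id`, `fragment_id`, `view` keys) and the `quality_report` function are hypothetical; the real command presumably runs equivalent checks against PostgreSQL.

```python
from collections import Counter


def quality_report(concepts, fragments, embeddings):
    """Summarize counts, view distribution, and orphans (illustrative).

    concepts:   list of dicts with "id"
    fragments:  list of dicts with "id", "concept_id", "view"
    embeddings: list of dicts with "fragment_id"
    """
    concept_ids = {c["id"] for c in concepts}
    fragment_ids = {f["id"] for f in fragments}
    return {
        "counts": {
            "concepts": len(concepts),
            "fragments": len(fragments),
            "embeddings": len(embeddings),
        },
        "views": dict(Counter(f["view"] for f in fragments)),
        # Fragments pointing at a concept that does not exist.
        "orphan_fragments": [f["id"] for f in fragments
                             if f["concept_id"] not in concept_ids],
        # Embeddings pointing at a fragment that does not exist.
        "orphan_embeddings": [e["fragment_id"] for e in embeddings
                              if e["fragment_id"] not in fragment_ids],
    }
```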

Configuration

All settings are configured via environment variables. See .env.example for the complete list.

Key Settings

Variable	Description	Default
`PG_CONN`	PostgreSQL connection string	(required)
`COLLECTION_NAME`	Vector store collection name	(required)
`EMBEDDING_PROVIDER`	`gemini` or `voyage`	`voyage`
`GOOGLE_API_KEY`	Google API key for Gemini	-
`VOYAGE_API_KEY`	Voyage AI API key	-
`EMBEDDING_DIM`	Embedding dimension	768
`PARENT_MODE`	Grouping mode: `unit`, `page`, `section`, `page_section`	`unit`
`ENABLE_IMAGE_OCR`	Enable Gemini Vision OCR for PDFs	`true`
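
To catch configuration mistakes before running the CLI, the settings in the table can be validated up front. This is a minimal sketch: the `Settings` class and `load_settings` function are hypothetical, and two rules are assumptions rather than documented behavior (that the API key matching the selected provider is required, and that booleans are written as `true`/`false`).

```python
import os
from dataclasses import dataclass

PARENT_MODES = {"unit", "page", "section", "page_section"}


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str = "voyage"
    embedding_dim: int = 768
    parent_mode: str = "unit"
    enable_image_ocr: bool = True


def load_settings(env=None):
    """Read the key settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [k for k in ("PG_CONN", "COLLECTION_NAME") if not env.get(k)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))

    provider = env.get("EMBEDDING_PROVIDER", "voyage")
    key_var = {"gemini": "GOOGLE_API_KEY", "voyage": "VOYAGE_API_KEY"}.get(provider)
    if key_var is None:
        raise ValueError(f"EMBEDDING_PROVIDER must be gemini or voyage, got {provider!r}")
    # Assumption: the key for the selected provider must be present.
    if not env.get(key_var):
        raise ValueError(f"{key_var} is required when EMBEDDING_PROVIDER={provider}")

    parent_mode = env.get("PARENT_MODE", "unit")
    if parent_mode not in PARENT_MODES:
        raise ValueError(f"unknown PARENT_MODE: {parent_mode!r}")

    return Settings(
        pg_conn=env["PG_CONN"],
        collection_name=env["COLLECTION_NAME"],
        embedding_provider=provider,
        embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
        parent_mode=parent_mode,
        # Assumption: booleans are spelled "true"/"false".
        enable_image_ocr=env.get("ENABLE_IMAGE_OCR", "true").lower() == "true",
    )
```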

Architecture

Document -> Concept -> Fragment -> Embedding

Document: Source file (PDF, Markdown, text)
Concept: Semantic unit (related paragraphs, code blocks)
Fragment: Individual piece of content (text paragraph, code block, image)
Embedding: Vector representation for similarity search
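
The hierarchy above can be pictured as nested containers. The dataclasses below are a rough sketch of that shape only; the field names are hypothetical and do not reflect the package's actual models or database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Embedding:
    vector: list[float]


@dataclass
class Fragment:
    view: str       # e.g. "text", "code", "image"
    content: str
    embeddings: list[Embedding] = field(default_factory=list)


@dataclass
class Concept:
    fragments: list[Fragment] = field(default_factory=list)


@dataclass
class Document:
    path: str
    concepts: list[Concept] = field(default_factory=list)


def count_entities(doc):
    """Return (concepts, fragments, embeddings) counts for one document."""
    fragments = [f for c in doc.concepts for f in c.fragments]
    embeddings = [e for f in fragments for e in f.embeddings]
    return len(doc.concepts), len(fragments), len(embeddings)
```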

Multi-View Strategy

Documents are segmented into distinct views:

text: Natural language paragraphs
code: Code blocks with language detection
image: Image references with alt text
caption: Figure/table captions
table, figure: Structured content

Each view can be filtered during search for targeted retrieval.
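
View filtering amounts to narrowing the candidate set before ranking by similarity. The following brute-force sketch shows the idea in memory. It assumes cosine similarity as the metric and a `(view, content, vector)` tuple layout, both chosen for illustration; the package itself delegates search to PostgreSQL with pgvector and an HNSW index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, items, view=None, top_k=5):
    """Rank items by cosine similarity, optionally restricted to one view.

    `items` is a list of (view, content, vector) tuples.
    """
    candidates = [it for it in items if view is None or it[0] == view]
    ranked = sorted(candidates,
                    key=lambda it: cosine_similarity(query_vector, it[2]),
                    reverse=True)
    return [(v, content) for v, content, _ in ranked[:top_k]]
```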

Development

Install from Source

git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db

# Install with uv (recommended)
uv sync --all-extras

# Or with pip
pip install -e ".[dev,test]"

Build Package

# Using uv
uv build

# Or using the build package (pip install build)
python -m build

Run CLI (development)

# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf

# Or after pip install -e .
myrag --help

Run Tests

uv run pytest
# or
pytest

Sync Lock File

uv lock

License

MIT License - see LICENSE for details.

Release history

0.1.1: Jan 26, 2026
0.1.0: Jan 26, 2026 (this version)

Download files

Source Distribution

jw_my_rag-0.1.0.tar.gz (394.3 kB), uploaded Jan 26, 2026, Source

Built Distribution

jw_my_rag-0.1.0-py3-none-any.whl (99.2 kB), uploaded Jan 26, 2026, Python 3

File details

Details for the file jw_my_rag-0.1.0.tar.gz.

File metadata

Download URL: jw_my_rag-0.1.0.tar.gz
Upload date: Jan 26, 2026
Size: 394.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for jw_my_rag-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a1e7d3c0283127b7466208710731db3145ce48f6e9059e8b63374790d8088f40`
MD5	`23237a09c7252703e0bc47bc4a041237`
BLAKE2b-256	`ecc35f6de1ce77cdf07b1864d382288da9142beae0efdbfee12531584489cae3`
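
A downloaded file can be checked against the published SHA256 digest with the standard library. The helper names below (`sha256_of_file`, `verify_download`) are illustrative, not part of the package.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

For example, `verify_download("jw_my_rag-0.1.0.tar.gz", "a1e7d3c0283127b7466208710731db3145ce48f6e9059e8b63374790d8088f40")` should return `True` for an intact copy of the source distribution.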

File details

Details for the file jw_my_rag-0.1.0-py3-none-any.whl.

File metadata

Download URL: jw_my_rag-0.1.0-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 99.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for jw_my_rag-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a7f0e00ad2371acaf3c78c05ebb092943eb577de4c46cbd41613eff304eab191`
MD5	`38d42e1fb99ab9b0f0308d80c70d7aee`
BLAKE2b-256	`16c4e22e13ba0fa620bb6ac3a0e1cfc90df3364ba1564daa3d57ffeddbc98e66`
