OCR Vector Database - Document parsing, semantic segmentation, and vector search
OCR Vector DB
A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.
Features
- Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
- Semantic segmentation: Intelligent grouping of text, code, and images
- Multi-view embeddings: Separate embeddings for text, code, images, tables
- Parent-child hierarchy: Context-aware retrieval with parent documents
- RAG support: LLM-powered question answering over your documents
- PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
Installation
From PyPI
pip install ocr-vector-db
Using uv (recommended)
uv add ocr-vector-db
Prerequisites
- Python 3.12+
- PostgreSQL with pgvector extension
- Google API key (for Gemini embeddings/LLM) or Voyage API key
Database Setup
Start PostgreSQL with pgvector using Docker:
docker run -d \
--name pgvector \
-e POSTGRES_USER=langchain \
-e POSTGRES_PASSWORD=langchain \
-e POSTGRES_DB=vectordb \
-p 5432:5432 \
pgvector/pgvector:pg16
Or use docker-compose (if provided in the repository).
Quick Start
1. Configure Environment
Create a .env file:
# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
2. Ingest Documents
# Ingest PDF files
myrag ingest documents/*.pdf
# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run
3. Search
# Direct search
myrag search "vector database optimization"
# Search with filters
myrag search "async function" --view code --language javascript
# JSON output
myrag search "machine learning" --json
4. RAG (Question Answering)
# Ask a question
myrag rag "What is the main topic of this document?"
# With sources
myrag rag "How does the authentication work?" --sources
5. Interactive REPL
# Start search REPL
myrag
# Start RAG REPL
myrag --rag
CLI Commands
myrag (default)
Start the interactive REPL for search or RAG queries.
myrag # Search mode
myrag --rag # RAG mode (LLM-powered)
myrag --view code # Start the REPL with a default view filter
myrag search
Run a single search query.
myrag search "query" [options]
Options:
--view {text,code,image,caption,table,figure} Filter by content type
--language LANG Filter by programming language
--top-k N, -k N Number of results (default: 5)
--no-context Disable parent context expansion
--json Output JSON format
--verbose, -v Enable verbose logging
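The --no-context option turns off parent context expansion. The README does not show how expansion is implemented, so the following is only a minimal in-memory sketch of the idea: a matched fragment is returned together with the text of its parent unit. The Hit class, the expand_with_parents function, and the dictionary layout are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A matched child fragment (hypothetical shape, for illustration only)."""
    fragment_id: str
    parent_id: str
    content: str
    score: float


def expand_with_parents(hits, parents, use_context=True):
    """Attach each hit's parent text unless context expansion is disabled."""
    results = []
    for hit in hits:
        entry = {"fragment": hit.content, "score": hit.score}
        if use_context:
            # A missing parent yields None rather than raising.
            entry["context"] = parents.get(hit.parent_id)
        results.append(entry)
    return results


# Hypothetical data: one parent unit and one matching child fragment.
parents = {"c1": "Section 2: Index tuning. HNSW trades build time for recall."}
hits = [Hit("f1", "c1", "HNSW trades build time for recall.", 0.91)]

with_context = expand_with_parents(hits, parents)
without_context = expand_with_parents(hits, parents, use_context=False)
```

With expansion on, each result carries a context entry; with use_context=False (the analogue of --no-context here), only the fragment and its score are returned.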
myrag ingest
Ingest documents into the vector database.
myrag ingest FILE [FILE ...] [options]
Options:
--dry-run Parse only, no database writes
--no-cache Disable OCR cache (re-process all pages)
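The README does not describe how the OCR cache is keyed or stored. As a rough sketch of what a page-level cache with a disable switch might look like, the example below keys results by the SHA-256 of the page bytes. The PageOcrCache class and the fake_ocr stand-in are hypothetical and are not part of the package.

```python
import hashlib


class PageOcrCache:
    """In-memory OCR result cache keyed by page content hash (illustrative)."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}

    def get_or_run(self, page_bytes, ocr_fn):
        # With the cache disabled, every page is re-processed.
        if not self.enabled:
            return ocr_fn(page_bytes)
        key = hashlib.sha256(page_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = ocr_fn(page_bytes)
        return self._store[key]


calls = []


def fake_ocr(page_bytes):
    """Stand-in for a real OCR call; records each invocation."""
    calls.append(page_bytes)
    return page_bytes.decode("utf-8").upper()


cache = PageOcrCache()
first = cache.get_or_run(b"page one", fake_ocr)
second = cache.get_or_run(b"page one", fake_ocr)  # served from the cache
```

Constructing PageOcrCache(enabled=False) mirrors the effect described for --no-cache: the OCR function runs on every call.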
myrag rag
Ask a question using RAG (Retrieval-Augmented Generation).
myrag rag "question" [options]
Options:
--view {text,code,image,caption,table,figure} Filter by content type
--language LANG Filter by programming language
--top-k N, -k N Number of context documents (default: 5)
--sources Show source documents in response
--verbose, -v Enable verbose logging
myrag quality
Inspect data quality and statistics.
myrag quality
Output includes:
- Document/concept/fragment/embedding counts
- View distribution
- Orphan entity check
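To make the orphan entity check concrete, here is a small sketch under the assumption that an orphan is an entity whose parent reference points at a missing row. The find_orphans function and the ID-to-parent dictionaries are hypothetical; the actual command presumably runs equivalent queries against PostgreSQL.

```python
def find_orphans(documents, concepts, fragments, embeddings):
    """Return IDs of entities whose parent reference points at nothing.

    `documents` is a set of document IDs. The other arguments map an
    entity ID to its parent ID. All shapes are hypothetical.
    """
    return {
        "concepts": sorted(c for c, doc in concepts.items() if doc not in documents),
        "fragments": sorted(f for f, c in fragments.items() if c not in concepts),
        "embeddings": sorted(e for e, f in embeddings.items() if f not in fragments),
    }


documents = {"doc1"}
concepts = {"c1": "doc1", "c2": "doc_missing"}
fragments = {"f1": "c1", "f2": "c9"}
embeddings = {"e1": "f1", "e2": "f2", "e3": "f7"}

report = find_orphans(documents, concepts, fragments, embeddings)
print(report)
```

Note that this sketch only checks the immediate parent: e2 is not reported even though its fragment f2 is itself an orphan.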
Configuration
All settings are configured via environment variables. See .env.example for the complete list.
Key Settings
| Variable | Description | Default |
|---|---|---|
| PG_CONN | PostgreSQL connection string | (required) |
| COLLECTION_NAME | Vector store collection name | (required) |
| EMBEDDING_PROVIDER | gemini or voyage | voyage |
| GOOGLE_API_KEY | Google API key for Gemini | - |
| VOYAGE_API_KEY | Voyage AI API key | - |
| EMBEDDING_DIM | Embedding dimension | 768 |
| PARENT_MODE | Grouping mode: unit, page, section, page_section | unit |
| ENABLE_IMAGE_OCR | Enable Gemini Vision OCR for PDFs | true |
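The settings above can be read and validated in a few lines. The sketch below is not the package's own loader: Settings and load_settings are hypothetical names, and the rule that the selected provider must have its API key set is an assumption. Variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str
    embedding_dim: int
    parent_mode: str
    enable_image_ocr: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [k for k in ("PG_CONN", "COLLECTION_NAME") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")

    provider = env.get("EMBEDDING_PROVIDER", "voyage")
    if provider not in ("gemini", "voyage"):
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider}")

    # Assumption: the selected provider needs its own API key to be set.
    key_name = "GOOGLE_API_KEY" if provider == "gemini" else "VOYAGE_API_KEY"
    if not env.get(key_name):
        raise ValueError(f"{key_name} is required when EMBEDDING_PROVIDER={provider}")

    return Settings(
        pg_conn=env["PG_CONN"],
        collection_name=env["COLLECTION_NAME"],
        embedding_provider=provider,
        embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
        parent_mode=env.get("PARENT_MODE", "unit"),
        enable_image_ocr=env.get("ENABLE_IMAGE_OCR", "true").lower() == "true",
    )


settings = load_settings({
    "PG_CONN": "postgresql+psycopg://langchain:langchain@localhost:5432/vectordb",
    "COLLECTION_NAME": "my_documents",
    "EMBEDDING_PROVIDER": "gemini",
    "GOOGLE_API_KEY": "your-api-key-here",
})
```

Passing a plain dictionary instead of os.environ keeps the loader easy to exercise in tests.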
Architecture
Document -> Concept -> Fragment -> Embedding
- Document: Source file (PDF, Markdown, text)
- Concept: Semantic unit (related paragraphs, code blocks)
- Fragment: Individual piece of content (text paragraph, code block, image)
- Embedding: Vector representation for similarity search
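The hierarchy above can be pictured as nested records. The dataclasses below are a hypothetical illustration of that shape, not the package's actual models; field names such as source, view, and vector are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Embedding:
    vector: List[float]


@dataclass
class Fragment:
    view: str  # e.g. "text", "code", "image"
    content: str
    embedding: Optional[Embedding] = None


@dataclass
class Concept:
    fragments: List[Fragment] = field(default_factory=list)


@dataclass
class Document:
    source: str
    concepts: List[Concept] = field(default_factory=list)

    def iter_fragments(self):
        """Yield every fragment in document order."""
        for concept in self.concepts:
            yield from concept.fragments


doc = Document(
    source="report.pdf",
    concepts=[
        Concept(fragments=[
            Fragment("text", "Connection pooling reduces latency."),
            Fragment("code", "pool = create_pool(size=10)"),
        ]),
        Concept(fragments=[Fragment("text", "Results are summarized below.")]),
    ],
)

views = [fragment.view for fragment in doc.iter_fragments()]
```

Grouping a paragraph and its code block under one Concept is what lets retrieval return related pieces together.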
Multi-View Strategy
Documents are segmented into distinct views:
- text: Natural language paragraphs
- code: Code blocks with language detection
- image: Image references with alt text
- caption: Figure/table captions
- table, figure: Structured content
Each view can be filtered during search for targeted retrieval.
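As an in-memory illustration of view filtering, the sketch below restricts candidates to one view and then ranks them by cosine similarity. In the real system this ranking is presumably done by pgvector inside PostgreSQL; the search function, the row layout, and the two-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, rows, view=None, top_k=5):
    """Filter rows by view (if given), then return the top_k most similar."""
    candidates = [r for r in rows if view is None or r["view"] == view]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical stored fragments with toy 2-dimensional embeddings.
rows = [
    {"id": "f1", "view": "text", "vector": [1.0, 0.0]},
    {"id": "f2", "view": "code", "vector": [0.9, 0.1]},
    {"id": "f3", "view": "code", "vector": [0.0, 1.0]},
]

top_code = search([1.0, 0.0], rows, view="code", top_k=1)
top_any = search([1.0, 0.0], rows, top_k=1)
```

Filtering before ranking means a --view code query never spends its top-k slots on text fragments, even when those score higher.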
Development
Install from Source
git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db
# Install with uv (recommended)
uv sync --all-extras
# Or with pip
pip install -e ".[dev,test]"
Build Package
# Using uv
uv build
# Or using pip
python -m build
Run CLI (development)
# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf
# Or after pip install -e .
myrag --help
Run Tests
uv run pytest
# or
pytest
Sync Lock File
uv lock
License
MIT License - see LICENSE for details.