Repository of docling documents for RAG

Docs2DB

Build a RAG database from documents. Docs2DB processes documents into chunks and embeddings, loads them into PostgreSQL with pgvector, and produces portable SQL dumps.

What it does:

  • Ingests documents (PDF, DOCX, XLSX, HTML, MD, CSV, etc.) using Docling
  • Generates contextual chunks with LLM assistance
  • Creates embeddings (Granite 30M by default)
  • Loads into PostgreSQL with pgvector
  • Produces portable ragdb_dump.sql files

What it's for:

  • Creating databases for RAG systems that use docs2db-api

Installation

uv tool install docs2db

Requirements: Docker or Podman (for database management)

Quickstart

One command:

docs2db pipeline /path/to/your/documents

This starts a database, processes everything, and creates ragdb_dump.sql.

Next steps: See docs2db-api to use your database for RAG search. Follow one of its demos to use it with Llama Stack or integrate it into your agent.

Database Configuration

Configuration precedence (highest to lowest):

  1. CLI arguments: --host, --port, --db, --user, --password
  2. Environment variables: POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
  3. DATABASE_URL: postgresql://user:pass@host:port/database
  4. postgres-compose.yml in current directory
  5. Defaults: localhost:5432, user=postgres, password=postgres, db=ragdb

Examples:

# Use defaults (docs2db db-start creates everything)
docs2db load

# Environment variables
export POSTGRES_HOST=prod.example.com
export POSTGRES_DB=mydb
docs2db load

# DATABASE_URL (cloud providers)
export DATABASE_URL="postgresql://user:pass@host:5432/db"
docs2db load

# CLI arguments
docs2db load --host localhost --db mydb

Note: Don't mix DATABASE_URL with individual POSTGRES_* variables.
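
The precedence order above can be illustrated with a short Python sketch. This is not Docs2DB's actual implementation: the function name resolve_db_config and the dictionary keys are hypothetical, and the postgres-compose.yml level is left out for brevity.

```python
import os
from urllib.parse import urlsplit

# Lowest-priority values, taken from the "Defaults" entry above.
DEFAULTS = {
    "host": "localhost",
    "port": 5432,
    "db": "ragdb",
    "user": "postgres",
    "password": "postgres",
}

# Maps each setting to the environment variable named in the list above.
ENV_NAMES = {
    "host": "POSTGRES_HOST",
    "port": "POSTGRES_PORT",
    "db": "POSTGRES_DB",
    "user": "POSTGRES_USER",
    "password": "POSTGRES_PASSWORD",
}


def resolve_db_config(cli_args=None, env=None):
    """Hypothetical helper: merge settings from lowest to highest priority."""
    cli_args = cli_args or {}
    env = os.environ if env is None else env
    config = dict(DEFAULTS)

    # DATABASE_URL overrides the defaults.
    url = env.get("DATABASE_URL")
    if url:
        parts = urlsplit(url)
        from_url = {
            "host": parts.hostname,
            "port": parts.port,
            "db": parts.path.lstrip("/") or None,
            "user": parts.username,
            "password": parts.password,
        }
        config.update({k: v for k, v in from_url.items() if v is not None})

    # Individual POSTGRES_* variables override DATABASE_URL.
    for key, name in ENV_NAMES.items():
        if env.get(name):
            config[key] = env[name]

    # CLI arguments override everything else.
    config.update({k: v for k, v in cli_args.items() if v is not None})

    config["port"] = int(config["port"])
    return config
```

In this sketch, setting both DATABASE_URL and a single POSTGRES_* variable yields a configuration assembled from two sources, which is the kind of surprise the note above warns against.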

Commands

Database Lifecycle

docs2db db-start      # Start PostgreSQL (Docker/Podman)
docs2db db-stop       # Stop PostgreSQL
docs2db db-logs       # View logs (-f to follow)
docs2db db-destroy    # Delete all data (prompts for confirmation)
docs2db db-status     # Check connection and stats

Pipeline

docs2db pipeline <path>              # Complete workflow
# Custom output file, skip contextual chunks (faster), different embedding model
docs2db pipeline <path> \
  --output-file my-rag.sql \
  --skip-context \
  --model e5-small-v2

Individual Steps

These are the same steps pipeline runs.

docs2db ingest <path>                # Ingest documents
docs2db chunk                        # Generate chunks
docs2db embed                        # Generate embeddings
docs2db load                         # Load into database
docs2db db-dump                      # Create SQL dump
docs2db db-restore <file>            # Restore from dump
docs2db audit                        # Check content directory

Each processing step (ingest, chunk, embed) creates files in docs2db_content/ that the next step reads.
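
For scripting, the individual steps can be assembled as argument lists. The following is a minimal sketch: build_step_commands is a hypothetical helper, it uses only the subcommands and flags shown in this README, and it assumes the pipeline order is ingest, chunk, embed, load, db-dump.

```python
def build_step_commands(path, skip_context=False, model=None):
    """Return the individual docs2db commands as argv lists, in step order."""
    chunk = ["docs2db", "chunk"]
    if skip_context:
        # Skip contextual generation for a faster run.
        chunk.append("--skip-context")

    embed = ["docs2db", "embed"]
    if model:
        # Select a non-default embedding model.
        embed += ["--model", model]

    return [
        ["docs2db", "ingest", str(path)],
        chunk,
        embed,
        ["docs2db", "load"],
        ["docs2db", "db-dump"],
    ]
```

Each list can be passed to subprocess.run(argv, check=True) so that a failing step stops the sequence before the next step reads incomplete files.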

Processing Options

Chunking

# Fast (skip contextual generation)
docs2db chunk --skip-context

# Custom LLM provider
docs2db chunk --context-model qwen2.5:7b-instruct              # Ollama
# OpenAI
docs2db chunk --openai-url https://api.openai.com \
  --context-model gpt-4o-mini
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com # WatsonX

# Patterns and directories
docs2db chunk --pattern "docs/**/*.json"
docs2db chunk --content-dir my-content

Configuration via environment variables or a .env file is also supported. Run docs2db chunk --help for all options.

Embedding

# Different model
docs2db embed --model granite-30m-english

# Patterns and directories
docs2db embed --pattern "docs/**/*.chunks.json"
docs2db embed --content-dir my-content

Run docs2db embed --help for all options.

Content Directory

The content directory (default: docs2db_content/) stores:

  • Ingested documents in Docling JSON format
  • .chunks.json files with text chunks
  • .gran.json files with embeddings

Important: Commit this directory to version control. It contains expensive preprocessing that can be reused across updates. Docs2DB automatically skips files that haven't changed.
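
Before committing, it can help to see what the content directory holds. The sketch below counts files by the suffixes listed above. summarize_content_dir is a hypothetical helper, not part of Docs2DB, and it assumes that any other .json file is an ingested Docling document.

```python
from collections import Counter
from pathlib import Path


def summarize_content_dir(content_dir="docs2db_content"):
    """Count files in the content directory by the suffixes described above."""
    counts = Counter()
    for path in Path(content_dir).rglob("*.json"):
        if not path.is_file():
            continue
        if path.name.endswith(".chunks.json"):
            counts["chunks"] += 1
        elif path.name.endswith(".gran.json"):
            counts["embeddings"] += 1
        else:
            # Assumption: remaining .json files are Docling documents.
            counts["other_json"] += 1
    return dict(counts)
```

A document that has been ingested but not yet chunked or embedded shows up as a gap between the three counts.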

RAG Features

  • Contextual chunks - LLM-generated context for each chunk (following Anthropic's contextual retrieval approach)
  • Vector embeddings - Multiple models: granite-30m, e5-small-v2, slate-125m, noinstruct-small
  • Full-text search - PostgreSQL tsvector with GIN indexing for BM25
  • Vector similarity - pgvector extension with HNSW indexes
  • Schema versioning - Track metadata and schema changes
  • Portable dumps - Self-contained SQL files that work anywhere with docs2db-api
  • Incremental processing - Automatically skips unchanged files

Troubleshooting

"Neither Docker nor Podman found"

Install Docker (https://docs.docker.com/get-docker/) or Podman (https://podman.io/getting-started/installation)

"Database connection refused"

docs2db db-start      # Start the database
docs2db db-status     # Check connection

"Module not found" errors

Use uv tool install docs2db

Using as a Library

uv add docs2db

from docs2db import ingest_file, ingest_from_content

# Your code here

See docs2db --help for the full Python API.

Development

git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install

# Run tests
make test

# Run all checks
pre-commit run --all-files

See CONTRIBUTING.md for details.

Serving Your Database

Use docs2db-api to serve your RAG database with a REST API.

License

See LICENSE for details.
