Repository of docling documents for RAG

Docs2DB

Build a RAG database from documents. Docs2DB processes documents into chunks and embeddings, loads them into PostgreSQL with pgvector, and produces portable SQL dumps.

What it does:

Ingests documents (PDF, DOCX, XLSX, HTML, MD, CSV, etc.) using Docling
Generates contextual chunks with LLM assistance
Creates embeddings (Granite 30M by default)
Loads into PostgreSQL with pgvector
Produces portable ragdb_dump.sql files

What it's for:

Creating databases for RAG systems that use docs2db-api

Installation

uv tool install docs2db

Requirements: Docker or Podman (for database management)

Quickstart

One command:

docs2db pipeline /path/to/your/documents

This starts a database, runs every processing step (ingest, chunk, embed, load), and writes ragdb_dump.sql.

Next steps: See docs2db-api to use your database for RAG search. Follow one of its demos to use it with Llama Stack or integrate it into your agent.

Database Configuration

Configuration precedence (highest to lowest):

1. CLI arguments: --host, --port, --db, --user, --password
2. Environment variables: POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
3. DATABASE_URL: postgresql://user:pass@host:port/database
4. postgres-compose.yml in the current directory
5. Defaults: localhost:5432, user=postgres, password=postgres, db=ragdb
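
The precedence order above can be sketched as a merge from the lowest-priority source to the highest. This is an illustration only, not the Docs2DB implementation: resolve_db_config and parse_database_url are hypothetical helpers, and the postgres-compose.yml step is left out because its format is not shown here.

```python
from urllib.parse import urlsplit

# Defaults as documented in the precedence list.
DEFAULTS = {"host": "localhost", "port": 5432, "db": "ragdb",
            "user": "postgres", "password": "postgres"}

# Environment variable name for each setting (names from the list above).
ENV_NAMES = {"host": "POSTGRES_HOST", "port": "POSTGRES_PORT",
             "db": "POSTGRES_DB", "user": "POSTGRES_USER",
             "password": "POSTGRES_PASSWORD"}


def parse_database_url(url):
    """Split postgresql://user:pass@host:port/database into settings."""
    parts = urlsplit(url)
    found = {"host": parts.hostname, "port": parts.port,
             "db": parts.path.lstrip("/") or None,
             "user": parts.username, "password": parts.password}
    # Keep only the components that were actually present in the URL.
    return {key: value for key, value in found.items() if value is not None}


def resolve_db_config(cli_args, env):
    """Hypothetical helper: later updates override earlier ones."""
    config = dict(DEFAULTS)
    # postgres-compose.yml would be merged here; omitted in this sketch.
    if env.get("DATABASE_URL"):
        config.update(parse_database_url(env["DATABASE_URL"]))
    for key, name in ENV_NAMES.items():
        if env.get(name):
            config[key] = env[name]
    # CLI arguments win; None means the flag was not given.
    config.update({k: v for k, v in cli_args.items() if v is not None})
    config["port"] = int(config["port"])
    return config
```

Calling resolve_db_config({"db": "mydb"}, {"POSTGRES_HOST": "prod.example.com"}) keeps the default port, user, and password while taking the host from the environment and the database name from the command line.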

Examples:

# Use defaults (docs2db db-start creates everything)
docs2db load

# Environment variables
export POSTGRES_HOST=prod.example.com
export POSTGRES_DB=mydb
docs2db load

# DATABASE_URL (cloud providers)
export DATABASE_URL="postgresql://user:pass@host:5432/db"
docs2db load

# CLI arguments
docs2db load --host localhost --db mydb

Note: Don't mix DATABASE_URL with individual POSTGRES_* variables.
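
A quick way to catch a mixed setup before running docs2db load is to inspect the environment yourself. The check below is a minimal sketch; mixed_db_settings is a hypothetical helper and not part of Docs2DB.

```python
import os

# Variable names taken from the Database Configuration list.
POSTGRES_VARS = ("POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_DB",
                 "POSTGRES_USER", "POSTGRES_PASSWORD")


def mixed_db_settings(env):
    """Return the POSTGRES_* names set alongside DATABASE_URL (empty if none)."""
    if not env.get("DATABASE_URL"):
        return []
    return [name for name in POSTGRES_VARS if env.get(name)]


conflicts = mixed_db_settings(os.environ)
if conflicts:
    print("Unset these or drop DATABASE_URL:", ", ".join(conflicts))
```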

Commands

Database Lifecycle

docs2db db-start      # Start PostgreSQL (Docker/Podman)
docs2db db-stop       # Stop PostgreSQL
docs2db db-logs       # View logs (-f to follow)
docs2db db-destroy    # Delete all data (prompts for confirmation)
docs2db db-status     # Check connection and stats

Pipeline

docs2db pipeline <path>              # Complete workflow
# Custom output file, skip contextual chunks (faster), different embedding model
docs2db pipeline <path> \
  --output-file my-rag.sql \
  --skip-context \
  --model e5-small-v2
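
If you drive the pipeline from a script, building the argument list in code avoids shell quoting problems. The sketch below uses only the flags shown above; pipeline_command is a hypothetical helper, not a Docs2DB function.

```python
import shlex


def pipeline_command(path, output_file=None, skip_context=False, model=None):
    """Hypothetical helper: build the argv for `docs2db pipeline`."""
    argv = ["docs2db", "pipeline", str(path)]
    if output_file:
        argv += ["--output-file", output_file]
    if skip_context:
        argv.append("--skip-context")
    if model:
        argv += ["--model", model]
    return argv


argv = pipeline_command("docs", output_file="my-rag.sql",
                        skip_context=True, model="e5-small-v2")
# Show the command exactly as a shell would need it.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would then run the pipeline and raise if it exits with an error.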

Individual Steps

These are the same steps pipeline runs.

docs2db ingest <path>                # Ingest documents
docs2db chunk                        # Generate chunks
docs2db embed                        # Generate embeddings
docs2db load                         # Load into database
docs2db db-dump                      # Create SQL dump
docs2db db-restore <file>            # Restore from dump
docs2db audit                        # Check content directory

Each processing step (ingest, chunk, embed) creates files in docs2db_content/ that the next step reads.
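
Because each step reads what the previous one wrote, a script that runs them one by one should stop at the first failure. The following is a sketch under two assumptions: the steps run in the order listed above, and the database is already up (docs2db db-start). step_commands and run_steps are hypothetical names.

```python
import subprocess


def step_commands(path):
    """argv lists for the processing steps, in the order listed above."""
    return [
        ["docs2db", "ingest", str(path)],
        ["docs2db", "chunk"],
        ["docs2db", "embed"],
        ["docs2db", "load"],
        ["docs2db", "db-dump"],
    ]


def run_steps(path):
    """Run each step in turn; check=True raises on the first failure."""
    for argv in step_commands(path):
        subprocess.run(argv, check=True)
```

Calling run_steps("/path/to/your/documents") raises subprocess.CalledProcessError as soon as one step exits with a non-zero status, so later steps never run on incomplete output.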

Processing Options

Chunking

# Fast (skip contextual generation)
docs2db chunk --skip-context

# Custom LLM provider
docs2db chunk --context-model qwen2.5:7b-instruct              # Ollama
docs2db chunk --openai-url https://api.openai.com \
  --context-model gpt-4o-mini                                  # OpenAI
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com # WatsonX

# Patterns and directories
docs2db chunk --pattern "docs/**/*.json"
docs2db chunk --content-dir my-content

Configuration via environment variables or a .env file is also supported. Run docs2db chunk --help for all options.

Embedding

# Select the embedding model explicitly
docs2db embed --model granite-30m-english

# Patterns and directories
docs2db embed --pattern "docs/**/*.chunks.json"
docs2db embed --content-dir my-content

Run docs2db embed --help for all options.

Content Directory

The content directory (default: docs2db_content/) stores:

Ingested documents in Docling JSON format
.chunks.json files with text chunks
.gran.json files with embeddings

Important: Commit this directory to version control. It holds the results of expensive preprocessing, which can be reused across updates. Docs2DB automatically skips files that haven't changed.
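
To see what a content directory holds before committing it, you can count files by suffix. This is a rough sketch, not the docs2db audit command: it assumes every other .json file is an ingested Docling document, and classify and inventory are hypothetical helpers.

```python
from collections import Counter
from pathlib import Path


def classify(path):
    """Map a file to one of the content kinds described above."""
    name = path.name
    if name.endswith(".chunks.json"):
        return "chunks"
    if name.endswith(".gran.json"):
        return "embeddings"
    if name.endswith(".json"):
        # Assumption: remaining JSON files are Docling documents.
        return "documents"
    return "other"


def inventory(content_dir="docs2db_content"):
    """Count files of each kind anywhere under the content directory."""
    root = Path(content_dir)
    return Counter(classify(p) for p in root.rglob("*") if p.is_file())
```

A directory where the "documents" count exceeds the "chunks" or "embeddings" count probably has files that still need docs2db chunk or docs2db embed.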

RAG Features

Contextual chunks - LLM-generated context for each chunk (Anthropic's contextual retrieval approach)
Vector embeddings - Multiple models: granite-30m, e5-small-v2, slate-125m, noinstruct-small
Full-text search - PostgreSQL tsvector with GIN indexing for BM25
Vector similarity - pgvector extension with HNSW indexes
Schema versioning - Track metadata and schema changes
Portable dumps - Self-contained SQL files that work anywhere with docs2db-api
Incremental processing - Automatically skips unchanged files
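
For intuition about the vector similarity feature, the sketch below ranks a few toy embeddings by cosine distance, which is what pgvector's <=> operator computes. Which distance Docs2DB actually queries with is not stated here, so treat the metric as an assumption; an HNSW index returns an approximation of this exact ranking without scanning every row.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, chunks, k=2):
    """Exact top-k chunk names by cosine distance to the query vector."""
    return sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))[:k]


# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(nearest([1.0, 0.1], chunks))
```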

Troubleshooting

"Neither Docker nor Podman found"

Install Docker (https://docs.docker.com/get-docker/) or Podman (https://podman.io/getting-started/installation)

"Database connection refused"

docs2db db-start      # Start the database
docs2db db-status     # Check connection
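
If the error persists, it can help to confirm that anything is listening on the database port at all. The check below is a minimal sketch that assumes the default localhost:5432; port_is_open is a hypothetical helper. An open port only shows that a server is listening, so docs2db db-status remains the real connection test.

```python
import socket


def port_is_open(host="localhost", port=5432, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print("PostgreSQL reachable:", port_is_open())
```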

"Module not found" errors

Install the CLI with uv tool install docs2db, which gives it an isolated environment with its own dependencies.

Using as a Library

uv add docs2db

from docs2db import ingest_file, ingest_from_content

# Your code here

See docs2db --help for the full Python API.

Development

git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install

# Run tests
make test

# Run all checks
pre-commit run --all-files

See CONTRIBUTING.md for details.

Serving Your Database

Use docs2db-api to serve your RAG database with a REST API.

License

See LICENSE for details.
