Repository of docling documents for RAG
Docs2DB
Build a RAG database from documents. Docs2DB processes documents into chunks and embeddings, loads them into PostgreSQL with pgvector, and produces portable SQL dumps.
What it does:
- Ingests documents (PDF, DOCX, XLSX, HTML, MD, CSV, etc.) using Docling
- Generates contextual chunks with LLM assistance
- Creates embeddings (Granite 30M by default)
- Loads into PostgreSQL with pgvector
- Produces portable ragdb_dump.sql files
What it's for:
- Creating databases for RAG systems that use docs2db-api
Installation
uv tool install docs2db
Requirements: Docker or Podman (for database management)
Quickstart
One command:
docs2db pipeline /path/to/your/documents
This starts a database, processes everything, and creates ragdb_dump.sql.
Next steps: See docs2db-api to use your database for RAG search. Follow one of its demos to use it with Llama Stack or integrate it into your agent.
Database Configuration
Configuration precedence (highest to lowest):
- CLI arguments: --host, --port, --db, --user, --password
- Environment variables: POSTGRES_HOST, POSTGRES_PORT, POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD
- DATABASE_URL: postgresql://user:pass@host:port/database
- postgres-compose.yml in current directory
- Defaults: localhost:5432, user=postgres, password=postgres, db=ragdb
Examples:
# Use defaults (docs2db db-start creates everything)
docs2db load
# Environment variables
export POSTGRES_HOST=prod.example.com
export POSTGRES_DB=mydb
docs2db load
# DATABASE_URL (cloud providers)
export DATABASE_URL="postgresql://user:pass@host:5432/db"
docs2db load
# CLI arguments
docs2db load --host localhost --db mydb
Note: Don't mix DATABASE_URL with individual POSTGRES_* variables.
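The precedence rules above can be sketched in Python. This is a minimal illustration, not Docs2DB's actual implementation: resolve_db_config is a hypothetical helper, the postgres-compose.yml layer is left out, and raising an error when DATABASE_URL is combined with POSTGRES_* variables is an assumption based on the note above.

```python
from urllib.parse import urlsplit

# Defaults as listed in this README.
DEFAULTS = {"host": "localhost", "port": 5432, "db": "ragdb",
            "user": "postgres", "password": "postgres"}

ENV_NAMES = {"host": "POSTGRES_HOST", "port": "POSTGRES_PORT",
             "db": "POSTGRES_DB", "user": "POSTGRES_USER",
             "password": "POSTGRES_PASSWORD"}


def resolve_db_config(cli_args, env):
    """Hypothetical helper: merge settings from lowest to highest precedence."""
    has_url = "DATABASE_URL" in env
    has_vars = any(name in env for name in ENV_NAMES.values())
    if has_url and has_vars:
        # Assumption: mixing the two styles is treated as an error.
        raise ValueError("Set DATABASE_URL or POSTGRES_* variables, not both")

    config = dict(DEFAULTS)
    if has_url:
        url = urlsplit(env["DATABASE_URL"])
        parsed = {"host": url.hostname, "port": url.port,
                  "db": url.path.lstrip("/") or None,
                  "user": url.username, "password": url.password}
        config.update({k: v for k, v in parsed.items() if v is not None})
    for key, name in ENV_NAMES.items():
        if name in env:
            config[key] = env[name]
    # CLI arguments win over everything else.
    config.update({k: v for k, v in cli_args.items() if v is not None})
    config["port"] = int(config["port"])
    return config
```

Calling resolve_db_config({"host": "localhost", "db": "mydb"}, os.environ) would mirror the last CLI example: the two CLI values override whatever the environment provides.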
Commands
Database Lifecycle
docs2db db-start # Start PostgreSQL (Docker/Podman)
docs2db db-stop # Stop PostgreSQL
docs2db db-logs # View logs (-f to follow)
docs2db db-destroy # Delete all data (prompts for confirmation)
docs2db db-status # Check connection and stats
Pipeline
docs2db pipeline <path> # Complete workflow
# Custom output file, no contextual chunks (faster), different embedding model
docs2db pipeline <path> \
--output-file my-rag.sql \
--skip-context \
--model e5-small-v2
Individual Steps
These are the same steps pipeline runs.
docs2db ingest <path> # Ingest documents
docs2db chunk # Generate chunks
docs2db embed # Generate embeddings
docs2db load # Load into database
docs2db db-dump # Create SQL dump
docs2db db-restore <file> # Restore from dump
docs2db audit # Check content directory
Each processing step (ingest, chunk, embed) creates files in docs2db_content/ that the next step reads.
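If you want to drive the individual steps from a script, one option is to build the command list in Python and run it with subprocess. The sketch below uses only commands listed above. The function names are hypothetical, and the step order (db-start first, db-dump last) is an assumption drawn from the Quickstart description of what pipeline does.

```python
import subprocess


def pipeline_steps(path, skip_context=False):
    """Return the docs2db commands for a step-by-step run, in order."""
    chunk = ["docs2db", "chunk"]
    if skip_context:
        chunk.append("--skip-context")
    return [
        ["docs2db", "db-start"],
        ["docs2db", "ingest", str(path)],
        chunk,
        ["docs2db", "embed"],
        ["docs2db", "load"],
        ["docs2db", "db-dump"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)
```

run_steps(pipeline_steps("/path/to/your/documents")) only prints the commands. Pass dry_run=False to execute them.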
Processing Options
Chunking
# Fast (skip contextual generation)
docs2db chunk --skip-context
# Custom LLM provider
docs2db chunk --context-model qwen2.5:7b-instruct # Ollama
docs2db chunk --openai-url https://api.openai.com \
--context-model gpt-4o-mini # OpenAI
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com # WatsonX
# Patterns and directories
docs2db chunk --pattern "docs/**/*.json"
docs2db chunk --content-dir my-content
Configuration via environment variables or a .env file is also supported. Run docs2db chunk --help for all options.
Embedding
# Different model
docs2db embed --model granite-30m-english
# Patterns and directories
docs2db embed --pattern "docs/**/*.chunks.json"
docs2db embed --content-dir my-content
Run docs2db embed --help for all options.
Content Directory
The content directory (default: docs2db_content/) stores:
- Ingested documents in Docling JSON format
- .chunks.json files with text chunks
- .gran.json files with embeddings
Important: Commit this directory to version control. It contains expensive preprocessing that can be reused across updates. Docs2DB automatically skips files that haven't changed.
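This README does not describe how Docs2DB detects unchanged files. As a rough illustration of the general idea only, the sketch below compares SHA-256 content hashes against a JSON manifest. The manifest file and both function names are hypothetical and may not match what Docs2DB does internally.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, manifest_path):
    """Return the paths whose contents differ from the recorded digests.

    The manifest is updated as a side effect, so a second call with
    the same unchanged files returns an empty list.
    """
    manifest_file = Path(manifest_path)
    seen = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    changed = []
    for path in paths:
        digest = file_digest(path)
        if seen.get(str(path)) != digest:
            changed.append(path)
            seen[str(path)] = digest
    manifest_file.write_text(json.dumps(seen, indent=2))
    return changed
```

Hashing contents rather than comparing modification times keeps the check stable across git checkouts, which matters if the content directory is committed to version control.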
RAG Features
- Contextual chunks - LLM-generated context for each chunk (Anthropic's approach)
- Vector embeddings - Multiple models: granite-30m, e5-small-v2, slate-125m, noinstruct-small
- Full-text search - PostgreSQL tsvector with GIN indexing for BM25
- Vector similarity - pgvector extension with HNSW indexes
- Schema versioning - Track metadata and schema changes
- Portable dumps - Self-contained SQL files that work anywhere with docs2db-api
- Incremental processing - Automatically skips unchanged files
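The contextual chunks feature can be pictured as follows: for each chunk, an LLM sees the whole document plus the chunk and returns a short situating context, which is prepended to the chunk before embedding and indexing. The sketch below is illustrative only. The prompt wording is not Docs2DB's actual prompt, and generate is a hypothetical callable standing in for whichever LLM provider you use.

```python
# Illustrative prompt; the wording Docs2DB actually uses may differ.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document "
    "to improve search retrieval. Answer with the context only."
)


def contextualize(document, chunks, generate):
    """Prefix each chunk with context returned by `generate` (any LLM call)."""
    results = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        # Fall back to the bare chunk if the model returns nothing.
        results.append(f"{context}\n\n{chunk}" if context else chunk)
    return results
```

Because each chunk needs its own LLM call, this step dominates processing time, which is why --skip-context is described as the faster option.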
Troubleshooting
"Neither Docker nor Podman found"
Install Docker (https://docs.docker.com/get-docker/) or Podman (https://podman.io/getting-started/installation)
"Database connection refused"
docs2db db-start # Start the database
docs2db db-status # Check connection
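If db-status still reports a refused connection, a quick independent check is whether anything is listening on the database port at all. The helper below is a hypothetical sketch using only the Python standard library; the default host and port are the defaults listed under Database Configuration.

```python
import socket


def port_open(host="localhost", port=5432, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A True result only shows that some process accepts connections on that port, not that it is PostgreSQL or that your credentials are valid.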
"Module not found" errors
Use uv tool install docs2db
Using as a Library
uv add docs2db
from docs2db import ingest_file, ingest_from_content
# Your code here
See docs2db --help for the full Python API.
Development
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install
# Run tests
make test
# Run all checks
pre-commit run --all-files
See CONTRIBUTING.md for details.
Serving Your Database
Use docs2db-api to serve your RAG database with a REST API.
License
See LICENSE for details.
File details
Details for the file docs2db-0.2.1.tar.gz.
File metadata
- Download URL: docs2db-0.2.1.tar.gz
- Upload date:
- Size: 47.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2f9e4fa85a763a43f7b43a5b6fc53cbe45b3e9df9145adf36702b540d0c026a7 |
| MD5 | da1961cde60fc48803eb64293aacc71d |
| BLAKE2b-256 | 743ad65311dcbb9825886647954833004925033b640bb88cabd03ae924405b47 |
File details
Details for the file docs2db-0.2.1-py3-none-any.whl.
File metadata
- Download URL: docs2db-0.2.1-py3-none-any.whl
- Upload date:
- Size: 53.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 27fa03d55b3b1939f8774d4c838e5b37d6eb44d84b1f478b64d590029c0c8e0e |
| MD5 | c1795cfdb73866e383f22ebdc50af0f1 |
| BLAKE2b-256 | 5bf70a4edbf5060d91e4e3544e04c40f1e7ac6d3ebc0864a69893d09e6c1bb64 |