A Python package for synthesizing and working with document data.

Docs2Synth

Documentation | License: MIT | Python 3.11+

Docs2Synth converts documents into structured data, synthesizes QA pairs from them, and trains retrievers on the resulting datasets.

Workflow

Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment

🚀 Quick Start: Automated Pipeline

Run the complete end-to-end pipeline with a single command:

docs2synth run

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.

Manual Step-by-Step Workflow

For more control, run each step individually:

# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app

Complete Workflow Guide →


Installation

PyPI Installation (Recommended)

CPU Version (includes all features + MCP server):

pip install "docs2synth[cpu]"  # quotes keep shells such as zsh from expanding the brackets

GPU Version (includes all features + MCP server):

# Installs a CPU build of PyTorch by default (upgrade to a CUDA build if needed)
pip install "docs2synth[gpu]"

# Optional: Add vLLM support (requires CUDA GPU)
# First install PyTorch with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Then install vLLM:
pip install "docs2synth[gpu,vllm]"

Minimal Install (CLI only, no ML/MCP features):

pip install docs2synth
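
After installing, a quick check can confirm that the interpreter and optional dependencies match the variant you intended. This is a rough sketch using only the standard library. It assumes the `cpu`/`gpu` extras provide the `torch` module and the `vllm` extra provides `vllm`; the helper names are illustrative and not part of Docs2Synth.

```python
import sys
from importlib.util import find_spec


def meets_python_requirement(version=None, minimum=(3, 11)):
    """Return True if the interpreter version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


def installed_modules(names):
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


# Assumption: these are the modules the extras are expected to pull in.
status = installed_modules(["torch", "vllm"])
print("Python 3.11+:", meets_python_requirement())
for name, present in status.items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

With the minimal install, both modules would be expected to show as missing.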

Development Setup

Use the setup script (installs uv + dependencies automatically):

# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth

# Run setup script
./setup.sh         # Unix/macOS/WSL
# setup.bat        # Windows

The script:

  • Installs uv (fast package manager)
  • Creates virtual environment
  • Installs dependencies (CPU or GPU)
  • Sets up config

Manual development setup:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Clone and setup
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows

# Install for development
uv pip install -e ".[cpu,dev]"  # or [gpu,dev] for GPU

# Setup config
cp config.example.yml config.yml
# Edit config.yml and add your API keys
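
A plain `cp` overwrites an existing `config.yml`, including any API keys already added to it. If the setup is scripted, a guard like the one below may help. It is a small sketch with a hypothetical helper name (`ensure_config`), using only the two file names mentioned above.

```python
import shutil
from pathlib import Path


def ensure_config(example="config.example.yml", target="config.yml"):
    """Copy the example config to `target` unless `target` already exists.

    Returns True if a new file was created, False if an existing one was kept.
    """
    example_path, target_path = Path(example), Path(target)
    if target_path.exists():
        return False  # keep the user's edited file (and its API keys)
    if not example_path.is_file():
        raise FileNotFoundError(f"example config not found: {example_path}")
    shutil.copyfile(example_path, target_path)
    return True
```

Run from the repository root, `ensure_config()` creates `config.yml` on the first call and leaves it alone afterwards.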

Features

  • Document Processing: Extract text/layout with Docling, PaddleOCR, PDFPlumber
  • QA Generation: Automatic question-answer pair generation with LLMs
  • Verification: Built-in verifiers for meaningfulness and correctness
  • Human Annotation: Streamlit UI for manual review
  • Retriever Training: Train LayoutLMv3-based retrievers
  • RAG Deployment: Deploy with naive or iterative strategies
  • MCP Integration: Expose as Model Context Protocol server

Configuration

Create config.yml from config.example.yml:

# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/

# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini

# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10

# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2
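
When reading this config from your own scripts, two details are worth checking: whether the API keys are still the example placeholders, and whether `learning_rate` arrived as a number. Some YAML loaders (PyYAML among them) read `1e-5` as the string "1e-5" because it has no decimal point. The sketch below works on an already-parsed dictionary, so it does not depend on any particular loader. The helper names are hypothetical, and how Docs2Synth itself parses the file is not covered here.

```python
def retriever_hyperparams(config):
    """Return (learning_rate, epochs) from a parsed config as (float, int)."""
    section = config.get("retriever", {})
    # Coerce explicitly: a loader may hand back "1e-5" as a string.
    learning_rate = float(section["learning_rate"])
    epochs = int(section["epochs"])
    if learning_rate <= 0 or epochs <= 0:
        raise ValueError("learning_rate and epochs must be positive")
    return learning_rate, epochs


def has_placeholder_keys(config):
    """Return names of API keys that still look like the example placeholders."""
    keys = config.get("agent", {}).get("keys", {})
    return [name for name, value in keys.items() if str(value).endswith("...")]


# A dict mirroring part of the example config, as a YAML loader might return it.
example = {
    "agent": {"keys": {"openai_api_key": "sk-...", "anthropic_api_key": "sk-ant-..."}},
    "retriever": {"learning_rate": "1e-5", "epochs": 10},
}
print(retriever_hyperparams(example))
print(has_placeholder_keys(example))
```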

Docker

# CPU
./scripts/build-docker.sh cpu

# GPU
./scripts/build-docker.sh gpu

See Docker Builds


Documentation

Full documentation: https://ai4wa.github.io/Docs2Synth/


Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest tests/ -v
  5. Run code quality checks: ./scripts/check.sh
  6. Submit a pull request

See Dependency Management for dev setup details.


License

MIT License - see LICENSE file for details.


Citation

If you use Docs2Synth in your research, please cite:

@software{docs2synth2024,
  title = {Docs2Synth: Document Processing and Retriever Training},
  author = {AI4WA Team},
  year = {2024},
  url = {https://github.com/AI4WA/Docs2Synth}
}

Download files

Source Distribution

docs2synth-1.0.4.tar.gz (261.5 kB)

Uploaded Source

Built Distribution

docs2synth-1.0.4-py3-none-any.whl (277.0 kB)

Uploaded Python 3

File details

Details for the file docs2synth-1.0.4.tar.gz.

File metadata

  • Download URL: docs2synth-1.0.4.tar.gz
  • Upload date:
  • Size: 261.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docs2synth-1.0.4.tar.gz
Algorithm    Hash digest
SHA256       aba5ab4d4cff60ffd16f262caf6a22eeefb051cc54dfac1b81a65b67bbacba8b
MD5          557d66d30771fc4f78072d6115eedb10
BLAKE2b-256  e4cd85fb0cf02e37570f462d6cb9758d1f212cd9548ad44147e1d33dedb91db9
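
To compare a downloaded file against the SHA256 digest listed above, something like the following can be used. It is a minimal sketch with the standard `hashlib` module. The function names are illustrative, and the expected digest is the one listed for docs2synth-1.0.4.tar.gz.

```python
import hashlib

# SHA256 digest listed above for docs2synth-1.0.4.tar.gz.
EXPECTED_SHA256 = "aba5ab4d4cff60ffd16f262caf6a22eeefb051cc54dfac1b81a65b67bbacba8b"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, `matches("docs2synth-1.0.4.tar.gz")` should return True for an intact copy of the source distribution.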

File details

Details for the file docs2synth-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: docs2synth-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 277.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docs2synth-1.0.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       8bc24dc26e0cfdcbb3c21d7b4a80e724a701670fa1dfceeccb363ec6e9972e78
MD5          fa334bccfec2d6890f79166f651236d9
BLAKE2b-256  69146a4a4ae63a63f2c0f9f8218bb0c9285ac0380585da4efbf56fd3a6d0d7b2
