A Python package for synthesizing and working with document data.
Docs2Synth
Docs2Synth converts documents, synthesizes QA data from them, and trains retrievers on the resulting datasets.
Workflow
Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment
🚀 Quick Start: Automated Pipeline
Run the complete end-to-end pipeline with a single command:
docs2synth run
This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.
Manual Step-by-Step Workflow
For more control, run each step individually:
# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/
# 2. Generate QA pairs
docs2synth qa batch
# 3. Verify quality
docs2synth verify batch
# 4. Annotate (opens UI)
docs2synth annotate
# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10
# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app
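The manual steps above can also be scripted. The following is a minimal sketch, not part of Docs2Synth itself: `STEPS` and `run_steps` are hypothetical names, and the command strings are copied from the list above. The annotation step is omitted because it opens a UI, and `docs2synth rag app` is omitted on the assumption that it starts a long-running app.

```python
import shlex
import subprocess

# Commands copied from the step-by-step workflow above, in order.
# Interactive steps (annotate, rag app) are intentionally left out.
STEPS = [
    "docs2synth preprocess data/raw/my_documents/",
    "docs2synth qa batch",
    "docs2synth verify batch",
    "docs2synth retriever preprocess",
    "docs2synth retriever train --mode standard --lr 1e-5 --epochs 10",
    "docs2synth rag ingest",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    `runner` is injectable so the loop can be exercised without the CLI
    installed. Returns the list of commands that completed successfully.
    """
    completed = []
    for step in steps:
        argv = shlex.split(step)  # split the command string into arguments
        result = runner(argv)
        if result.returncode != 0:
            print(f"step failed ({result.returncode}): {step}")
            break
        completed.append(step)
    return completed
```

Calling `run_steps(STEPS)` invokes the real CLI, so it requires `docs2synth` on the `PATH`. Stopping at the first non-zero exit code keeps later steps from running on incomplete output.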
Installation
PyPI Installation (Recommended)
CPU Version (includes all features + MCP server):
pip install "docs2synth[cpu]"
GPU Version (includes all features + MCP server):
# First install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Then install Docs2Synth with GPU extras
pip install "docs2synth[gpu]"
Minimal Install (CLI only, no ML/MCP features):
pip install docs2synth
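After installing, it can help to confirm what actually landed in the environment. The sketch below uses only the Python standard library; the helper names are hypothetical, and the idea that the CPU/GPU extras bring in `torch` while the minimal install may not is an assumption based on the install commands above.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(dist_name="docs2synth"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_module(module_name):
    """Report whether a top-level module is importable, without importing it."""
    return find_spec(module_name) is not None


def describe_install():
    """Summarize what is present in the current environment."""
    return {
        "docs2synth": installed_version(),
        # Assumption: the ML extras pull in torch; the minimal install may not.
        "torch": has_module("torch"),
    }
```

For example, `print(describe_install())` shows the installed Docs2Synth version (or `None`) and whether `torch` is available.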
Development Setup
Use the setup script (installs uv + dependencies automatically):
# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
# Run setup script
./setup.sh # Unix/macOS/WSL
# setup.bat # Windows
The script:
- Installs uv (fast package manager)
- Creates virtual environment
- Installs dependencies (CPU or GPU)
- Sets up config
Manual development setup:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex" # Windows
# Clone and setup
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate # .venv\Scripts\activate on Windows
# Install for development
uv pip install -e ".[cpu,dev]" # or [gpu,dev] for GPU
# Setup config
cp config.example.yml config.yml
# Edit config.yml and add your API keys
Features
- Document Processing: Extract text/layout with Docling, PaddleOCR, PDFPlumber
- QA Generation: Automatic question-answer pair generation with LLMs
- Verification: Built-in meaningfulness and correctness verifiers
- Human Annotation: Streamlit UI for manual review
- Retriever Training: Train LayoutLMv3-based retrievers
- RAG Deployment: Deploy with naive or iterative strategies
- MCP Integration: Expose as Model Context Protocol server
Configuration
Create config.yml from config.example.yml:
# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."
# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/
# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini
# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10
# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2
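A quick sanity check of `config.yml` before running the pipeline can catch missing sections early. This is a hedged sketch, not Docs2Synth's own config loader: the function names are hypothetical, the key paths assume the nesting shown in the example above, and loading relies on the third-party PyYAML package. PyYAML reads `1e-5` as a string because the value has no decimal point, so the learning rate is converted explicitly.

```python
# Key paths taken from the example config above (nesting is assumed).
EXPECTED_PATHS = [
    ("agent", "keys"),
    ("preprocess", "processor"),
    ("preprocess", "input_dir"),
    ("preprocess", "output_dir"),
    ("qa", "strategies"),
    ("retriever", "learning_rate"),
    ("retriever", "epochs"),
    ("rag", "embedding", "model"),
]


def missing_paths(config, expected=EXPECTED_PATHS):
    """Return the dotted key paths from `expected` that are absent in `config`."""
    missing = []
    for path in expected:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


def load_config(path="config.yml"):
    """Load the YAML file into a dict. Requires PyYAML, imported lazily."""
    import yaml  # third-party: pip install pyyaml

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def learning_rate(config):
    """Return the retriever learning rate as a float, even if loaded as a string."""
    return float(config["retriever"]["learning_rate"])
```

A typical use would be `cfg = load_config()` followed by `print(missing_paths(cfg))`; an empty list means every expected key is present.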
Docker
# CPU
./scripts/build-docker.sh cpu
# GPU
./scripts/build-docker.sh gpu
See Docker Builds
Documentation
Full documentation: https://ai4wa.github.io/Docs2Synth/
- Complete Workflow Guide
- CLI Reference
- Document Processing
- QA Generation
- Retriever Training
- RAG Deployment
Contributing
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: pytest tests/ -v
- Run code quality checks: ./scripts/check.sh
- Submit a pull request
See Dependency Management for dev setup details.
License
MIT License - see LICENSE file for details.
Citation
If you use Docs2Synth in your research, please cite:
@software{docs2synth2024,
title = {Docs2Synth: Document Processing and Retriever Training},
author = {AI4WA Team},
year = {2024},
url = {https://github.com/AI4WA/Docs2Synth}
}
Support
- Documentation: https://ai4wa.github.io/Docs2Synth/
- Issues: https://github.com/AI4WA/Docs2Synth/issues
- Discussions: https://github.com/AI4WA/Docs2Synth/discussions