docupipe
A universal document transfer pipeline tool that retrieves content from various document sources, processes it through configurable steps, and transfers it to multiple destination systems.
Why docupipe?
In the age of AI, document management faces many challenges:
- Format conversion: Incompatible document formats between different systems
- Content migration: Batch document migration during knowledge base relocation or system switching
- Intelligent processing: Preparing standardized document content for knowledge graphs and retrieval systems
- Location transfer: Document transfer between different storage systems
docupipe provides a universal, extensible framework to solve these problems.
Key Features
- Plugin architecture: Four types of pluggable components: Source, Destination, Step, and Converter
- YAML configuration: Declarative configuration with environment variable interpolation
- State management: Support for resume and incremental sync
- Multiple document sources: DingTalk Knowledge Base, local file system, etc.
- Multiple destination systems: Local files, HindSight Memory, etc.
- Format conversion: Integration with markitdown, MinerU, and other conversion engines
- Intelligent processing: AI-powered image description and other processing steps
Installation
Via pip (recommended)
pip install docupipe
To convert PDFs with embedded images (which requires OCR), install the optional MinerU dependency:
pip install "docupipe[mineru]"
From source
# Clone the repository
git clone <repository-url>
cd docupipe
# Install dependencies (uv recommended)
pip install uv
uv pip install -e ".[dev]"
# For PDF with embedded images (requires OCR)
uv pip install -e ".[mineru]"
# Or install all optional dependencies
uv pip install -e ".[all]"
# Or use pip
pip install -e ".[dev]"
pip install -e ".[mineru]" # PDF support
Quick Start
The following example uses local files as both source and destination, requiring no external dependencies.
1. Prepare configuration file
Create docupipe.yaml:
pipelines:
  - name: quick-start
    source:
      localdrive:
        input_dir: ./input
        include: ["*.md"]
    destination:
      localdrive:
        output_dir: ./output
    steps: []
2. Prepare test files
mkdir -p input output
echo "Hello, docupipe!" > input/hello.md
3. Run the pipeline
python -m docupipe run
View the output:
cat output/hello.md
Command Line Options
python -m docupipe run [OPTIONS]
Options:
  --config PATH       Configuration file path (default: docupipe.yaml)
  --pipeline NAME     Specify pipeline name
  --resume            Skip already processed documents
  --sync              Sync only changed documents
  --dry-run           Print only, don't execute
  --state-dir PATH    State file directory (default: ./.state)
  --log-level LEVEL   Log level (DEBUG/INFO/WARNING/ERROR)
# List available components
python -m docupipe sources # List all Sources
python -m docupipe destinations # List all Destinations
Configuration
Global Configuration
# HindSight Memory configuration
hindsight:
  api_url: ${HINDSIGHT_API_URL}
  api_key: ${HINDSIGHT_API_KEY}
  bank_id: ${HINDSIGHT_BANK_ID}
# Image description configuration
image_description:
  api_key: ${IMAGE_DESCRIPTION_API_KEY}
  base_url: ${IMAGE_DESCRIPTION_BASE_URL}
  model: ${IMAGE_DESCRIPTION_MODEL:-gpt-4o}
# File type conversion rules
converters:
  extensions:
    ".pdf": mineru
    ".docx": markitdown
    ".pptx": markitdown
Pipeline Configuration
Each pipeline contains:
- source: Data source configuration
- destination: Destination configuration
- steps: List of processing steps
- options: Optional configuration (resume, sync, etc.)
Environment Variables
Create a .env file (only needed when using HindSight Memory or image description):
# HindSight Memory configuration
HINDSIGHT_API_URL=http://localhost:8888
HINDSIGHT_API_KEY=your_api_key
HINDSIGHT_BANK_ID=your_bank_id
# Image description API configuration
IMAGE_DESCRIPTION_API_KEY=your_api_key
IMAGE_DESCRIPTION_BASE_URL=http://localhost:8002/v1
IMAGE_DESCRIPTION_MODEL=gpt-4o
Environment Variable Interpolation
Supports ${VAR} and ${VAR:-default} syntax:
api_key: ${API_KEY} # Required
model: ${MODEL:-gpt-4o} # Default value
base_url: ${BASE_URL:-http://localhost:8080} # Default value
Use Cases
Use Case 1: Download documents from DingTalk Knowledge Base to the local file system
Before using DingTalk Knowledge Base, install dws (official DingTalk CLI) and complete authentication:
# Install dws (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/DingTalk-Real-AI/dingtalk-workspace-cli/main/scripts/install.sh | sh
# Or via npm
npm install -g dingtalk-workspace-cli
# Authenticate (browser QR code)
dws auth login
# In a headless environment, use the device flow
dws auth login --device
If your organization has not enabled CLI access, scan the QR code and follow the prompts to request access from an administrator. Administrators can enable it in DingTalk Open Platform → "CLI Access Management".
Configure the pipeline:
pipelines:
  - name: dingtalk-download
    source:
      dingtalk:
        # Use knowledge base name (program will auto-query ID)
        space: "Product Knowledge Base"
        # Or use space_id directly
        # space_id: "kfiwoue83nkxQXyA"
        folders: ["Product Planning Materials"]
        include_types: [DOCUMENT, ALIDOC]
    destination:
      localdrive:
        output_dir: ./output/dingtalk
    steps: []
Use Case 2: Local document format conversion
pipelines:
  - name: convert-docs
    source:
      localdrive:
        input_dir: ./output/dingtalk
        include: ["*.docx"]
    destination:
      localdrive:
        output_dir: ./output/markdown
    steps:
      - convert            # Convert to markdown
      - image_description  # Add descriptions to images
Use Case 3: Write local documents to HindSight Memory
pipelines:
  - name: to-hindsight
    source:
      localdrive:
        input_dir: ./output/markdown
        include: ["*.md"]
    destination:
      hindsight:
        context_prefix: "Product Knowledge Base"
    steps: []
Use Case 4: All in one
pipelines:
  - name: full-pipeline
    source:
      dingtalk:
        space: "Product Knowledge Base"
    destination:
      hindsight:
        context_prefix: "Knowledge Base"
    steps:
      - convert
      - image_description
Available Components
Source
- dingtalk: DingTalk Knowledge Base
- localdrive: Local file system
Destination
- localdrive: Local file system
- hindsight: HindSight Memory
Step
- convert: Document format conversion
- image_description: Image description generation
Converter
- markitdown: Common office documents
- mineru: High-quality PDF conversion
State Management
docupipe maintains state files ({source}_{dest}_state.json) for each source-dest combination, recording:
- Processed document IDs
- Document hashes (for change detection)
Run Modes
- Default mode: Process all documents
- --resume: Skip already processed documents
- --sync: Sync only changed documents and remove documents that were deleted from the source
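The three run modes can be sketched as a planning function over a state mapping. The real layout of the state file is not shown in this document, so the `{document_id: hash}` structure, the SHA-256 hash, and the names `content_hash`, `load_state`, `save_state`, and `plan` are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Hash document content for change detection (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def load_state(path: Path) -> dict[str, str]:
    """Load {document_id: hash}; a missing file means nothing was processed yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict[str, str]) -> None:
    """Write the state mapping as JSON, creating the state directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")


def plan(current: dict[str, str], state: dict[str, str],
         mode: str = "default") -> tuple[list[str], list[str]]:
    """Return (ids_to_process, ids_to_remove) for the given run mode.

    `current` maps each document ID found in the source to its content hash.
    """
    if mode == "resume":
        # Skip anything already recorded, even if its content changed.
        return [doc_id for doc_id in current if doc_id not in state], []
    if mode == "sync":
        # Process new or changed documents; remove those gone from the source.
        changed = [doc_id for doc_id, h in current.items() if state.get(doc_id) != h]
        removed = [doc_id for doc_id in state if doc_id not in current]
        return changed, removed
    # Default mode: process everything.
    return list(current), []


if __name__ == "__main__":
    state = {"a": "h1", "b": "h2", "d": "h4"}
    current = {"a": "h1", "b": "h2-new", "c": "h3"}
    for mode in ("default", "resume", "sync"):
        print(mode, plan(current, state, mode))
```

The difference between the two flags shows up for document `b` in the demo: `resume` skips it because its ID is recorded, while `sync` reprocesses it because its hash changed.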
Architecture
source.list_documents() → [DocumentMeta]
  → filter (resume skips processed / sync only syncs changes)
  → source.fetch(meta) → Document
  → steps process sequentially (convert → image_description → ...)
  → dest.write(doc)
  → state.mark_done()
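The flow above can be sketched as a single loop. The method names `list_documents`, `fetch`, and `write` come from the diagram; everything else is an assumption for illustration, including the fields of `DocumentMeta` and `Document`, the in-memory `MemorySource` and `MemoryDestination` classes, modelling a step as a plain function, and tracking state as a set of processed IDs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DocumentMeta:
    doc_id: str  # hypothetical field names
    name: str


@dataclass
class Document:
    meta: DocumentMeta
    content: str


# In this sketch a step is just a function from Document to Document.
Step = Callable[[Document], Document]


class MemorySource:
    """Hypothetical in-memory source, used only to exercise the loop."""

    def __init__(self, docs: dict[str, str]):
        self._docs = docs

    def list_documents(self) -> list[DocumentMeta]:
        return [DocumentMeta(doc_id=k, name=f"{k}.md") for k in self._docs]

    def fetch(self, meta: DocumentMeta) -> Document:
        return Document(meta=meta, content=self._docs[meta.doc_id])


@dataclass
class MemoryDestination:
    """Hypothetical destination that keeps written content in a dict."""

    written: dict[str, str] = field(default_factory=dict)

    def write(self, doc: Document) -> None:
        self.written[doc.meta.doc_id] = doc.content


def run_pipeline(source, dest, steps: list[Step], done: set[str],
                 resume: bool = False) -> list[str]:
    """List, filter, fetch, run steps in order, write, then mark as done."""
    processed = []
    for meta in source.list_documents():
        if resume and meta.doc_id in done:
            continue  # resume mode skips already processed documents
        doc = source.fetch(meta)
        for step in steps:
            doc = step(doc)
        dest.write(doc)
        done.add(meta.doc_id)  # stands in for state.mark_done()
        processed.append(meta.doc_id)
    return processed


def upper(doc: Document) -> Document:
    """Toy step standing in for convert or image_description."""
    return Document(meta=doc.meta, content=doc.content.upper())


if __name__ == "__main__":
    src = MemorySource({"1": "hello", "2": "world"})
    dst = MemoryDestination()
    print(run_pipeline(src, dst, [upper], done={"1"}, resume=True))
    print(dst.written)
```

Marking a document as done only after the write succeeds means an interrupted run would retry that document on the next `--resume`, which is the behavior this sketch assumes.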
Development
Running Tests
# Run all tests
python -m pytest tests/ -v
# Run single test file
python -m pytest tests/test_pipeline.py -v
Adding New Components
All components are registered with decorators. Adding a new component takes three steps:
- Implement the abstract base class
- Add the decorator: @register_source("name")
- Import the module in __init__.py
Example:
# sources/custom.py
from docupipe.sources.base import BaseSource
from docupipe.sources import register_source


@register_source("custom")
class CustomSource(BaseSource):
    def list_documents(self):
        # Implement document list logic
        pass

    def fetch(self, meta):
        # Implement document fetch logic
        pass
Dependencies
- Python 3.11+
- Click (CLI framework)
- Rich (Terminal output)
- PyYAML (Configuration parsing)
- markitdown (Document conversion)
- MinerU (PDF OCR conversion with embedded images)
- hindsight-client (HindSight Memory client)
- OpenAI SDK (Image description)
License
MIT License
Contributing
Issues and Pull Requests are welcome!