docupipe

A universal document transfer and processing tool that retrieves content from various document sources, processes it through configurable steps, and transfers it to multiple destination systems. Inspired by KETTLE, it treats documents and their attachments as an atomic Bundle that flows through the pipeline and is processed incrementally.

Why docupipe?

In the age of AI, document management faces many challenges:

  • Format conversion: Incompatible document formats between different systems
  • Content migration: Batch document migration during knowledge base relocation or system switching
  • Intelligent processing: Preparing standardized document content for knowledge graphs and retrieval systems
  • Location transfer: Document transfer between different storage systems

docupipe provides a universal, extensible framework to solve these problems.

Key Features

  • Plugin architecture: Four types of pluggable components: Source, Destination, Step, and Converter
  • YAML configuration: Declarative configuration with environment variable interpolation
  • State management: Support for resume and incremental sync
  • Multiple document sources: DingTalk Knowledge Base, local file system, etc.
  • Multiple destination systems: Local files, HindSight Memory, etc.
  • Format conversion: Integration with markitdown, MinerU, and other conversion engines
  • Intelligent processing: AI-powered image description and other processing steps

Installation

Via pip (recommended)

pip install docupipe

To convert PDFs with embedded images (which requires OCR), install the optional MinerU dependency:

pip install "docupipe[mineru]"

From source

# Clone the repository
git clone https://github.com/liling/docupipe.git
cd docupipe

# Install dependencies (uv recommended)
pip install uv
uv pip install -e ".[dev]"

# For PDF with embedded images (requires OCR)
uv pip install -e ".[mineru]"

# Or install all optional dependencies
uv pip install -e ".[all]"

# Or use pip
pip install -e ".[dev]"
pip install -e ".[mineru]"  # PDF support

Quick Start

The following example uses the local file system as both source and destination, so it needs no external services.

1. Prepare configuration file

Create docupipe.yaml:

pipelines:
  - name: quick-start
    source:
      localdrive:
        input_dir: ./input
        include: ["*.md"]
    destination:
      localdrive:
        output_dir: ./output
    steps: []

2. Prepare test files

mkdir -p input output
echo "Hello, docupipe!" > input/hello.md

3. Run the pipeline

python -m docupipe run

View the output:

cat output/hello.md

Command Line Options

python -m docupipe run [OPTIONS]

Options:
  --config PATH                 Configuration file path (default: docupipe.yaml)
  --pipeline NAME               Specify pipeline name
  --mode MODE                   Run mode (full/incremental/mirror)
  --resume                      Resume a full-mode run from its last checkpoint
  --change-detection STRATEGY   Change detection strategy (mtime/hash; mirror mode only)
  --dry-run                     Print what would be done without executing it
  --state-dir PATH              State file directory (default: ./.state)
  --log-level LEVEL             Log level (DEBUG/INFO/WARNING/ERROR)

# List available components
python -m docupipe sources       # List all Sources
python -m docupipe destinations  # List all Destinations

Configuration

Global Configuration

# HindSight Memory configuration
hindsight:
  api_url: ${HINDSIGHT_API_URL}
  api_key: ${HINDSIGHT_API_KEY}
  bank_id: ${HINDSIGHT_BANK_ID}

# Image description configuration
image_description:
  api_key: ${IMAGE_DESCRIPTION_API_KEY}
  base_url: ${IMAGE_DESCRIPTION_BASE_URL}
  model: ${IMAGE_DESCRIPTION_MODEL:-gpt-4o}

# File type conversion rules
converters:
  extensions:
    ".pdf": mineru
    ".docx": markitdown
    ".pptx": markitdown

Pipeline Configuration

Each pipeline contains:

  • source: Data source configuration
  • destination: Destination configuration
  • steps: List of processing steps
  • options: Optional configuration (resume, sync, etc.)

Environment Variables

Create a .env file (only needed when using HindSight Memory or image description):

# HindSight Memory configuration
HINDSIGHT_API_URL=http://localhost:8888
HINDSIGHT_API_KEY=your_api_key
HINDSIGHT_BANK_ID=your_bank_id

# Image description API configuration
IMAGE_DESCRIPTION_API_KEY=your_api_key
IMAGE_DESCRIPTION_BASE_URL=http://localhost:8002/v1
IMAGE_DESCRIPTION_MODEL=gpt-4o

Environment Variable Interpolation

Configuration values support the ${VAR} and ${VAR:-default} syntax:

api_key: ${API_KEY}                          # Required
model: ${MODEL:-gpt-4o}                      # Default value
base_url: ${BASE_URL:-http://localhost:8080} # Default value

Use Cases

Use Case 1: Download documents from DingTalk Knowledge Base to a local directory

Before using DingTalk Knowledge Base as a source, install dws (the official DingTalk CLI) and authenticate:

# Install dws (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/DingTalk-Real-AI/dingtalk-workspace-cli/main/scripts/install.sh | sh

# Or via npm
npm install -g dingtalk-workspace-cli

# Authenticate (browser QR code)
dws auth login

# In a headless environment, use the device flow
dws auth login --device

If your organization has not enabled CLI access, scan the QR code and follow the prompts to request access from your administrator. Administrators can enable it in DingTalk Open Platform → "CLI Access Management".

Configure the pipeline:

pipelines:
  - name: dingtalk-download
    source:
      dingtalk:
        # Use the knowledge base name (docupipe looks up the ID automatically)
        space: "Product Knowledge Base"
        # Or use space_id directly
        # space_id: "kfiwoue83nkxQXyA"
        folders: ["Product Planning Materials"]
        include_types: [DOCUMENT, ALIDOC]
    destination:
      localdrive:
        output_dir: ./output/dingtalk
    steps: []

Use Case 2: Local document format conversion

pipelines:
  - name: convert-docs
    source:
      localdrive:
        input_dir: ./output/dingtalk
        include: ["*.docx"]
    destination:
      localdrive:
        output_dir: ./output/markdown
    steps:
      - convert          # Convert to markdown
      - image_description # Add descriptions to images

Use Case 3: Write local documents to HindSight Memory

pipelines:
  - name: to-hindsight
    source:
      localdrive:
        input_dir: ./output/markdown
        include: ["*.md"]
    destination:
      hindsight:
        context_prefix: "Product Knowledge Base"
    steps: []

Use Case 4: All-in-one pipeline

pipelines:
  - name: full-pipeline
    source:
      dingtalk:
        space: "Product Knowledge Base"
    destination:
      hindsight:
        context_prefix: "Knowledge Base"
    steps:
      - convert
      - image_description

Available Components

Source

  • dingtalk: DingTalk Knowledge Base (wiki/doc dual mode)
  • localdrive: Local file system
  • tencent: Tencent Docs (MCP protocol)

Destination

  • localdrive: Local file system
  • hindsight: HindSight Memory

Step

  • convert: Document format conversion (via Converter)
  • image_description: AI-powered image description
  • excel_structured: Excel → structured Markdown tables
  • resolve_attachments: Resolve local file references in Markdown
  • s3_upload: Upload attachments to S3-compatible storage
  • tencent_delete: Delete processed Tencent docs (use in finalize_steps)

Converter

  • markitdown: Common office documents
  • mineru: High-quality PDF conversion (with OCR)

State Management

docupipe maintains one state file ({source}_{dest}_state.json) for each source-destination pair, recording:

  • Processed document IDs
  • Document hashes (for change detection)
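The sketch below illustrates one way such a state file could be read and updated. Only the file name pattern and the two recorded items come from the description above; the JSON keys `done` and `hashes`, the choice of SHA-256, and every function name are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def state_path(state_dir: str, source: str, dest: str) -> Path:
    """Build the per-pair state file path, e.g. .state/localdrive_hindsight_state.json."""
    return Path(state_dir) / f"{source}_{dest}_state.json"


def content_hash(data: bytes) -> str:
    """Fingerprint document content (assumption: SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


def load_state(path: Path) -> dict:
    """Load the state file, or return an empty state if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": [], "hashes": {}}  # hypothetical layout


def mark_done(path: Path, doc_id: str, data: bytes) -> dict:
    """Record a processed document ID and its content hash, then persist the state."""
    state = load_state(path)
    if doc_id not in state["done"]:
        state["done"].append(doc_id)
    state["hashes"][doc_id] = content_hash(data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

Writing the state after every document, rather than once at the end, is what makes resuming after an interruption possible in a design like this.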

Run Modes

  • full: Call source.list() to enumerate all documents and process each one
  • full with --resume: Skip source.list() and continue from the pending documents recorded in state
  • incremental: List all documents, but process only those not seen before
  • mirror: Detect changed documents (by mtime or hash) and remove documents deleted at the source
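The difference between the modes comes down to which listed documents are selected. Here is a minimal sketch of that selection for the three modes without --resume. It assumes each document is reduced to an ID plus a fingerprint (an mtime or a hash); the function `plan_run` and its data shapes are hypothetical, not docupipe's API.

```python
def plan_run(
    mode: str,
    listed: dict[str, str],
    known: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Decide which document IDs to process and which to delete.

    listed: ID -> current fingerprint, as reported by the source.
    known:  ID -> fingerprint recorded in state by earlier runs.
    Returns (to_process, to_delete).
    """
    if mode == "full":
        # Everything the source lists is processed again.
        return sorted(listed), []
    if mode == "incremental":
        # Only documents that have never been recorded.
        return sorted(doc_id for doc_id in listed if doc_id not in known), []
    if mode == "mirror":
        # New or changed documents are processed; vanished ones are deleted.
        changed = sorted(doc_id for doc_id, fp in listed.items() if known.get(doc_id) != fp)
        deleted = sorted(doc_id for doc_id in known if doc_id not in listed)
        return changed, deleted
    raise ValueError(f"unknown mode: {mode}")
```

Note that in this sketch incremental mode ignores fingerprints entirely, so a modified document is picked up only by mirror mode.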

Architecture

source.list() → [BundleMeta]
  → filter (resume skips done / incremental only new / mirror detects changes)
    → source.fetch(meta) → Bundle
      → steps process sequentially (convert → image_description → ...)
        → dest.write(bundle)
          → state.mark_done()
            → post_steps (optional, e.g. delete source)
After all documents:
  → finalize_steps (batch post-processing, e.g. Tencent doc cleanup)
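The flow above can be sketched as a plain loop. This is a simplified illustration under stated assumptions, not docupipe's code: the `Bundle` dataclass here is a stand-in with made-up fields, sources and destinations are passed as plain callables, filtering and per-document post_steps are omitted, and `run_pipeline` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Bundle:
    """Stand-in for a document plus its attachments (fields are assumptions)."""
    doc_id: str
    content: str
    attachments: dict[str, bytes] = field(default_factory=dict)


def run_pipeline(
    metas: Iterable[str],
    fetch: Callable[[str], Bundle],
    steps: Iterable[Callable[[Bundle], Bundle]],
    write: Callable[[Bundle], None],
    mark_done: Callable[[str], None],
    finalize_steps: Iterable[Callable[[list[str]], None]] = (),
) -> list[str]:
    """Fetch, transform, write, and record each document, then run batch steps."""
    steps = list(steps)
    processed: list[str] = []
    for meta in metas:
        bundle = fetch(meta)
        for step in steps:  # steps run in order, each receiving the previous output
            bundle = step(bundle)
        write(bundle)
        mark_done(bundle.doc_id)  # recorded only after a successful write
        processed.append(bundle.doc_id)
    for finalize in finalize_steps:  # batch post-processing after all documents
        finalize(processed)
    return processed
```

Marking a document as done only after the destination write succeeds means an interrupted run leaves it pending, which is what a later resume would rely on.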

Development

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run single test file
python -m pytest tests/test_pipeline.py -v

Adding New Components

All components use decorators for registration. Adding a new component requires three steps:

  1. Implement the abstract base class
  2. Add the decorator: @register_source("name")
  3. Import the new module in the package's __init__.py

Example (Source):

# sources/custom.py
from docupipe.models import Bundle, BundleMeta
from docupipe.sources import register_source
from docupipe.sources.base import SourceBase

@register_source("custom")
class CustomSource(SourceBase):
    def list(self) -> list[BundleMeta]:
        # Return document metadata list
        ...

    def fetch(self, meta: BundleMeta) -> Bundle:
        # Fetch document content by meta
        ...

See Add New Component for details.
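For readers unfamiliar with decorator-based registration, the sketch below shows the general pattern. It is a generic illustration, not the contents of docupipe.sources: the registry dict `_SOURCES`, the lookup helper `get_source`, and the `MemorySource` class are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)

# Hypothetical registry mapping a component name to its class.
_SOURCES: dict[str, type] = {}


def register_source(name: str) -> Callable[[T], T]:
    """Return a class decorator that records the class under the given name."""
    def decorator(cls: T) -> T:
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def get_source(name: str) -> type:
    """Look up a registered source class by the name used in configuration."""
    return _SOURCES[name]


@register_source("memory")
class MemorySource:
    """Toy source used only to demonstrate registration."""

    def list(self):
        return ["doc-1"]
```

A decorator like this runs when the module defining the class is imported, which is why a component that is never imported never appears in the registry.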

Documentation

See docs/ for detailed documentation:

  • 📖 Tutorial: Quick Start (DingTalk to HindSight Memory)
  • 📋 How-to: Configure Pipeline, Add New Component
  • 📚 Reference: Configuration, API Reference, Components
  • 💡 Explanation: Architecture, Run Modes

Dependencies

  • Python 3.11+
  • Click (CLI framework)
  • Rich (Terminal output)
  • PyYAML (Configuration parsing)
  • markitdown (Document conversion)
  • MinerU (PDF OCR conversion with embedded images)
  • hindsight-client (HindSight Memory client)
  • OpenAI SDK (Image description)

License

MIT License

Contributing

Issues and Pull Requests are welcome!
