docupipe

A universal document transfer and processing tool that retrieves content from various document sources, processes it through configurable steps, and transfers it to multiple destination systems. Inspired by KETTLE, it treats documents and their attachments as an atomic Bundle that flows through the pipeline and is processed incrementally.
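To make the Bundle idea concrete, here is a minimal sketch of a document that travels together with its attachments. The class and field names (`Bundle`, `Attachment`, `doc_id`, `content`, `attachments`) are assumptions made for illustration and are not taken from the docupipe source.

```python
from dataclasses import dataclass, field


@dataclass
class Attachment:
    """A file that belongs to a document, such as an embedded image."""
    name: str
    data: bytes


@dataclass
class Bundle:
    """Hypothetical container: a document and its attachments as one unit."""
    doc_id: str
    content: str
    attachments: list[Attachment] = field(default_factory=list)

    def total_size(self) -> int:
        """Size in bytes of the UTF-8 content plus every attachment."""
        content_size = len(self.content.encode("utf-8"))
        return content_size + sum(len(a.data) for a in self.attachments)


bundle = Bundle("doc-1", "# Title", [Attachment("figure.png", b"\x89PNG")])
```

Keeping attachments inside the same object means a step that rewrites image references can update the text and the files together.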

Why docupipe?

In the age of AI, document management faces many challenges:

  • Format conversion: Incompatible document formats between different systems
  • Content migration: Batch document migration during knowledge base relocation or system switching
  • Intelligent processing: Preparing standardized document content for knowledge graphs and retrieval systems
  • Location transfer: Document transfer between different storage systems

docupipe provides a universal, extensible framework to solve these problems.

Key Features

  • Plugin architecture: Four types of pluggable components: Source, Destination, Step, and Converter
  • YAML configuration: Declarative configuration with environment variable interpolation
  • State management: Support for resume and incremental sync
  • Multiple document sources: DingTalk Knowledge Base, local file system, etc.
  • Multiple destination systems: Local files, HindSight Memory, etc.
  • Format conversion: Integration with markitdown, MinerU, and other conversion engines
  • Intelligent processing: AI-powered image description and other processing steps

Installation

Via pip (recommended)

pip install docupipe

To convert PDFs with embedded images (which requires OCR), install the optional mineru extra:

pip install "docupipe[mineru]"

From source

# Clone the repository
git clone https://github.com/liling/docupipe.git
cd docupipe

# Install dependencies (uv recommended)
pip install uv
uv pip install -e ".[dev]"

# For PDF with embedded images (requires OCR)
uv pip install -e ".[mineru]"

# Or install all optional dependencies
uv pip install -e ".[all]"

# Or use pip
pip install -e ".[dev]"
pip install -e ".[mineru]"  # PDF support

Quick Start

The following example uses local files as both source and destination, requiring no external dependencies.

1. Prepare configuration file

Create docupipe.yaml:

pipelines:
  - name: quick-start
    source:
      localdrive:
        input_dir: ./input
        include: ["*.md"]
    destination:
      localdrive:
        output_dir: ./output
    steps: []

2. Prepare test files

mkdir -p input output
echo "Hello, docupipe!" > input/hello.md

3. Run the pipeline

python -m docupipe run

View the output:

cat output/hello.md

Command Line Options

python -m docupipe run [OPTIONS]

Options:
  --config PATH                 Configuration file path (default: docupipe.yaml)
  --pipeline NAME               Specify pipeline name
  --resume                      Resume a full-mode run from the last checkpoint
  --mode MODE                   Run mode (full/incremental/mirror)
  --change-detection STRATEGY   Change detection strategy (mtime/hash)
  --dry-run                     Print only, don't execute
  --state-dir PATH              State file directory (default: ./.state)
  --log-level LEVEL             Log level (DEBUG/INFO/WARNING/ERROR)

# List available components
python -m docupipe sources       # List all Sources
python -m docupipe destinations  # List all Destinations

Configuration

Global Configuration

# HindSight Memory configuration
hindsight:
  api_url: ${HINDSIGHT_API_URL}
  api_key: ${HINDSIGHT_API_KEY}
  bank_id: ${HINDSIGHT_BANK_ID}

# Image description configuration
image_description:
  api_key: ${IMAGE_DESCRIPTION_API_KEY}
  base_url: ${IMAGE_DESCRIPTION_BASE_URL}
  model: ${IMAGE_DESCRIPTION_MODEL:-gpt-4o}

# File type conversion rules
converters:
  extensions:
    ".pdf": mineru
    ".docx": markitdown
    ".pptx": markitdown

Pipeline Configuration

Each pipeline contains:

  • source: Data source configuration
  • destination: Destination configuration
  • steps: List of processing steps
  • options: Optional configuration (resume, sync, etc.)

Environment Variables

Create a .env file (only needed when using HindSight Memory or image description):

# HindSight Memory configuration
HINDSIGHT_API_URL=http://localhost:8888
HINDSIGHT_API_KEY=your_api_key
HINDSIGHT_BANK_ID=your_bank_id

# Image description API configuration
IMAGE_DESCRIPTION_API_KEY=your_api_key
IMAGE_DESCRIPTION_BASE_URL=http://localhost:8002/v1
IMAGE_DESCRIPTION_MODEL=gpt-4o

Environment Variable Interpolation

Supports ${VAR} and ${VAR:-default} syntax:

api_key: ${API_KEY}                          # Required
model: ${MODEL:-gpt-4o}                      # Default value
base_url: ${BASE_URL:-http://localhost:8080} # Default value
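The two forms can be expanded with a single regular expression. The sketch below is one possible implementation, not docupipe's own; the function name `interpolate`, the `KeyError` for a missing required variable, and the shell-like rule that an empty value falls back to the default are all assumptions.

```python
import re

# Matches ${NAME} or ${NAME:-default}; the default may be empty.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def interpolate(text: str, env: dict[str, str]) -> str:
    """Expand ${VAR} and ${VAR:-default} references in text using env."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is None:
            # Plain ${VAR}: the variable is required.
            if value is None:
                raise KeyError(f"environment variable {name} is required but not set")
            return value
        # ${VAR:-default}: fall back when the variable is unset or empty.
        return value if value else default

    return _PATTERN.sub(replace, text)
```

In real use, `env` would typically be `os.environ`; passing a plain dict keeps the function easy to test.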

Use Cases

Use Case 1: Download documents from DingTalk Knowledge Base to a local directory

Before using DingTalk Knowledge Base, install dws (official DingTalk CLI) and complete authentication:

# Install dws (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/DingTalk-Real-AI/dingtalk-workspace-cli/main/scripts/install.sh | sh

# Or via npm
npm install -g dingtalk-workspace-cli

# Authenticate (browser QR code)
dws auth login

# In a headless environment, use the device flow
dws auth login --device

If your organization has not enabled CLI access, scan the QR code and follow the prompts to request access from your administrator. Administrators can enable it in DingTalk Open Platform → "CLI Access Management".

Configure the pipeline:

pipelines:
  - name: dingtalk-download
    source:
      dingtalk:
        # Use knowledge base name (program will auto-query ID)
        space: "Product Knowledge Base"
        # Or use space_id directly
        # space_id: "kfiwoue83nkxQXyA"
        folders: ["Product Planning Materials"]
        include_types: [DOCUMENT, ALIDOC]
    destination:
      localdrive:
        output_dir: ./output/dingtalk
    steps: []

Use Case 2: Local document format conversion

pipelines:
  - name: convert-docs
    source:
      localdrive:
        input_dir: ./output/dingtalk
        include: ["*.docx"]
    destination:
      localdrive:
        output_dir: ./output/markdown
    steps:
      - convert          # Convert to markdown
      - image_description # Add descriptions to images

Use Case 3: Write local documents to HindSight Memory

pipelines:
  - name: to-hindsight
    source:
      localdrive:
        input_dir: ./output/markdown
        include: ["*.md"]
    destination:
      hindsight:
        context_prefix: "Product Knowledge Base"
    steps: []

Use Case 4: All in one

pipelines:
  - name: full-pipeline
    source:
      dingtalk:
        space: "Product Knowledge Base"
    destination:
      hindsight:
        context_prefix: "Knowledge Base"
    steps:
      - convert
      - image_description

Available Components

Source

  • dingtalk: DingTalk Knowledge Base
  • localdrive: Local file system

Destination

  • localdrive: Local file system
  • hindsight: HindSight Memory

Step

  • convert: Document format conversion
  • image_description: Image description generation

Converter

  • markitdown: Common office documents
  • mineru: High-quality PDF conversion

State Management

docupipe maintains state files ({source}_{dest}_state.json) for each source-dest combination, recording:

  • Processed document IDs
  • Document hashes (for change detection)
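As an illustration of the state file described above, the following sketch stores a SHA-256 hash per processed document and uses it for change detection. Only the file name pattern comes from this README; the class name `StateStore`, the JSON layout, and the choice of SHA-256 are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class StateStore:
    """Tracks processed document IDs and content hashes in a JSON file."""

    def __init__(self, state_dir: Path, source: str, dest: str) -> None:
        self.path = state_dir / f"{source}_{dest}_state.json"
        # Hypothetical layout: {"hashes": {doc_id: sha256_hex}}
        self.hashes: dict[str, str] = {}
        if self.path.exists():
            self.hashes = json.loads(self.path.read_text(encoding="utf-8"))["hashes"]

    def has_changed(self, doc_id: str, content: bytes) -> bool:
        """True if the document is new or its content hash differs."""
        return self.hashes.get(doc_id) != hashlib.sha256(content).hexdigest()

    def mark_done(self, doc_id: str, content: bytes) -> None:
        """Record the document as processed and persist the state file."""
        self.hashes[doc_id] = hashlib.sha256(content).hexdigest()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"hashes": self.hashes}, indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    state = StateStore(Path(tmp), "localdrive", "hindsight")
    first = state.has_changed("hello.md", b"Hello")
    state.mark_done("hello.md", b"Hello")
    # A fresh instance reads the persisted file, as a later run would.
    reloaded = StateStore(Path(tmp), "localdrive", "hindsight")
    unchanged = reloaded.has_changed("hello.md", b"Hello")
    edited = reloaded.has_changed("hello.md", b"Hello, docupipe!")
```

The keys of `hashes` double as the set of processed document IDs, so one mapping covers both items in the list above.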

Run Modes

  • Default mode (full): Process all documents
  • --resume: Skip documents that were already processed and continue with the pending ones
  • incremental: Only process newly added documents
  • mirror: Detect changes (mtime/hash) + remove deleted documents

Architecture

source.list_documents() → [DocumentMeta]
  → filter (resume skips processed / sync only syncs changes)
    → source.fetch(meta) → Document
      → steps process sequentially (convert → image_description → ...)
        → dest.write(doc)
          → state.mark_done()
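The flow above can be sketched as a short loop. The classes below are toy in-memory stand-ins, not docupipe's real plugins, and the sketch simplifies `DocumentMeta` to a plain document ID and the state object to a set.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    content: str


class ListSource:
    """Toy in-memory source standing in for a real Source plugin."""

    def __init__(self, docs: dict[str, str]) -> None:
        self._docs = docs

    def list_documents(self) -> list[str]:
        return list(self._docs)

    def fetch(self, doc_id: str) -> Document:
        return Document(doc_id, self._docs[doc_id])


class DictDestination:
    """Toy destination that collects written documents in a dict."""

    def __init__(self) -> None:
        self.written: dict[str, str] = {}

    def write(self, doc: Document) -> None:
        self.written[doc.doc_id] = doc.content


def run_pipeline(source, dest, steps, done: set[str]) -> None:
    """list -> filter -> fetch -> steps -> write -> mark done."""
    for doc_id in source.list_documents():
        if doc_id in done:  # filter: skip already processed documents
            continue
        doc = source.fetch(doc_id)
        for step in steps:  # steps run in the configured order
            doc = step(doc)
        dest.write(doc)
        done.add(doc_id)  # mark done only after a successful write


def upper_step(doc: Document) -> Document:
    """Example step: upper-case the document content."""
    return Document(doc.doc_id, doc.content.upper())


source = ListSource({"a.md": "hello", "b.md": "world"})
dest = DictDestination()
done = {"b.md"}
run_pipeline(source, dest, [upper_step], done)
```

Marking a document as done only after the write succeeds means an interrupted run can be repeated without losing documents.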

Development

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run single test file
python -m pytest tests/test_pipeline.py -v

Adding New Components

All components use decorators for registration. Adding a new component requires three steps:

  1. Implement the abstract base class
  2. Add the decorator: @register_source("name")
  3. Import the module in __init__.py

Example:

# sources/custom.py
from docupipe.sources.base import BaseSource
from docupipe.sources import register_source

@register_source("custom")
class CustomSource(BaseSource):
    def list_documents(self):
        # Return metadata for every document this source can provide
        raise NotImplementedError

    def fetch(self, meta):
        # Load and return the document described by `meta`
        raise NotImplementedError

Documentation

See docs/ for detailed documentation:

Type | Document
📖 Tutorial | Quick Start - DingTalk to Hindsight Memory
📋 How-to | Configure Pipeline, Add New Component
📚 Reference | Configuration, API Reference, Components
💡 Explanation | Architecture, Run Modes

Dependencies

  • Python 3.11+
  • Click (CLI framework)
  • Rich (Terminal output)
  • PyYAML (Configuration parsing)
  • markitdown (Document conversion)
  • MinerU (PDF OCR conversion with embedded images)
  • hindsight-client (HindSight Memory client)
  • OpenAI SDK (Image description)

License

MIT License

Contributing

Issues and Pull Requests are welcome!
