docupipe

A universal document transfer and processing tool that retrieves content from various document sources, processes it through configurable steps, and transfers it to multiple destination systems. Inspired by KETTLE, it treats documents and their attachments as an atomic Bundle that flows through the pipeline and is processed incrementally.

Why docupipe?

In the age of AI, document management faces several recurring challenges:

Format conversion: Incompatible document formats between different systems
Content migration: Batch document migration during knowledge base relocation or system switching
Intelligent processing: Preparing standardized document content for knowledge graphs and retrieval systems
Location transfer: Document transfer between different storage systems

docupipe provides a universal, extensible framework to solve these problems.

Key Features

Plugin architecture: Four pluggable component types (Source, Destination, Step, and Converter)
YAML configuration: Declarative configuration with environment variable interpolation
State management: Support for resume and incremental sync
Multiple document sources: DingTalk Knowledge Base, local file system, etc.
Multiple destination systems: Local files, HindSight Memory, etc.
Format conversion: Integration with markitdown, MinerU, and other conversion engines
Intelligent processing: AI-powered image description and other processing steps

Installation

Via pip (recommended)

pip install docupipe

To convert PDFs with embedded images (which requires OCR), install the optional mineru extra:

pip install "docupipe[mineru]"

From source

# Clone the repository
git clone https://github.com/liling/docupipe.git
cd docupipe

# Install dependencies (uv recommended)
pip install uv
uv pip install -e ".[dev]"

# For PDF with embedded images (requires OCR)
uv pip install -e ".[mineru]"

# Or install all optional dependencies
uv pip install -e ".[all]"

# Or use pip
pip install -e ".[dev]"
pip install -e ".[mineru]"  # PDF support

Quick Start

The following example uses local files as both source and destination, requiring no external dependencies.

1. Prepare configuration file

Create docupipe.yaml:

pipelines:
  - name: quick-start
    source:
      localdrive:
        input_dir: ./input
        include: ["*.md"]
    destination:
      localdrive:
        output_dir: ./output
    steps: []

2. Prepare test files

mkdir -p input output
echo "Hello, docupipe!" > input/hello.md

3. Run the pipeline

python -m docupipe run

View the output:

cat output/hello.md

Command Line Options

python -m docupipe run [OPTIONS]

Options:
  --config PATH                 Configuration file path (default: docupipe.yaml)
  --pipeline NAME               Specify pipeline name
  --mode MODE                   Run mode (full/incremental/mirror)
  --resume                      Resume from checkpoint (full mode only)
  --change-detection STRATEGY   Change detection strategy (mtime/hash, mirror mode only)
  --dry-run                     Print only, don't execute
  --state-dir PATH              State file directory (default: ./.state)
  --log-level LEVEL             Log level (DEBUG/INFO/WARNING/ERROR)

# List available components
python -m docupipe sources       # List all Sources
python -m docupipe destinations  # List all Destinations

Configuration

Global Configuration

# HindSight Memory configuration
hindsight:
  api_url: ${HINDSIGHT_API_URL}
  api_key: ${HINDSIGHT_API_KEY}
  bank_id: ${HINDSIGHT_BANK_ID}

# Image description configuration
image_description:
  api_key: ${IMAGE_DESCRIPTION_API_KEY}
  base_url: ${IMAGE_DESCRIPTION_BASE_URL}
  model: ${IMAGE_DESCRIPTION_MODEL:-gpt-4o}

# File type conversion rules
converters:
  extensions:
    ".pdf": mineru
    ".docx": markitdown
    ".pptx": markitdown

Pipeline Configuration

Each pipeline contains:

source: Data source configuration
destination: Destination configuration
steps: List of processing steps
options: Optional configuration (resume, sync, etc.)
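
The structure above can be checked before a run. The following is a minimal sketch, not docupipe's own validation: `missing_pipeline_keys` is a hypothetical helper, and treating source, destination, and steps as required is an assumption based on the list above.

```python
# Keys assumed to be required in every pipeline entry (an assumption, see lead-in).
REQUIRED_KEYS = ("source", "destination", "steps")


def missing_pipeline_keys(pipeline: dict) -> list[str]:
    """Return the required keys that are absent from a parsed pipeline mapping."""
    return [key for key in REQUIRED_KEYS if key not in pipeline]


# The Quick Start pipeline, written as the mapping a YAML parser would produce.
quick_start = {
    "name": "quick-start",
    "source": {"localdrive": {"input_dir": "./input", "include": ["*.md"]}},
    "destination": {"localdrive": {"output_dir": "./output"}},
    "steps": [],
}
```

An empty result from `missing_pipeline_keys(quick_start)` means the entry has all three keys.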

Environment Variables

Create a .env file (only needed when using HindSight Memory or image description):

# HindSight Memory configuration
HINDSIGHT_API_URL=http://localhost:8888
HINDSIGHT_API_KEY=your_api_key
HINDSIGHT_BANK_ID=your_bank_id

# Image description API configuration
IMAGE_DESCRIPTION_API_KEY=your_api_key
IMAGE_DESCRIPTION_BASE_URL=http://localhost:8002/v1
IMAGE_DESCRIPTION_MODEL=gpt-4o

Environment Variable Interpolation

Supports ${VAR} and ${VAR:-default} syntax:

api_key: ${API_KEY}                          # Required
model: ${MODEL:-gpt-4o}                      # Default value
base_url: ${BASE_URL:-http://localhost:8080} # Default value
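
To make the two forms concrete, here is a minimal sketch of how such interpolation could work. It is not docupipe's implementation: the function name `interpolate` is hypothetical, raising `KeyError` for a missing required variable is an assumption, and the default is applied when the variable is unset or empty, following the shell convention for `:-`.

```python
import os
import re
from collections.abc import Mapping
from typing import Optional

# Matches ${VAR} and ${VAR:-default}; the default may contain anything except "}".
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def interpolate(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand ${VAR} and ${VAR:-default} references in a configuration value."""
    env = os.environ if env is None else env

    def expand(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is not None:
            # Shell-style ":-": fall back when the variable is unset or empty.
            return value if value else default
        if value is None:
            # Assumption: a required variable that is missing is an error.
            raise KeyError(f"environment variable {name} is not set")
        return value

    return _VAR.sub(expand, text)
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to test.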

Use Cases

Use Case 1: Download documents from DingTalk Knowledge Base to a local directory

Before using DingTalk Knowledge Base, install dws (the official DingTalk CLI) and complete authentication:

# Install dws (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/DingTalk-Real-AI/dingtalk-workspace-cli/main/scripts/install.sh | sh

# Or via npm
npm install -g dingtalk-workspace-cli

# Authenticate (browser QR code)
dws auth login

# In a headless environment, use the device flow
dws auth login --device

If your organization has not enabled CLI access, scan the QR code and follow the prompts to request access from an administrator. Administrators can enable it in DingTalk Open Platform → "CLI Access Management".

Configure the pipeline:

pipelines:
  - name: dingtalk-download
    source:
      dingtalk:
        # Use knowledge base name (program will auto-query ID)
        space: "Product Knowledge Base"
        # Or use space_id directly
        # space_id: "kfiwoue83nkxQXyA"
        folders: ["Product Planning Materials"]
        include_types: [DOCUMENT, ALIDOC]
    destination:
      localdrive:
        output_dir: ./output/dingtalk
    steps: []

Use Case 2: Local document format conversion

pipelines:
  - name: convert-docs
    source:
      localdrive:
        input_dir: ./output/dingtalk
        include: ["*.docx"]
    destination:
      localdrive:
        output_dir: ./output/markdown
    steps:
      - convert          # Convert to markdown
      - image_description # Add descriptions to images

Use Case 3: Write local documents to HindSight Memory

pipelines:
  - name: to-hindsight
    source:
      localdrive:
        input_dir: ./output/markdown
        include: ["*.md"]
    destination:
      hindsight:
        context_prefix: "Product Knowledge Base"
    steps: []

Use Case 4: All in one

pipelines:
  - name: full-pipeline
    source:
      dingtalk:
        space: "Product Knowledge Base"
    destination:
      hindsight:
        context_prefix: "Knowledge Base"
    steps:
      - convert
      - image_description

Available Components

Source

dingtalk: DingTalk Knowledge Base (wiki/doc dual mode)
localdrive: Local file system
tencent: Tencent Docs (MCP protocol)

Destination

localdrive: Local file system
hindsight: HindSight Memory

Step

convert: Document format conversion (via Converter)
image_description: AI-powered image description
excel_structured: Excel → structured Markdown tables
resolve_attachments: Resolve local file references in Markdown
s3_upload: Upload attachments to S3-compatible storage
tencent_delete: Delete processed Tencent docs (use in finalize_steps)

Converter

markitdown: Common office documents
mineru: High-quality PDF conversion (with OCR)
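
The `converters.extensions` rules in the global configuration amount to a lookup from file suffix to converter name. A minimal sketch of that lookup follows; `pick_converter` is a hypothetical helper rather than part of docupipe's documented API, and matching the suffix case-insensitively is an assumption.

```python
from pathlib import Path
from typing import Optional

# Mirrors the converters.extensions mapping from the Global Configuration example.
EXTENSION_RULES = {
    ".pdf": "mineru",
    ".docx": "markitdown",
    ".pptx": "markitdown",
}


def pick_converter(filename: str, rules: dict[str, str] = EXTENSION_RULES) -> Optional[str]:
    """Return the converter name for a file, or None when no rule matches."""
    suffix = Path(filename).suffix.lower()  # assumption: rules ignore letter case
    return rules.get(suffix)
```

A file with no matching rule, such as a Markdown file, yields `None` here, which a caller could treat as "pass through unchanged".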

State Management

docupipe maintains one state file ({source}_{dest}_state.json) per source-destination combination. Each file records:

Processed document IDs
Document hashes (for change detection)
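
As an illustration of what such a state file might hold, here is a minimal sketch. The JSON field names (`done`, `hashes`), the use of SHA-256, and the helper names are all assumptions made for this example; the actual file layout is not documented here.

```python
import hashlib
import json
from pathlib import Path


def state_path(state_dir: str, source: str, dest: str) -> Path:
    """Build the state file path following the {source}_{dest}_state.json pattern."""
    return Path(state_dir) / f"{source}_{dest}_state.json"


def content_hash(data: bytes) -> str:
    """Hash document content for change detection (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def mark_done(path: Path, doc_id: str, data: bytes) -> dict:
    """Record a processed document ID and its content hash, then persist the state."""
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"done": [], "hashes": {}}  # hypothetical layout
    if doc_id not in state["done"]:
        state["done"].append(doc_id)
    state["hashes"][doc_id] = content_hash(data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))
    return state
```

Keeping IDs and hashes together lets one file answer both "was this document processed?" and "has it changed since?".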

Run Modes

full: Call source.list() for all documents, process each one
full + --resume: Skip list(), continue from pending state
incremental: List all, only process newly added documents
mirror: Detect changes (mtime/hash) + remove deleted documents
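
The difference between the modes comes down to which listed documents are selected. The sketch below models that selection under simplifying assumptions: documents are represented as an ID-to-hash mapping, `select_for_processing` is a hypothetical function, and full mode with --resume is left out because it skips listing entirely.

```python
def select_for_processing(
    mode: str,
    listed: dict[str, str],
    recorded: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (IDs to process, IDs to remove) for a run mode.

    listed:   document ID -> current hash, as reported by the source
    recorded: document ID -> hash stored in the state file from earlier runs
    """
    if mode == "full":
        # Process every listed document.
        return sorted(listed), []
    if mode == "incremental":
        # Process only documents that have never been seen before.
        return sorted(doc for doc in listed if doc not in recorded), []
    if mode == "mirror":
        # Process new or changed documents and report deleted ones.
        changed = sorted(doc for doc, h in listed.items() if recorded.get(doc) != h)
        removed = sorted(doc for doc in recorded if doc not in listed)
        return changed, removed
    raise ValueError(f"unknown mode: {mode}")
```

Note that in this sketch incremental mode ignores changes to documents it has already seen, whereas mirror mode picks them up.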

Architecture

source.list() → [BundleMeta]
  → filter (resume skips done / incremental only new / mirror detects changes)
    → source.fetch(meta) → Bundle
      → steps process sequentially (convert → image_description → ...)
        → dest.write(bundle)
          → state.mark_done()
            → post_steps (optional, e.g. delete source)
After all documents:
  → finalize_steps (batch post-processing, e.g. Tencent doc cleanup)
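
The per-document part of the flow above can be sketched as a plain loop. This is a simplified illustration, not docupipe's code: the `Bundle` fields and the `run_pipeline` signature are hypothetical, document IDs stand in for BundleMeta, and post_steps and finalize_steps are omitted.

```python
from collections.abc import Iterable
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Bundle:
    """A document and its attachments, handled as one unit (fields are hypothetical)."""
    doc_id: str
    content: str
    attachments: dict[str, bytes] = field(default_factory=dict)


Step = Callable[[Bundle], Bundle]


def run_pipeline(
    ids: Iterable[str],
    fetch: Callable[[str], Bundle],
    steps: list[Step],
    write: Callable[[Bundle], None],
    done: set[str],
) -> set[str]:
    """List -> filter -> fetch -> steps -> write -> mark done, one document at a time."""
    for doc_id in ids:
        if doc_id in done:
            continue  # filter: skip documents that were already processed
        bundle = fetch(doc_id)
        for step in steps:
            bundle = step(bundle)  # steps run sequentially on the same bundle
        write(bundle)
        done.add(doc_id)  # mark done only after a successful write
    return done
```

Marking a document as done only after the write succeeds is what makes resuming safe in this sketch: an interrupted document is simply processed again on the next run.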

Development

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run single test file
python -m pytest tests/test_pipeline.py -v

Adding New Components

All components are registered with decorators. Adding a new component takes three steps:

Implement the abstract base class
Add the decorator, for example @register_source("name")
Import the new module in __init__.py so that the decorator runs

Example (Source):

# sources/custom.py
from docupipe.models import Bundle, BundleMeta
from docupipe.sources import register_source
from docupipe.sources.base import SourceBase

@register_source("custom")
class CustomSource(SourceBase):
    def list(self) -> list[BundleMeta]:
        # Return document metadata list
        ...

    def fetch(self, meta: BundleMeta) -> Bundle:
        # Fetch document content by meta
        ...

See Add New Component for details.
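
For readers unfamiliar with decorator-based registration, the general pattern looks roughly like the sketch below. It is a stand-in written for illustration, not docupipe's actual `register_source`: the registry dictionary, `create_source`, and `MemorySource` are all hypothetical.

```python
# Registry mapping a component name to its class (hypothetical stand-in).
_SOURCES: dict[str, type] = {}


def register_source(name: str):
    """Class decorator that records a source class under the given name."""
    def decorator(cls: type) -> type:
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def create_source(name: str, **config):
    """Look up a registered source by name and instantiate it with its config."""
    return _SOURCES[name](**config)


@register_source("memory")
class MemorySource:
    """Toy source that serves documents from an in-memory mapping."""

    def __init__(self, docs=None):
        self.docs = docs or {}

    def list(self):
        return sorted(self.docs)

    def fetch(self, doc_id):
        return self.docs[doc_id]
```

Because registration happens when the decorated class is defined, a module that is never imported never registers its component, which is why the import step matters.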

Documentation

See docs/ for detailed documentation:

Type	Document
📖 Tutorial	Quick Start — DingTalk to Hindsight Memory
📋 How-to	Configure Pipeline, Add New Component
📚 Reference	Configuration, API Reference, Components
💡 Explanation	Architecture, Run Modes

Dependencies

Python 3.11+
Click (CLI framework)
Rich (Terminal output)
PyYAML (Configuration parsing)
markitdown (Document conversion)
MinerU (PDF OCR conversion with embedded images)
hindsight-client (HindSight Memory client)
OpenAI SDK (Image description)

License

MIT License

Contributing

Issues and Pull Requests are welcome!

Release history

0.1.3: May 22, 2026 (this version)
0.1.2: May 20, 2026
0.1.1: May 16, 2026
0.1.0: May 16, 2026
