docupipe

A universal document transfer pipeline tool that retrieves content from various document sources, processes it through configurable steps, and transfers it to multiple destination systems.

Why docupipe?

In the age of AI, document management faces many challenges:

  • Format conversion: Incompatible document formats between different systems
  • Content migration: Batch document migration during knowledge base relocation or system switching
  • Intelligent processing: Preparing standardized document content for knowledge graphs and retrieval systems
  • Location transfer: Document transfer between different storage systems

docupipe provides a universal, extensible framework to solve these problems.

Key Features

  • Plugin architecture: Four types of pluggable components: Source, Destination, Step, and Converter
  • YAML configuration: Declarative configuration with environment variable interpolation
  • State management: Support for resume and incremental sync
  • Multiple document sources: DingTalk Knowledge Base, local file system, etc.
  • Multiple destination systems: Local files, HindSight Memory, etc.
  • Format conversion: Integration with markitdown, MinerU, and other conversion engines
  • Intelligent processing: AI-powered image description and other processing steps

Installation

Via pip (recommended)

pip install docupipe

To convert PDFs with embedded images (which requires OCR), install the optional mineru extra:

pip install "docupipe[mineru]"

From source

# Clone the repository
git clone <repository-url>
cd docupipe

# Install dependencies (uv recommended)
pip install uv
uv pip install -e ".[dev]"

# For PDF with embedded images (requires OCR)
uv pip install -e ".[mineru]"

# Or install all optional dependencies
uv pip install -e ".[all]"

# Or use pip
pip install -e ".[dev]"
pip install -e ".[mineru]"  # PDF support

Quick Start

The following example uses the local file system as both source and destination, so it needs no external services.

1. Prepare configuration file

Create docupipe.yaml:

pipelines:
  - name: quick-start
    source:
      localdrive:
        input_dir: ./input
        include: ["*.md"]
    destination:
      localdrive:
        output_dir: ./output
    steps: []

2. Prepare test files

mkdir -p input output
echo "Hello, docupipe!" > input/hello.md

3. Run the pipeline

python -m docupipe run

View the output:

cat output/hello.md

Command Line Options

python -m docupipe run [OPTIONS]

Options:
  --config PATH              Configuration file path (default: docupipe.yaml)
  --pipeline NAME            Specify pipeline name
  --resume                   Skip already processed documents
  --sync                     Sync only changed documents
  --dry-run                  Print only, don't execute
  --state-dir PATH           State file directory (default: ./.state)
  --log-level LEVEL          Log level (DEBUG/INFO/WARNING/ERROR)

# List available components
python -m docupipe sources       # List all Sources
python -m docupipe destinations  # List all Destinations

Configuration

Global Configuration

# HindSight Memory configuration
hindsight:
  api_url: ${HINDSIGHT_API_URL}
  api_key: ${HINDSIGHT_API_KEY}
  bank_id: ${HINDSIGHT_BANK_ID}

# Image description configuration
image_description:
  api_key: ${IMAGE_DESCRIPTION_API_KEY}
  base_url: ${IMAGE_DESCRIPTION_BASE_URL}
  model: ${IMAGE_DESCRIPTION_MODEL:-gpt-4o}

# File type conversion rules
converters:
  extensions:
    ".pdf": mineru
    ".docx": markitdown
    ".pptx": markitdown

Pipeline Configuration

Each pipeline contains:

  • source: Data source configuration
  • destination: Destination configuration
  • steps: List of processing steps
  • options: Optional configuration (resume, sync, etc.)

Environment Variables

Create a .env file (only needed when using HindSight Memory or image description):

# HindSight Memory configuration
HINDSIGHT_API_URL=http://localhost:8888
HINDSIGHT_API_KEY=your_api_key
HINDSIGHT_BANK_ID=your_bank_id

# Image description API configuration
IMAGE_DESCRIPTION_API_KEY=your_api_key
IMAGE_DESCRIPTION_BASE_URL=http://localhost:8002/v1
IMAGE_DESCRIPTION_MODEL=gpt-4o

Environment Variable Interpolation

Configuration values support the ${VAR} and ${VAR:-default} syntax:

api_key: ${API_KEY}                          # Required
model: ${MODEL:-gpt-4o}                      # Default value
base_url: ${BASE_URL:-http://localhost:8080} # Default value

Use Cases

Use Case 1: Download documents from DingTalk Knowledge Base to the local file system

Before using the DingTalk Knowledge Base source, install dws (the official DingTalk CLI) and authenticate:

# Install dws (macOS / Linux)
curl -fsSL https://raw.githubusercontent.com/DingTalk-Real-AI/dingtalk-workspace-cli/main/scripts/install.sh | sh

# Or via npm
npm install -g dingtalk-workspace-cli

# Authenticate (browser QR code)
dws auth login

# In a headless environment, use the device flow
dws auth login --device

If your organization has not enabled CLI access, scan the QR code and follow the prompts to request access from your administrator. Administrators can enable it in DingTalk Open Platform → "CLI Access Management".

Configure the pipeline:

pipelines:
  - name: dingtalk-download
    source:
      dingtalk:
        # Use knowledge base name (program will auto-query ID)
        space: "Product Knowledge Base"
        # Or use space_id directly
        # space_id: "kfiwoue83nkxQXyA"
        folders: ["Product Planning Materials"]
        include_types: [DOCUMENT, ALIDOC]
    destination:
      localdrive:
        output_dir: ./output/dingtalk
    steps: []

Use Case 2: Local document format conversion

pipelines:
  - name: convert-docs
    source:
      localdrive:
        input_dir: ./output/dingtalk
        include: ["*.docx"]
    destination:
      localdrive:
        output_dir: ./output/markdown
    steps:
      - convert          # Convert to markdown
      - image_description # Add descriptions to images

Use Case 3: Write local documents to HindSight Memory

pipelines:
  - name: to-hindsight
    source:
      localdrive:
        input_dir: ./output/markdown
        include: ["*.md"]
    destination:
      hindsight:
        context_prefix: "Product Knowledge Base"
    steps: []

Use Case 4: All in one (DingTalk Knowledge Base to HindSight Memory)

pipelines:
  - name: full-pipeline
    source:
      dingtalk:
        space: "Product Knowledge Base"
    destination:
      hindsight:
        context_prefix: "Knowledge Base"
    steps:
      - convert
      - image_description

Available Components

Source

  • dingtalk: DingTalk Knowledge Base
  • localdrive: Local file system

Destination

  • localdrive: Local file system
  • hindsight: HindSight Memory

Step

  • convert: Document format conversion
  • image_description: Image description generation

Converter

  • markitdown: Common office documents
  • mineru: High-quality PDF conversion

State Management

docupipe maintains one state file ({source}_{dest}_state.json) per source-destination combination, recording:

  • Processed document IDs
  • Document hashes (for change detection)

Run Modes

  • Default mode: Process all documents
  • --resume: Skip already processed documents
  • --sync: Sync only changed documents and remove documents that were deleted from the source
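
The run modes above can be read as a small planning function over the recorded hashes. The sketch below is an illustration under assumptions: `state_path` and `plan` are hypothetical helpers, the state is modeled as a plain `{document_id: hash}` mapping, and the actual layout of docupipe's state files may differ.

```python
from pathlib import Path


def state_path(source: str, dest: str, state_dir: str = ".state") -> Path:
    """Build the state file path from the {source}_{dest}_state.json pattern."""
    return Path(state_dir) / f"{source}_{dest}_state.json"


def plan(current: dict[str, str], state: dict[str, str], mode: str = "default"):
    """Return (ids_to_process, ids_to_remove) for a run mode.

    current: document ID -> hash, as listed by the source right now.
    state:   document ID -> hash, as recorded by previous runs.
    """
    if mode == "resume":
        # Skip everything already processed, whether or not it changed.
        return [doc_id for doc_id in current if doc_id not in state], []
    if mode == "sync":
        # Process new or changed documents; remove those gone from the source.
        changed = [doc_id for doc_id, h in current.items() if state.get(doc_id) != h]
        deleted = [doc_id for doc_id in state if doc_id not in current]
        return changed, deleted
    # Default mode: process all documents.
    return list(current), []


recorded = {"a": "1", "b": "2", "c": "3"}
listed = {"a": "1", "b": "9", "d": "4"}
print(plan(listed, recorded, "resume"))  # (['d'], [])
print(plan(listed, recorded, "sync"))    # (['b', 'd'], ['c'])
```

The example shows the difference between the two flags: resume ignores the changed document `b`, while sync picks it up and also reports `c` for removal.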

Architecture

source.list_documents() → [DocumentMeta]
  → filter (resume skips processed / sync only syncs changes)
    → source.fetch(meta) → Document
      → steps process sequentially (convert → image_description → ...)
        → dest.write(doc)
          → state.mark_done()
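
The flow above can be sketched as a single loop. This is a simplified illustration, not docupipe's code: the `DocumentMeta` and `Document` fields, the in-memory source and destination, and the `run_pipeline` signature are all assumptions, and a plain set stands in for the state object.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass
class DocumentMeta:
    doc_id: str  # hypothetical field names
    name: str


@dataclass
class Document:
    meta: DocumentMeta
    content: str


class Source(Protocol):
    def list_documents(self) -> Iterable[DocumentMeta]: ...
    def fetch(self, meta: DocumentMeta) -> Document: ...


class Destination(Protocol):
    def write(self, doc: Document) -> None: ...


Step = Callable[[Document], Document]


def run_pipeline(source: Source, dest: Destination, steps: list[Step],
                 done: set[str], resume: bool = False) -> list[str]:
    """List, filter, fetch, apply steps in order, write, then mark as done."""
    processed = []
    for meta in source.list_documents():
        if resume and meta.doc_id in done:
            continue  # resume skips already processed documents
        doc = source.fetch(meta)
        for step in steps:
            doc = step(doc)
        dest.write(doc)
        done.add(meta.doc_id)  # stands in for state.mark_done()
        processed.append(meta.doc_id)
    return processed


class MemorySource:
    """Toy source backed by a dict of {doc_id: content}."""

    def __init__(self, docs: dict[str, str]):
        self._docs = docs

    def list_documents(self) -> list[DocumentMeta]:
        return [DocumentMeta(doc_id=k, name=k) for k in self._docs]

    def fetch(self, meta: DocumentMeta) -> Document:
        return Document(meta=meta, content=self._docs[meta.doc_id])


class ListDestination:
    """Toy destination that collects written documents in a list."""

    def __init__(self) -> None:
        self.written: list[Document] = []

    def write(self, doc: Document) -> None:
        self.written.append(doc)


def upper(doc: Document) -> Document:
    """Toy step: upper-case the content."""
    return Document(doc.meta, doc.content.upper())
```

Marking a document as done only after the write succeeds means a failed run can be picked up again with `--resume` without losing documents.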

Development

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run single test file
python -m pytest tests/test_pipeline.py -v

Adding New Components

All components use decorators for registration. Adding a new component requires three steps:

  1. Implement the abstract base class
  2. Add the decorator: @register_source("name")
  3. Import it in __init__.py

Example:

# sources/custom.py
from docupipe.sources.base import BaseSource
from docupipe.sources import register_source

@register_source("custom")
class CustomSource(BaseSource):
    def list_documents(self):
        # Implement document list logic
        pass

    def fetch(self, meta):
        # Implement document fetch logic
        pass
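
To show why the import step matters, here is a generic sketch of decorator-based registration. It is self-contained and does not use docupipe's real registry: `_SOURCES`, `create_source`, and this version of `register_source` are hypothetical stand-ins for illustration. The key point is that the decorator runs when the module is imported, so a component that is never imported is never registered.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)

# Hypothetical registry; docupipe's own registry may be structured differently.
_SOURCES: dict[str, type] = {}


def register_source(name: str) -> Callable[[T], T]:
    """Return a class decorator that records the class under `name`."""

    def decorator(cls: T) -> T:
        if name in _SOURCES:
            raise ValueError(f"source {name!r} is already registered")
        _SOURCES[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


def create_source(name: str, **config):
    """Look up a registered source by name and instantiate it."""
    try:
        cls = _SOURCES[name]
    except KeyError:
        raise KeyError(f"unknown source {name!r}; available: {sorted(_SOURCES)}") from None
    return cls(**config)


@register_source("memory")
class MemorySource:
    def __init__(self, docs=None):
        self.docs = docs or []


source = create_source("memory", docs=["hello.md"])
print(type(source).__name__, source.docs)
```

With this pattern, the name used in the decorator is what a YAML key such as `localdrive` or `dingtalk` would be resolved against.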

Dependencies

  • Python 3.11+
  • Click (CLI framework)
  • Rich (Terminal output)
  • PyYAML (Configuration parsing)
  • markitdown (Document conversion)
  • MinerU (PDF OCR conversion with embedded images)
  • hindsight-client (HindSight Memory client)
  • OpenAI SDK (Image description)

License

MIT License

Contributing

Issues and Pull Requests are welcome!
