
Document-native DAG runner for preprocessing PDFs, Office files, and email messages into structured evidence artifacts.


docflow-sager


docflow-sager is a document-native workflow engine for preprocessing files into structured evidence artifacts, plain-text exports, and downstream indexes.

The package is published on PyPI as docflow-sager, but the Python import name is docflow:

import docflow

What It Does

DocFlow is designed for document pipelines in the same way PromptFlow is designed for prompt pipelines.

Instead of chaining prompt nodes, DocFlow chains document-processing steps such as:

  • scanning files
  • parsing documents
  • normalizing parser output into evidence atoms
  • enriching metadata
  • constructing an evidence graph
  • building semantic, structural, and spatial indexes
  • exporting structured outputs
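As a rough illustration of what "chaining document-processing steps" means, the stages above can be thought of as functions that each consume the previous step's output. The step functions below are hypothetical stand-ins, not DocFlow's real internal API:

```python
from functools import reduce

# Hypothetical step functions -- illustrative only, not DocFlow's actual API.
def scan(state):
    return {**state, "files": ["report.pdf", "memo.docx"]}

def parse(state):
    return {**state, "parsed": len(state["files"])}

def normalize(state):
    return {**state, "atoms": state["parsed"] * 10}

# Chain the steps in order, threading a shared state dict through each one.
pipeline = [scan, parse, normalize]
result = reduce(lambda state, step: step(state), pipeline, {})
print(result["atoms"])  # 20
```

In the real engine the ordering comes from the flow's DAG rather than a fixed list, but the data-threading idea is the same.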

It is a strong fit for:

  • document intelligence
  • preprocessing large file corpora
  • evidence-aware retrieval pipelines
  • text extraction and indexing workflows
  • AI-ready document transformation

Supported Input Types

DocFlow currently supports:

  • pdf
  • docx
  • doc
  • xlsx
  • xls
  • msg

Implementation notes:

  • pdf uses pypdf
  • docx uses Open XML parsing
  • doc uses OLE-based text extraction as a pragmatic fallback
  • xlsx uses Open XML parsing
  • xls uses xlrd
  • msg uses extract-msg

Installation

pip install docflow-sager

Quick Start

Python API

from docflow import run_flow

result = run_flow(
    "document_preprocess.flow.dag.yaml",
    inputs={
        "source_dir": "/path/to/documents",
        "output_dir": "/tmp/docflow_run",
    },
)

print(result["flow_name"])
print(result["final_output"])

CLI

The package exposes a docflow command-line entrypoint.

Run a flow:

docflow run document_preprocess.flow.dag.yaml --source-dir /path/to/documents --output-dir /tmp/docflow_run --trace

Inspect the compiled graph:

docflow graph document_preprocess.flow.dag.yaml

Example Flow

name: document_preprocess
variables:
  dataset_name: docflow_demo
  output_dir: docflow_runs/demo

nodes:
  - name: scan
    step: scan_documents
    config:
      source_dir: ${inputs.source_dir}
      file_types: [pdf, docx, doc, xlsx, xls, msg]

  - name: parse
    step: parse_documents
    depends_on: [scan]
    config:
      dataset_name: ${dataset_name}

  - name: normalize
    step: normalize_atoms
    depends_on: [parse]

  - name: metadata
    step: enrich_metadata
    depends_on: [normalize]

  - name: graph
    step: build_evidence_graph
    depends_on: [metadata]

  - name: indexes
    step: build_indexes
    depends_on: [graph]

  - name: write
    step: write_outputs
    depends_on: [graph, indexes]
    config:
      output_dir: ${output_dir}

outputs:
  final_node: write
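The `depends_on` edges above define an execution order that the engine must resolve before running. As a sketch of the general idea (not DocFlow's internals), the same dependency graph can be topologically sorted with the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# depends_on edges from the example flow: node -> set of prerequisites.
deps = {
    "scan": set(),
    "parse": {"scan"},
    "normalize": {"parse"},
    "metadata": {"normalize"},
    "graph": {"metadata"},
    "indexes": {"graph"},
    "write": {"graph", "indexes"},
}

# static_order() yields nodes so every prerequisite comes before its dependents.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

For this flow the order is fully determined by the chain: scan, parse, normalize, metadata, graph, indexes, write.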

Output Structure

A typical DocFlow run writes:

  • documents/*.json
  • text/*.txt
  • indexes/semantic_index.json
  • indexes/structural_index.json
  • indexes/spatial_index.json
  • manifest.json

The manifest records:

  • processed document count
  • parse error count
  • per-document stats
  • skipped-file error details
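A manifest along those lines could look like the following. The field names here are assumptions for illustration; consult an actual run's manifest.json for the real schema:

```python
import json

# Illustrative manifest shape -- field names are assumptions, not the real schema.
manifest = {
    "processed_documents": 42,
    "parse_errors": 3,
    "documents": {
        "report.pdf": {"pages": 10, "atoms": 180},
    },
    "skipped": [
        {"file": "broken.pdf", "reason": "malformed xref table"},
    ],
}

# Round-trip through JSON, as a write step would when emitting manifest.json.
restored = json.loads(json.dumps(manifest))
print(restored["processed_documents"])  # 42
```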

Core Concepts

Evidence Atoms

Each extracted line or structured unit becomes an evidence atom containing:

  • document id
  • page number
  • text
  • reading order
  • parser provenance
  • confidence
  • role label
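One natural in-memory representation of such an atom is a small dataclass. This is a sketch; DocFlow's actual class names and field names may differ:

```python
from dataclasses import dataclass, asdict

@dataclass
class EvidenceAtom:
    # Fields mirror the list above; names are illustrative, not DocFlow's API.
    doc_id: str
    page: int
    text: str
    reading_order: int
    parser: str        # provenance: which format adapter produced this atom
    confidence: float
    role: str          # e.g. "heading", "body", "table-cell"

atom = EvidenceAtom("doc-001", 3, "Quarterly revenue rose 12%.", 17,
                    "pypdf", 0.92, "body")
print(asdict(atom)["role"])  # body
```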

Evidence Graph

DocFlow constructs lightweight graph edges between atoms, including:

  • adjacency in reading order
  • containment under headings
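Given atoms sorted by reading order with role labels, the two edge types can be sketched as follows (illustrative logic, not DocFlow's implementation):

```python
# Each atom: (id, role). Adjacency links consecutive atoms in reading order;
# containment links each body atom to the most recent heading above it.
atoms = [("a1", "heading"), ("a2", "body"), ("a3", "body"),
         ("a4", "heading"), ("a5", "body")]

adjacency = [(atoms[i][0], atoms[i + 1][0]) for i in range(len(atoms) - 1)]

containment = []
current_heading = None
for atom_id, role in atoms:
    if role == "heading":
        current_heading = atom_id
    elif current_heading is not None:
        containment.append((current_heading, atom_id))

print(adjacency)    # [('a1', 'a2'), ('a2', 'a3'), ('a3', 'a4'), ('a4', 'a5')]
print(containment)  # [('a1', 'a2'), ('a1', 'a3'), ('a4', 'a5')]
```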

This makes the outputs better suited for downstream retrieval and reasoning than plain flattened text alone.

Multi-Format Exports

DocFlow outputs can later be exported or indexed as:

  • plain text
  • JSON
  • CSV
  • XLSX
  • SDF-compatible structured artifacts
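As a generic sketch of the CSV case (not DocFlow's actual export code), a list of atoms can be flattened with the standard library:

```python
import csv
import io

# Hypothetical atom dicts; the keys are illustrative, not DocFlow's schema.
atoms = [
    {"doc_id": "doc-001", "page": 1, "text": "Executive Summary", "role": "heading"},
    {"doc_id": "doc-001", "page": 1, "text": "Revenue grew 12%.", "role": "body"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["doc_id", "page", "text", "role"])
writer.writeheader()
writer.writerows(atoms)

print(buf.getvalue().splitlines()[0])  # doc_id,page,text,role
```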

Notes and Limitations

  • .doc support is text-oriented, not layout-faithful
  • .msg support focuses on message metadata and body extraction
  • malformed PDFs are skipped and logged in the manifest instead of aborting the full batch
  • parser quality depends on the underlying format adapter

Package Identity

  • PyPI package: docflow-sager
  • Python import: docflow
  • CLI command: docflow

Repository

Source repository:

https://github.com/Meet2147/SAGER

