
Document-native DAG runner for preprocessing PDFs, Office files, and email messages into structured evidence artifacts.


docflow-sager


docflow-sager is a document-native workflow engine for preprocessing files into structured evidence artifacts, plain-text exports, and downstream indexes.

The package is published on PyPI as docflow-sager, but the Python import name is docflow:

import docflow

What It Does

DocFlow is designed for document pipelines in the same way PromptFlow is designed for prompt pipelines.

Instead of chaining prompt nodes, DocFlow chains document-processing steps such as:

  • scanning files
  • parsing documents
  • normalizing parser output into evidence atoms
  • enriching metadata
  • constructing an evidence graph
  • building semantic, structural, and spatial indexes
  • exporting structured outputs
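As a rough illustration of what "chaining document-processing steps" means, the stages above can be thought of as functions that each consume the previous step's output. The step functions below are hypothetical stand-ins, not DocFlow's real internal API:

```python
from functools import reduce

# Hypothetical step functions -- illustrative only, not DocFlow's actual API.
def scan(state):
    return {**state, "files": ["report.pdf", "memo.docx"]}

def parse(state):
    return {**state, "parsed": len(state["files"])}

def normalize(state):
    return {**state, "atoms": state["parsed"] * 10}

# Chain the steps in order, threading a shared state dict through each one.
pipeline = [scan, parse, normalize]
result = reduce(lambda state, step: step(state), pipeline, {})
print(result["atoms"])  # 20
```

In the real engine the ordering comes from the flow's DAG rather than a fixed list, but the data-threading idea is the same.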

It is a strong fit for:

  • document intelligence
  • preprocessing large file corpora
  • evidence-aware retrieval pipelines
  • text extraction and indexing workflows
  • AI-ready document transformation

Supported Input Types

DocFlow currently supports:

  • pdf
  • docx
  • doc
  • xlsx
  • xls
  • msg

Implementation notes:

  • pdf uses pypdf
  • docx uses Open XML parsing
  • doc uses OLE-based text extraction as a pragmatic fallback
  • xlsx uses Open XML parsing
  • xls uses xlrd
  • msg uses extract-msg

Installation

pip install docflow-sager

Quick Start

Python API

from docflow import run_flow

result = run_flow(
    "document_preprocess.flow.dag.yaml",
    inputs={
        "source_dir": "/path/to/documents",
        "output_dir": "/tmp/docflow_run",
    },
)

print(result["flow_name"])
print(result["final_output"])

CLI

The package exposes a docflow command-line entrypoint.

Run a flow:

docflow run document_preprocess.flow.dag.yaml --source-dir /path/to/documents --output-dir /tmp/docflow_run --trace

Inspect the compiled graph:

docflow graph document_preprocess.flow.dag.yaml

Example Flow

name: document_preprocess
variables:
  dataset_name: docflow_demo
  output_dir: docflow_runs/demo

nodes:
  - name: scan
    step: scan_documents
    config:
      source_dir: ${inputs.source_dir}
      file_types: [pdf, docx, doc, xlsx, xls, msg]

  - name: parse
    step: parse_documents
    depends_on: [scan]
    config:
      dataset_name: ${dataset_name}

  - name: normalize
    step: normalize_atoms
    depends_on: [parse]

  - name: metadata
    step: enrich_metadata
    depends_on: [normalize]

  - name: graph
    step: build_evidence_graph
    depends_on: [metadata]

  - name: indexes
    step: build_indexes
    depends_on: [graph]

  - name: write
    step: write_outputs
    depends_on: [graph, indexes]
    config:
      output_dir: ${output_dir}

outputs:
  final_node: write
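The `depends_on` edges above define an execution order that the engine must resolve before running. As a sketch of the general idea (not DocFlow's internals), the same dependency graph can be topologically sorted with the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# depends_on edges from the example flow: node -> set of prerequisites.
deps = {
    "scan": set(),
    "parse": {"scan"},
    "normalize": {"parse"},
    "metadata": {"normalize"},
    "graph": {"metadata"},
    "indexes": {"graph"},
    "write": {"graph", "indexes"},
}

# static_order() yields nodes so every prerequisite comes before its dependents.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

For this flow the order is fully determined by the chain: scan, parse, normalize, metadata, graph, indexes, write.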

Output Structure

A typical DocFlow run writes:

  • documents/*.json
  • text/*.txt
  • indexes/semantic_index.json
  • indexes/structural_index.json
  • indexes/spatial_index.json
  • manifest.json

The manifest records:

  • processed document count
  • parse error count
  • per-document stats
  • skipped-file error details
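A manifest along those lines could look like the following. The field names here are assumptions for illustration; consult an actual run's manifest.json for the real schema:

```python
import json

# Illustrative manifest shape -- field names are assumptions, not the real schema.
manifest = {
    "processed_documents": 42,
    "parse_errors": 3,
    "documents": {
        "report.pdf": {"pages": 10, "atoms": 180},
    },
    "skipped": [
        {"file": "broken.pdf", "reason": "malformed xref table"},
    ],
}

# Round-trip through JSON, as a write step would when emitting manifest.json.
restored = json.loads(json.dumps(manifest))
print(restored["processed_documents"])  # 42
```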

Core Concepts

Evidence Atoms

Each extracted line or structured unit becomes an evidence atom containing:

  • document id
  • page number
  • text
  • reading order
  • parser provenance
  • confidence
  • role label
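One natural in-memory representation of such an atom is a small dataclass. This is a sketch; DocFlow's actual class names and field names may differ:

```python
from dataclasses import dataclass, asdict

@dataclass
class EvidenceAtom:
    # Fields mirror the list above; names are illustrative, not DocFlow's API.
    doc_id: str
    page: int
    text: str
    reading_order: int
    parser: str        # provenance: which format adapter produced this atom
    confidence: float
    role: str          # e.g. "heading", "body", "table-cell"

atom = EvidenceAtom("doc-001", 3, "Quarterly revenue rose 12%.", 17,
                    "pypdf", 0.92, "body")
print(asdict(atom)["role"])  # body
```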

Evidence Graph

DocFlow constructs lightweight graph edges between atoms, including:

  • adjacency in reading order
  • containment under headings
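Given atoms sorted by reading order with role labels, the two edge types can be sketched as follows (illustrative logic, not DocFlow's implementation):

```python
# Each atom: (id, role). Adjacency links consecutive atoms in reading order;
# containment links each body atom to the most recent heading above it.
atoms = [("a1", "heading"), ("a2", "body"), ("a3", "body"),
         ("a4", "heading"), ("a5", "body")]

adjacency = [(atoms[i][0], atoms[i + 1][0]) for i in range(len(atoms) - 1)]

containment = []
current_heading = None
for atom_id, role in atoms:
    if role == "heading":
        current_heading = atom_id
    elif current_heading is not None:
        containment.append((current_heading, atom_id))

print(adjacency)    # [('a1', 'a2'), ('a2', 'a3'), ('a3', 'a4'), ('a4', 'a5')]
print(containment)  # [('a1', 'a2'), ('a1', 'a3'), ('a4', 'a5')]
```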

This makes the outputs better suited for downstream retrieval and reasoning than plain flattened text alone.

Multi-Format Exports

DocFlow outputs can later be exported or indexed as:

  • plain text
  • JSON
  • CSV
  • XLSX
  • SDF-compatible structured artifacts
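As a generic sketch of the CSV case (not DocFlow's actual export code), a list of atoms can be flattened with the standard library:

```python
import csv
import io

# Hypothetical atom dicts; the keys are illustrative, not DocFlow's schema.
atoms = [
    {"doc_id": "doc-001", "page": 1, "text": "Executive Summary", "role": "heading"},
    {"doc_id": "doc-001", "page": 1, "text": "Revenue grew 12%.", "role": "body"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["doc_id", "page", "text", "role"])
writer.writeheader()
writer.writerows(atoms)

print(buf.getvalue().splitlines()[0])  # doc_id,page,text,role
```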

Notes and Limitations

  • .doc support is text-oriented, not layout-faithful
  • .msg support focuses on message metadata and body extraction
  • malformed PDFs are skipped and logged in the manifest instead of aborting the full batch
  • parser quality depends on the underlying format adapter

Package Identity

  • PyPI package: docflow-sager
  • Python import: docflow
  • CLI command: docflow

Repository

Source repository:

https://github.com/Meet2147/SAGER

