Document-native DAG runner for preprocessing PDFs, Office files, and email messages into structured evidence artifacts.
docflow-sager
docflow-sager is a document-native workflow engine for preprocessing files into structured evidence artifacts, plain-text exports, and downstream indexes.
The package name on PyPI is docflow-sager, while the Python import stays:
import docflow
What It Does
DocFlow is designed for document pipelines in the same way PromptFlow is designed for prompt pipelines.
Instead of chaining prompt nodes, DocFlow chains document-processing steps such as:
- scanning files
- parsing documents
- normalizing parser output into evidence atoms
- enriching metadata
- constructing an evidence graph
- building semantic, structural, and spatial indexes
- exporting structured outputs
It is a strong fit for:
- document intelligence
- preprocessing large file corpora
- evidence-aware retrieval pipelines
- text extraction and indexing workflows
- AI-ready document transformation
Supported Input Types
DocFlow currently supports:
- pdf
- docx
- doc
- xlsx
- xls
- msg
Implementation notes:
- pdf uses pypdf
- docx uses Open XML parsing
- doc uses OLE-based text extraction as a pragmatic fallback
- xlsx uses Open XML parsing
- xls uses xlrd
- msg uses extract-msg
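As a rough illustration of what a scan step has to do with this list, the sketch below walks a directory and keeps only files with a supported extension. It is not DocFlow's own scanner: the helper name `find_supported_files` and the case-insensitive suffix match are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Extensions taken from the supported input types listed above.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".msg"}


def find_supported_files(source_dir):
    """Hypothetical helper: recursively collect files with a supported suffix."""
    root = Path(source_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Small self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.PDF", "notes.txt", "sub/b.msg"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_supported_files(root)]

print(found)
```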
Installation
pip install docflow-sager
Quick Start
Python API
from docflow import run_flow

result = run_flow(
    "document_preprocess.flow.dag.yaml",
    inputs={
        "source_dir": "/path/to/documents",
        "output_dir": "/tmp/docflow_run",
    },
)

print(result["flow_name"])
print(result["final_output"])
CLI
The package exposes a docflow command-line entrypoint.
Run a flow:
docflow run document_preprocess.flow.dag.yaml --source-dir /path/to/documents --output-dir /tmp/docflow_run --trace
Inspect the compiled graph:
docflow graph document_preprocess.flow.dag.yaml
Example Flow
name: document_preprocess
variables:
  dataset_name: docflow_demo
  output_dir: docflow_runs/demo
nodes:
  - name: scan
    step: scan_documents
    config:
      source_dir: ${inputs.source_dir}
      file_types: [pdf, docx, doc, xlsx, xls, msg]
  - name: parse
    step: parse_documents
    depends_on: [scan]
    config:
      dataset_name: ${dataset_name}
  - name: normalize
    step: normalize_atoms
    depends_on: [parse]
  - name: metadata
    step: enrich_metadata
    depends_on: [normalize]
  - name: graph
    step: build_evidence_graph
    depends_on: [metadata]
  - name: indexes
    step: build_indexes
    depends_on: [graph]
  - name: write
    step: write_outputs
    depends_on: [graph, indexes]
    config:
      output_dir: ${output_dir}
outputs:
  final_node: write
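To see how the depends_on entries determine a run order, the sketch below feeds the node names from the example flow into the standard library's `graphlib.TopologicalSorter`. This only illustrates DAG ordering in general; how DocFlow schedules nodes internally is not documented here, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Node names and depends_on lists copied from the example flow above.
FLOW_NODES = {
    "scan": [],
    "parse": ["scan"],
    "normalize": ["parse"],
    "metadata": ["normalize"],
    "graph": ["metadata"],
    "indexes": ["graph"],
    "write": ["graph", "indexes"],
}


def execution_order(nodes):
    """Return one valid run order; graphlib raises CycleError if there is a cycle."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(nodes).static_order())


order = execution_order(FLOW_NODES)
print(order)
```

Because every node in this flow depends on the one before it, only one valid order exists, and it ends at the write node named under outputs.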
Output Structure
A typical DocFlow run writes:
- documents/*.json
- text/*.txt
- indexes/semantic_index.json
- indexes/structural_index.json
- indexes/spatial_index.json
- manifest.json
The manifest records:
- processed document count
- parse error count
- per-document stats
- skipped-file error details
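The sketch below shows one way a batch runner could collect these four kinds of information while skipping files that fail to parse. Every key name (`processed_documents`, `parse_errors`, `documents`, `skipped`) and both helper functions are assumptions for illustration; the real manifest.json schema may differ.

```python
import json


def build_manifest(paths, parse):
    """Hypothetical helper: parse each path, recording failures instead of raising."""
    manifest = {
        "processed_documents": 0,
        "parse_errors": 0,
        "documents": {},  # per-document stats
        "skipped": [],    # skipped-file error details
    }
    for path in paths:
        try:
            atoms = parse(path)
        except Exception as exc:
            # A bad file is logged and skipped so the rest of the batch still runs.
            manifest["parse_errors"] += 1
            manifest["skipped"].append({"path": path, "error": str(exc)})
            continue
        manifest["processed_documents"] += 1
        manifest["documents"][path] = {"atom_count": len(atoms)}
    return manifest


def fake_parse(path):
    """Stand-in parser for the demo; it fails on one file on purpose."""
    if path.endswith("broken.pdf"):
        raise ValueError("malformed PDF")
    return ["line one", "line two"]


manifest = build_manifest(["a.pdf", "broken.pdf", "b.docx"], fake_parse)
print(json.dumps(manifest, indent=2))
```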
Core Concepts
Evidence Atoms
Each extracted line or structured unit becomes an evidence atom containing:
- document id
- page number
- text
- reading order
- parser provenance
- confidence
- role label
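A minimal sketch of such a record as a Python dataclass follows. The field names and types simply mirror the list above and are assumptions; they are not taken from DocFlow's actual output schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceAtom:
    # Hypothetical field names mirroring the list above.
    document_id: str
    page_number: int
    text: str
    reading_order: int
    parser: str        # parser provenance, e.g. "pypdf"
    confidence: float
    role: str          # role label, e.g. "heading" or "paragraph"


atom = EvidenceAtom(
    document_id="doc-001",
    page_number=1,
    text="1. Introduction",
    reading_order=0,
    parser="pypdf",
    confidence=0.98,
    role="heading",
)

# asdict() gives a plain dict that can be serialized to JSON.
record = asdict(atom)
print(record)
```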
Evidence Graph
DocFlow constructs lightweight graph edges between atoms, including:
- adjacency in reading order
- containment under headings
This makes the outputs better suited for downstream retrieval and reasoning than plain flattened text alone.
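The two edge types can be sketched as follows. This is an illustrative reconstruction, not DocFlow's implementation: the `Atom` tuple, the edge labels `next` and `contains`, and the rule that a heading contains every atom up to the next heading are all assumptions.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    # Hypothetical minimal atom: just enough fields to build edges.
    atom_id: str
    reading_order: int
    role: str


def build_edges(atoms):
    """Return (source, label, target) triples for adjacency and containment."""
    ordered = sorted(atoms, key=lambda a: a.reading_order)
    edges = []

    # Adjacency: link each atom to the one that follows it in reading order.
    for prev, nxt in zip(ordered, ordered[1:]):
        edges.append((prev.atom_id, "next", nxt.atom_id))

    # Containment: attach each non-heading atom to the most recent heading.
    current_heading = None
    for atom in ordered:
        if atom.role == "heading":
            current_heading = atom
        elif current_heading is not None:
            edges.append((current_heading.atom_id, "contains", atom.atom_id))
    return edges


atoms = [
    Atom("h1", 0, "heading"),
    Atom("p1", 1, "paragraph"),
    Atom("p2", 2, "paragraph"),
    Atom("h2", 3, "heading"),
    Atom("p3", 4, "paragraph"),
]
edges = build_edges(atoms)
for edge in edges:
    print(edge)
```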
Multi-Format Exports
DocFlow outputs can later be exported or indexed as:
- plain text
- JSON
- CSV
- XLSX
- SDF-compatible structured artifacts
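As one example of a later export, the sketch below flattens atom records into CSV with the standard library. The helper `atoms_to_csv` and the column names are hypothetical; DocFlow's own export interface is not described here.

```python
import csv
import io


def atoms_to_csv(atoms):
    """Hypothetical helper: serialize a list of atom dicts to CSV text."""
    if not atoms:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(atoms[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(atoms)
    return buffer.getvalue()


rows = [
    {"document_id": "doc-001", "page_number": 1, "text": "Quarterly report, Q3"},
    {"document_id": "doc-001", "page_number": 1, "text": "Revenue grew."},
]
csv_text = atoms_to_csv(rows)
print(csv_text)
```

The csv module quotes any field that contains a comma, so text atoms survive the round trip without custom escaping.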
Notes and Limitations
- .doc support is text-oriented, not layout-faithful
- .msg support focuses on message metadata and body extraction
- malformed PDFs are skipped and logged in the manifest instead of aborting the full batch
- parser quality depends on the underlying format adapter
Package Identity
- PyPI package: docflow-sager
- Python import: docflow
- CLI command: docflow
Repository
Source repository: