State-of-the-art document to HTML converter using deep learning

Reflow

Reflow converts PDF and DOCX documents to semantic HTML, with an emphasis on preserving the original layout.

Install

pip install -e .

Optional ML features (layout detection, OCR, table transformer):

pip install -e '.[ml]'

Quick Start

reflow convert input.pdf -o output.html

With OCR and guardrails:

reflow convert input.pdf \
	--ocr \
	--max-pages 200 \
	--max-output-mb 50 \
	--max-raster-mp 40 \
	--device cpu \
	-o output.html

Reliability Guardrails

Reflow supports resource limits that make production behavior predictable:

  • max_pages: fail fast when the document has more pages than the configured limit.
  • max_output_bytes: fail fast when the generated HTML exceeds the configured size.
  • max_raster_megapixels: cap raster operations to avoid memory spikes.
  • Atomic output writes: output is written to a temporary file that then replaces the target, so an interrupted run does not leave a half-written output file.
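
The size limit and the atomic write described above can be sketched in a few lines. This is a minimal illustration, not Reflow's internal code: the function name write_html_atomically and the OutputTooLargeError class are hypothetical, and the real implementation may differ.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class OutputTooLargeError(RuntimeError):
    """Hypothetical error type; Reflow's real exception names may differ."""


def write_html_atomically(
    html: str, dest: str, max_output_bytes: Optional[int] = None
) -> int:
    """Write html to dest via a temp file in the same directory, then rename."""
    data = html.encode("utf-8")
    if max_output_bytes is not None and len(data) > max_output_bytes:
        # Fail fast before touching the filesystem.
        raise OutputTooLargeError(
            f"output is {len(data)} bytes, limit is {max_output_bytes}"
        )

    dest_path = Path(dest)
    # Keep the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, dest_path)
    except BaseException:
        # Remove the partial temp file; any existing dest is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return len(data)
```

Checking the size before opening any file means an oversized result never replaces a previous good output.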

Observability

  • A structured diagnostics callback (diagnostics_callback) can capture stage timings and pipeline stats.
  • The CLI prints a timing/stats summary by default (toggle with --emit-stats / --no-emit-stats).
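
One way to consume a diagnostics callback is to aggregate timings per stage. The sketch below is self-contained and uses simulated events, because the payload that Reflow passes to diagnostics_callback is not documented here. It assumes each event is a dict with "stage" and "seconds" keys; the class name StageTimings and those key names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class StageTimings:
    """Collects per-stage timings from diagnostic events.

    Assumption: each event is a mapping with "stage" and "seconds" keys.
    The real payload passed to diagnostics_callback may look different.
    """

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.counts: Dict[str, int] = defaultdict(int)

    def __call__(self, event: Dict[str, Any]) -> None:
        stage = event.get("stage")
        seconds = event.get("seconds")
        if stage is None or seconds is None:
            return  # ignore events this sketch does not understand
        self.totals[stage] += float(seconds)
        self.counts[stage] += 1

    def summary(self) -> List[Tuple[str, float, int]]:
        # Rows of (stage, total seconds, event count), slowest stage first.
        return sorted(
            ((s, self.totals[s], self.counts[s]) for s in self.totals),
            key=lambda row: row[1],
            reverse=True,
        )


timings = StageTimings()
# Simulated events. With Reflow, the object could be passed as
# diagnostics_callback=timings, assuming the callback takes one argument.
for event in [
    {"stage": "parse", "seconds": 0.12},
    {"stage": "ocr", "seconds": 1.5},
    {"stage": "parse", "seconds": 0.08},
    {"note": "unrelated"},
]:
    timings(event)
```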

Edge Case Handling

  • Invalid page selections are validated against document page count.
  • Scanned pages can trigger OCR automatically when no extractable text is found.
  • Layout/table/OCR model failures degrade gracefully with warnings instead of a hard crash.
  • Text spans with broken encoding are rasterized to preserve visual fidelity.
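
Validating a page selection against the page count, as in the first item above, might look like the following. The selection syntax ("1-3,7", 1-based, inclusive ranges) and the function name parse_page_selection are assumptions made for illustration; Reflow's accepted syntax may differ.

```python
from typing import List, Set


def parse_page_selection(spec: str, page_count: int) -> List[int]:
    """Parse a 1-based selection such as "1-3,7" into sorted page numbers."""
    if page_count < 1:
        raise ValueError("document has no pages")
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page selection {spec!r}")
        start_text, sep, end_text = part.partition("-")
        try:
            start = int(start_text)
            # A bare number like "7" is treated as the range 7-7.
            end = int(end_text) if sep else start
        except ValueError:
            raise ValueError(f"invalid page entry {part!r}") from None
        if start < 1 or end < start:
            raise ValueError(f"invalid page range {part!r}")
        if end > page_count:
            raise ValueError(
                f"page {end} is out of range; document has {page_count} pages"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Rejecting the whole selection on the first bad entry keeps the error close to the user's input instead of surfacing later in the pipeline.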

API

import reflow

html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,
    max_output_bytes=20 * 1024 * 1024,
)

Testing

pytest -q

With coverage gate (same as CI):

pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80

Golden snapshot tests:

pytest -m golden

Regenerate golden snapshots (only when the output change is intended):

REFLOW_UPDATE_GOLDENS=1 pytest -m golden
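
For readers unfamiliar with the pattern, a golden test compares fresh output against a stored snapshot and rewrites the snapshot only when an environment variable opts in. The helper below is a generic sketch of that idea, not Reflow's actual test harness; the name check_golden is hypothetical, and only the REFLOW_UPDATE_GOLDENS variable comes from the command above.

```python
import os
from pathlib import Path


def check_golden(actual: str, golden_path: Path) -> None:
    """Compare actual against a stored snapshot, or rewrite it on request."""
    if os.environ.get("REFLOW_UPDATE_GOLDENS") == "1":
        # Update mode: overwrite the snapshot and skip the comparison.
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    if not golden_path.exists():
        raise AssertionError(
            f"missing golden file {golden_path}; "
            "rerun with REFLOW_UPDATE_GOLDENS=1 to create it"
        )
    expected = golden_path.read_text(encoding="utf-8")
    if actual != expected:
        raise AssertionError(f"output differs from {golden_path}")
```

Review the resulting snapshot diff before committing, since update mode accepts whatever the converter currently produces.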

Benchmarking

Single document:

reflow benchmark sample.pdf --repeat 3 --output-json bench.json

Directory benchmark:

reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json

Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
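
The latency and size statistics mentioned above can be reproduced with the standard library. This is a rough sketch under stated assumptions: the function name benchmark and the result keys are hypothetical and are not the schema of bench.json, and run stands in for whatever call produces the HTML string.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run: Callable[[], str], repeat: int = 3) -> Dict[str, float]:
    """Call run() repeat times and summarize latency and output size."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    latencies: List[float] = []
    sizes: List[int] = []
    for _ in range(repeat):
        start = time.perf_counter()
        output = run()
        latencies.append(time.perf_counter() - start)
        sizes.append(len(output.encode("utf-8")))
    return {
        "runs": repeat,
        "latency_min_s": min(latencies),
        "latency_median_s": statistics.median(latencies),
        "latency_max_s": max(latencies),
        "output_bytes_min": min(sizes),
        "output_bytes_max": max(sizes),
    }
```

The median is less sensitive than the mean to a single slow run, such as the first run that loads model weights.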

CI

GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.

Notes

  • Requires Python 3.12 or later.
  • Best results for complex layouts require optional ML dependencies.
