State-of-the-art document to HTML converter using deep learning

Reflow

Reflow converts PDF and DOCX documents to semantic HTML with strong layout preservation.

Install

pip install -e .

Optional ML features (layout detection, OCR, table transformer):

pip install -e '.[ml]'

Quick Start

reflow convert input.pdf -o output.html

With OCR and guardrails:

reflow convert input.pdf \
    --ocr \
    --max-pages 200 \
    --max-output-mb 50 \
    --max-raster-mp 40 \
    --device cpu \
    -o output.html

Reliability Guardrails

Reflow now supports resource limits to make production behavior predictable:

  • max_pages: fail fast when the document has more pages than the configured limit.
  • max_output_bytes: fail fast when the generated HTML exceeds the configured size.
  • max_raster_megapixels: caps raster operations to avoid memory spikes.
  • Atomic output writes: output is written to a temporary file that then replaces the target, so an interrupted run does not leave a partial file.

Observability

  • A structured diagnostics callback (diagnostics_callback) can capture stage timings and pipeline stats.
  • The CLI prints a timing/stats summary by default (toggle with --emit-stats / --no-emit-stats).

Edge Case Handling

  • Page selections are validated against the document's page count, and invalid selections are rejected.
  • Scanned pages can trigger OCR automatically when no extractable text is found.
  • Layout/table/OCR model failures degrade gracefully with warnings instead of crashing.
  • Spans with broken text encoding are rasterized to preserve visual fidelity.
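
As an illustration of validating a page selection against the page count, here is a minimal sketch. The "1-3,5" selection syntax and the parse_page_selection helper are assumptions for the example; Reflow's own selection format and error handling are not documented here.

```python
def parse_page_selection(spec: str, page_count: int) -> list[int]:
    """Expand a selection like "1-3,5" into sorted 1-based page numbers.

    Raises ValueError if an entry is malformed or falls outside 1..page_count.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page selection {spec!r}")
        # "3" is a single page; "1-3" is an inclusive range.
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start > end:
            raise ValueError(f"reversed range {part!r}")
        if start < 1 or end > page_count:
            raise ValueError(
                f"{part!r} is outside the document's {page_count} page(s)"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Rejecting the whole selection up front, rather than silently clamping it, keeps the failure visible before any conversion work starts.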

API

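import reflow

# Convert a PDF with OCR, layout detection, and table recognition enabled.
html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,                      # guardrail: page limit
    max_output_bytes=20 * 1024 * 1024,  # guardrail: 20 MiB of HTML
)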

Testing

pytest -q

With coverage gate (same as CI):

pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80

Golden snapshot tests:

pytest -m golden

Update golden snapshots intentionally:

REFLOW_UPDATE_GOLDENS=1 pytest -m golden
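
For readers unfamiliar with the pattern, a golden snapshot check gated on REFLOW_UPDATE_GOLDENS can be sketched as follows. This is a generic illustration, not the project's actual test helper: assert_matches_golden is a hypothetical name, and the real suite may store and compare snapshots differently.

```python
import os
from pathlib import Path


def assert_matches_golden(actual: str, golden_path: Path) -> None:
    """Compare actual against a stored snapshot, or rewrite it on request."""
    if os.environ.get("REFLOW_UPDATE_GOLDENS") == "1":
        # Update mode: overwrite the snapshot instead of comparing.
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, f"output differs from {golden_path}"
```

Keeping the update path behind an explicit environment variable means a normal test run can never silently overwrite a snapshot.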

Benchmarking

Single document:

reflow benchmark sample.pdf --repeat 3 --output-json bench.json

Directory benchmark:

reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json

Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
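
The min/median/max latency figures correspond to a summary like the one sketched below. This does not read bench.json, whose schema is not documented here; summarize_latency and its key names are hypothetical, and the function simply times any callable.

```python
import statistics
import time
from typing import Callable


def summarize_latency(run: Callable[[], object], repeat: int = 3) -> dict[str, float]:
    """Call run() repeat times and report min/median/max wall-clock seconds."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

The median is less sensitive than the mean to a single slow run, such as a first run that includes model loading, which is one reason to report it alongside min and max.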

CI

GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.

Notes

  • Requires Python 3.12+.
  • Best results for complex layouts require optional ML dependencies.

Download files

Source Distribution

pyreflow-0.3.0.tar.gz (167.6 kB)

Uploaded Source

Built Distribution

pyreflow-0.3.0-py3-none-any.whl (71.5 kB)

Uploaded Python 3

File details

Details for the file pyreflow-0.3.0.tar.gz.

File metadata

  • Download URL: pyreflow-0.3.0.tar.gz
  • Upload date:
  • Size: 167.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for pyreflow-0.3.0.tar.gz

Algorithm     Hash digest
SHA256        50ab6567637733b14623be2492444b9b973827a45e2a306cf1ef8d44a525aa42
MD5           d662b4cbb78fe5a647ee2c70ff60fcf0
BLAKE2b-256   944a51c5d317539fca758b32a400b1ca8054d55b1a4e97fce0aae266a21e1854

File details

Details for the file pyreflow-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: pyreflow-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 71.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for pyreflow-0.3.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        ea21cd9396892d1392fe00af5fd47047febc8cf30e37286b89409c8724c12f32
MD5           a2742a74eb5fae53ab143fc28cdfdc13
BLAKE2b-256   94d6016911e2e56359a228b17cb1a73e894ee8e19b91f93cd400d815370d7900
