State-of-the-art document to HTML converter using deep learning

Reflow

State-of-the-art PDF/DOCX to semantic HTML converter with strong layout preservation.

Install

pip install -e .

Optional ML features (layout detection, OCR, table transformer):

pip install -e '.[ml]'

Quick Start

reflow convert input.pdf -o output.html

With OCR and guardrails:

reflow convert input.pdf \
	--ocr \
	--max-pages 200 \
	--max-output-mb 50 \
	--max-raster-mp 40 \
	--device cpu \
	-o output.html

Reliability Guardrails

Reflow supports resource limits that keep production behavior predictable:

  • max_pages: fail fast when the document has more pages than the configured limit.
  • max_output_bytes: fail fast when the generated HTML exceeds the configured size.
  • max_raster_megapixels: cap raster operations to avoid memory spikes.
  • Atomic output writes: output is written to a temporary file that then replaces the target file.
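
The temp-file replace pattern behind atomic output writes can be sketched with the standard library alone. This is an illustration of the general technique, not Reflow's actual code; the helper name write_atomic is hypothetical.

```python
import os
import tempfile


def write_atomic(path: str, text: str) -> None:
    """Write text to path so readers never see a partially written file.

    Hypothetical helper for illustration; not part of the Reflow API.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing destination; on POSIX the
        # rename is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the replace.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the write fails midway, the previous output file (if any) is left untouched and the temporary file is removed.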

Observability

  • A structured diagnostics callback (diagnostics_callback) can capture stage timings and pipeline stats.
  • The CLI prints a timing/stats summary by default; use --emit-stats / --no-emit-stats to turn it on or off.
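
A callback that aggregates stage timings might look like the sketch below. The event shape is an assumption: this README does not document what diagnostics_callback receives, so the "stage" and "seconds" keys are hypothetical and the events are fed in by hand.

```python
from collections import defaultdict
from typing import Any


class StageTimings:
    """Accumulates per-stage durations from diagnostic events."""

    def __init__(self) -> None:
        self.seconds: dict[str, float] = defaultdict(float)

    def __call__(self, event: dict[str, Any]) -> None:
        # ASSUMPTION: each event is a dict with "stage" and "seconds" keys.
        # Check the real payload before relying on these names.
        stage = event.get("stage")
        if stage is not None:
            self.seconds[stage] += float(event.get("seconds", 0.0))


timings = StageTimings()
# Hypothetical events, supplied manually for illustration.
timings({"stage": "layout", "seconds": 0.5})
timings({"stage": "ocr", "seconds": 1.25})
timings({"stage": "layout", "seconds": 0.25})
print(dict(timings.seconds))
```

A callable object like this keeps its state between invocations, which makes it easy to inspect the totals once a conversion has finished.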

Edge Case Handling

  • Invalid page selections are validated against document page count.
  • Scanned pages can trigger OCR automatically when no extractable text is found.
  • Layout, table, and OCR model failures degrade gracefully with warnings instead of a hard crash.
  • Spans with broken text encoding are rasterized to preserve visual fidelity.
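
Validating a page selection against the page count can be sketched as follows. The function name and the 1-based page numbering are assumptions made for illustration; Reflow's own validation may differ.

```python
def validate_pages(selection: list[int], page_count: int) -> list[int]:
    """Return the selection unchanged if every page lies within 1..page_count.

    Hypothetical helper; assumes 1-based page numbers.
    """
    out_of_range = [page for page in selection if page < 1 or page > page_count]
    if out_of_range:
        raise ValueError(
            f"pages {out_of_range} are outside the document range 1..{page_count}"
        )
    return selection
```

Reporting every offending page in one error, rather than stopping at the first, saves the caller a round of trial and error.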

API

import reflow

html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,
    max_output_bytes=20 * 1024 * 1024,
)
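
The max_output_bytes limit above is given in bytes, not characters, and the two differ for non-ASCII text. A minimal sketch of such a size check, assuming the limit applies to the UTF-8 encoded HTML (the README does not say which encoding is measured, and check_output_size is a hypothetical name):

```python
def check_output_size(html: str, max_output_bytes: int) -> int:
    """Return the UTF-8 size of html, raising if it exceeds the limit.

    Hypothetical helper; assumes the limit is measured on UTF-8 bytes.
    """
    size = len(html.encode("utf-8"))
    if size > max_output_bytes:
        raise ValueError(
            f"output is {size} bytes, limit is {max_output_bytes} bytes"
        )
    return size
```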

Testing

pytest -q

With coverage gate (same as CI):

pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80

Golden snapshot tests:

pytest -m golden

Regenerate golden snapshots (only when the output change is intended):

REFLOW_UPDATE_GOLDENS=1 pytest -m golden
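
For readers unfamiliar with golden snapshot testing, the pattern can be sketched as below. Only the REFLOW_UPDATE_GOLDENS variable comes from this README; the helper name check_golden and the file layout are hypothetical, and the project's real test helpers may work differently.

```python
import os
from pathlib import Path


def check_golden(golden_path: Path, actual: str) -> None:
    """Compare actual output to a stored snapshot, or rewrite it on request.

    Hypothetical helper for illustration.
    """
    if os.environ.get("REFLOW_UPDATE_GOLDENS") == "1":
        # Update mode: overwrite the snapshot instead of comparing.
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, f"output differs from snapshot {golden_path}"
```

In update mode the check always passes, so regenerated snapshots should be reviewed in the diff before they are committed.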

Benchmarking

Single document:

reflow benchmark sample.pdf --repeat 3 --output-json bench.json

Directory benchmark:

reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json

Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
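
The min/median/max latency summary is straightforward to reproduce for your own measurements. The sketch below works on a plain list of per-run latencies in seconds; it does not read bench.json, whose schema is not described here, and the function name summarize is hypothetical.

```python
import statistics


def summarize(latencies_s: list[float]) -> dict[str, float]:
    """Return min, median, and max of per-run latencies in seconds."""
    return {
        "min": min(latencies_s),
        "median": statistics.median(latencies_s),
        "max": max(latencies_s),
    }


# Three made-up run times, for illustration only.
print(summarize([1.5, 1.0, 2.0]))
```

The median is less sensitive than the mean to a single slow run, such as a cold first run that loads models.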

CI

GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.

Notes

  • Requires Python 3.12 or newer.
  • Best results for complex layouts require optional ML dependencies.

