State-of-the-art document to HTML converter using deep learning

Reflow

Reflow converts PDF and DOCX files to semantic HTML with strong layout preservation.

Install

pip install -e .

Optional ML features (layout detection, OCR, table transformer):

pip install -e '.[ml]'

Quick Start

reflow convert input.pdf -o output.html

With OCR and guardrails:

reflow convert input.pdf \
	--ocr \
	--max-pages 200 \
	--max-output-mb 50 \
	--max-raster-mp 40 \
	--device cpu \
	-o output.html

Reliability Guardrails

Reflow supports resource limits that make production behavior predictable:

  • max_pages: fail fast when the document exceeds the configured page count.
  • max_output_bytes: fail fast when the generated HTML exceeds the configured size.
  • max_raster_megapixels: cap raster operations to avoid memory spikes.
  • Atomic output writes: output files are written to a temp file and then moved into place.
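
The guardrails above can be sketched with the standard library alone. This is a minimal illustration, not Reflow's actual implementation: the names LimitExceeded, check_output_size, raster_scale, and atomic_write_text are hypothetical, and measuring the output as UTF-8 bytes is an assumption.

```python
import math
import os
import tempfile


class LimitExceeded(RuntimeError):
    """Hypothetical error type; Reflow's real exception classes may differ."""


def check_output_size(html: str, max_output_bytes: int) -> int:
    """Fail fast if the UTF-8 encoded HTML is larger than the limit."""
    size = len(html.encode("utf-8"))
    if size > max_output_bytes:
        raise LimitExceeded(f"output is {size} bytes, limit is {max_output_bytes}")
    return size


def raster_scale(width_px: int, height_px: int, max_megapixels: float) -> float:
    """Return a linear scale factor <= 1.0 that keeps the image under the cap."""
    megapixels = width_px * height_px / 1_000_000
    if megapixels <= max_megapixels:
        return 1.0
    # Pixel area shrinks with the square of the linear scale.
    return math.sqrt(max_megapixels / megapixels)


def atomic_write_text(path: str, text: str) -> None:
    """Write to a temp file in the target directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no partial temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temp file in the destination directory keeps the final rename on one filesystem, so readers see either the old file or the complete new one.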

Observability

  • A structured diagnostics callback (diagnostics_callback) captures stage timings and pipeline stats.
  • The CLI prints a timing/stats summary by default; pass --no-emit-stats to suppress it (--emit-stats / --no-emit-stats).
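
A callback of this kind could aggregate timings per stage. The sketch below assumes each event is a dict with "stage" and "seconds" keys; the README does not document the payload, so that shape and the StageStats class are hypothetical.

```python
import json
from collections import defaultdict


class StageStats:
    """Collects per-stage timings from diagnostic events.

    Assumes events look like {"stage": "layout", "seconds": 0.42};
    the real payload passed to diagnostics_callback may differ.
    """

    def __init__(self) -> None:
        self.seconds = defaultdict(float)
        self.calls = defaultdict(int)

    def __call__(self, event: dict) -> None:
        # Accumulate total time and call count under the stage name.
        stage = event.get("stage", "unknown")
        self.seconds[stage] += float(event.get("seconds", 0.0))
        self.calls[stage] += 1

    def summary(self) -> str:
        """Return the collected stats as JSON, one entry per stage."""
        rows = {
            stage: {"calls": self.calls[stage], "seconds": round(self.seconds[stage], 3)}
            for stage in sorted(self.seconds)
        }
        return json.dumps(rows, indent=2)
```

If the callback does receive a single event argument, an instance could be passed as diagnostics_callback and summary() read afterwards; check the actual signature before relying on this.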

Edge Case Handling

  • Page selections are validated against the document's page count, and invalid selections are rejected.
  • Scanned pages can trigger OCR automatically when no extractable text is found.
  • Layout/table/OCR model failures degrade gracefully with warnings instead of a hard crash.
  • Text spans with broken encoding are rasterized to preserve visual fidelity.
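
Two of these checks are simple enough to sketch. The functions below are hypothetical illustrations, not Reflow's code; 1-based page numbers and the "whitespace only means scanned" rule are assumptions.

```python
def validate_pages(selection: list[int], page_count: int) -> list[int]:
    """Return the selection sorted and de-duplicated, or raise ValueError.

    Page numbers are treated as 1-based here, which is an assumption.
    """
    if page_count < 1:
        raise ValueError("document has no pages")
    invalid = sorted({p for p in selection if p < 1 or p > page_count})
    if invalid:
        raise ValueError(f"pages {invalid} are outside 1..{page_count}")
    return sorted(set(selection))


def needs_ocr(extracted_text: str) -> bool:
    """Treat a page as scanned when extraction yields only whitespace."""
    return not extracted_text.strip()
```

Reporting every out-of-range page in one error, rather than stopping at the first, lets the caller fix a bad selection in a single pass.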

API

import reflow

html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,
    max_output_bytes=20 * 1024 * 1024,
)

Testing

pytest -q

With coverage gate (same as CI):

pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80

Golden snapshot tests:

pytest -m golden

Update golden snapshots intentionally:

REFLOW_UPDATE_GOLDENS=1 pytest -m golden

Benchmarking

Single document:

reflow benchmark sample.pdf --repeat 3 --output-json bench.json

Directory benchmark:

reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json

Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
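
The latency summary can be reproduced from raw run times with the standard library. This sketch works on a plain list of seconds; the layout of bench.json is not documented here, so reading that file is left out, and summarize_latencies is a hypothetical helper.

```python
import statistics


def summarize_latencies(seconds: list[float]) -> dict[str, float]:
    """Return min/median/max for a non-empty list of run times in seconds."""
    if not seconds:
        raise ValueError("need at least one run")
    return {
        "min": min(seconds),
        "median": statistics.median(seconds),
        "max": max(seconds),
    }
```

The median is less sensitive than the mean to a single slow run, such as a first run that includes model loading, which is one reason to repeat each document more than once.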

CI

GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.

Notes

  • Requires Python 3.12 or later.
  • Best results on complex layouts require the optional ML dependencies (pip install -e '.[ml]').

Download files

Source distribution: pyreflow-0.4.1.tar.gz (170.9 kB)

Built distribution: pyreflow-0.4.1-py3-none-any.whl (74.7 kB)
