State-of-the-art document to HTML converter using deep learning

Reflow

State-of-the-art PDF/DOCX to semantic HTML converter with strong layout preservation.

Install

pip install -e .

Optional ML features (layout detection, OCR, table transformer):

pip install -e '.[ml]'

Quick Start

reflow convert input.pdf -o output.html

With OCR and guardrails:

reflow convert input.pdf \
	--ocr \
	--max-pages 200 \
	--max-output-mb 50 \
	--max-raster-mp 40 \
	--device cpu \
	-o output.html

Reliability Guardrails

Reflow now supports resource limits to make production behavior predictable:

  • max_pages: fail fast when document scope is too large.
  • max_output_bytes: fail fast when generated HTML exceeds configured size.
  • max_raster_megapixels: caps raster operations to avoid memory spikes.
  • Atomic output writes: output files are written safely via temp-file replace.
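
The temp-file replace pattern mentioned in the last bullet can be sketched with the standard library alone. This is a minimal illustration of the general technique, not Reflow's actual implementation; the helper name write_atomic is hypothetical.

```python
import os
import tempfile


def write_atomic(path: str, text: str) -> None:
    """Write text to path so readers never observe a partially written file.

    Hypothetical helper: illustrates the temp-file replace technique only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites an existing destination; the rename is atomic on POSIX.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies midway, the previous output file (if any) is left untouched and only a stray .tmp file may remain.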

Observability

  • A structured diagnostics callback (diagnostics_callback) can capture stage timings and pipeline stats.
  • CLI prints timing/stats summary by default (--emit-stats / --no-emit-stats).

Edge Case Handling

  • Invalid page selections are validated against document page count.
  • Scanned pages can trigger OCR automatically when no extractable text is found.
  • Layout/table/OCR model failures degrade gracefully with warnings instead of a hard crash.
  • Broken text encoding spans are rasterized to preserve visual fidelity.
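
Validating a page selection against the page count, as in the first bullet, might look roughly like this. The function name validate_pages and the 1-based page numbering are assumptions made for illustration; Reflow's own validation may differ.

```python
def validate_pages(selection: list[int], page_count: int) -> list[int]:
    """Return the selection sorted and de-duplicated, or raise ValueError.

    Hypothetical helper; assumes 1-based page numbers.
    """
    if page_count < 1:
        raise ValueError("document has no pages")
    # Collect every offending page so the error message is actionable.
    out_of_range = [p for p in selection if p < 1 or p > page_count]
    if out_of_range:
        raise ValueError(
            f"pages {out_of_range} are outside the valid range 1..{page_count}"
        )
    return sorted(set(selection))
```

Reporting all invalid pages at once, rather than stopping at the first, saves the caller repeated round trips.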

API

import reflow

html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,
    max_output_bytes=20 * 1024 * 1024,
)

Testing

pytest -q

With coverage gate (same as CI):

pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80

Golden snapshot tests:

pytest -m golden

Update golden snapshots intentionally:

REFLOW_UPDATE_GOLDENS=1 pytest -m golden
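
For readers unfamiliar with golden snapshot testing, the pattern is roughly the following. This is a generic sketch, not the project's test code: the helper check_golden is hypothetical, and only the REFLOW_UPDATE_GOLDENS variable name comes from the command above.

```python
import os
from pathlib import Path


def check_golden(actual: str, golden_path: Path) -> None:
    """Compare output with a stored snapshot, or rewrite the snapshot on request.

    Hypothetical helper illustrating the golden-file pattern.
    """
    if os.environ.get("REFLOW_UPDATE_GOLDENS") == "1":
        # Update mode: record the current output as the new expected snapshot.
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, f"output differs from snapshot {golden_path}"
```

Updated snapshots should be reviewed in the diff like any other change, since update mode accepts whatever the converter currently produces.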

Benchmarking

Single document:

reflow benchmark sample.pdf --repeat 3 --output-json bench.json

Directory benchmark:

reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json

Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
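
The latency summary can be reproduced from raw per-run timings with the standard library. The sketch below assumes the timings are available as a list of seconds; the function name and the returned keys are illustrative and do not describe the actual schema of bench.json.

```python
import statistics


def summarize_latencies(seconds: list[float]) -> dict[str, float]:
    """Return min/median/max for a list of per-run latencies in seconds.

    Hypothetical helper; the key names are not taken from bench.json.
    """
    if not seconds:
        raise ValueError("at least one run is required")
    return {
        "min": min(seconds),
        "median": statistics.median(seconds),
        "max": max(seconds),
    }
```

The median is less sensitive than the mean to a single slow run, such as a cold start that loads ML models.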

CI

GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.

Notes

  • Python 3.12+
  • Best results for complex layouts require optional ML dependencies.

Release files

  • Source distribution: pyreflow-0.4.0.tar.gz (169.5 kB)
  • Built distribution (Python 3 wheel): pyreflow-0.4.0-py3-none-any.whl (73.4 kB)

File hashes

pyreflow-0.4.0.tar.gz (169.5 kB, source; uploaded via twine/6.2.0 CPython/3.14.2; Trusted Publishing: no)

  • SHA256: 18c24ab97ac4692ccdf77c96d514668cd01fe3389cb8d59bcec8e9f318029fb5
  • MD5: 97f152c65c3bd43c601521cb7f8b6ba3
  • BLAKE2b-256: 3ba1b38a37a2d04cbae3dfa118283f100e6fe19895c723fa3b1e23cdb43a297d

pyreflow-0.4.0-py3-none-any.whl (73.4 kB, Python 3; uploaded via twine/6.2.0 CPython/3.14.2; Trusted Publishing: no)

  • SHA256: f55c1373a5141004ee94ae2dcc8a2b564aba2e19b9fb3a36f14a5b93e2a6954b
  • MD5: 4e6a691f62d039e8294e3079cb0b44fd
  • BLAKE2b-256: 26cc58ced818537de7e1572d9b5a54eb2eba49f686ff5c281825608410384599
