State-of-the-art document to HTML converter using deep learning
Reflow
State-of-the-art PDF/DOCX to semantic HTML converter with strong layout preservation.
Install
pip install -e .
Optional ML features (layout detection, OCR, table transformer):
pip install -e '.[ml]'
Quick Start
reflow convert input.pdf -o output.html
With OCR and guardrails:
reflow convert input.pdf \
  --ocr \
  --max-pages 200 \
  --max-output-mb 50 \
  --max-raster-mp 40 \
  --device cpu \
  -o output.html
Reliability Guardrails
Reflow now supports resource limits to make production behavior predictable:
- max_pages: fail fast when the document has more pages than the configured limit.
- max_output_bytes: fail fast when the generated HTML exceeds the configured size.
- max_raster_megapixels: caps raster operations to avoid memory spikes.
- Atomic output writes: output files are written safely via temp-file replace.
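The size check and the temp-file replace pattern can be sketched in plain Python. This is a minimal illustration of the general technique, not Reflow's actual implementation; the helper name `write_atomic` and its signature are hypothetical.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def write_atomic(path: str | Path, html: str, max_output_bytes: int | None = None) -> int:
    """Write html to path atomically and return the number of bytes written.

    Hypothetical helper for illustration; not part of the reflow API.
    """
    data = html.encode("utf-8")
    # Fail fast before touching the filesystem if the output is too large.
    if max_output_bytes is not None and len(data) > max_output_bytes:
        raise ValueError(f"output is {len(data)} bytes, limit is {max_output_bytes}")

    target = Path(path)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Swap the finished file into place in a single step.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial temp file; an existing target is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return len(data)
```

Because the rename only happens after the data is fully written, a reader of the target path sees either the previous file or the complete new one, never a partial write.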
Observability
- Structured diagnostics callback (diagnostics_callback) can capture stage timings and pipeline stats.
- CLI prints timing/stats summary by default (--emit-stats/--no-emit-stats).
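A callback that accumulates stage timings might look like the sketch below. The event shape is an assumption: the README does not document what the callback receives, so the sketch assumes one dict per event with hypothetical `"stage"` and `"seconds"` keys and ignores anything else. Check the real payload before relying on it.

```python
from collections import defaultdict
from typing import Any


class StageTimings:
    """Accumulates per-stage durations from diagnostics events.

    Assumption: each event is a dict with "stage" and "seconds" keys.
    The actual payload passed by reflow may differ.
    """

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    def __call__(self, event: dict[str, Any]) -> None:
        stage = event.get("stage")
        seconds = event.get("seconds")
        # Skip events that do not carry timing information.
        if stage is None or seconds is None:
            return
        self.totals[str(stage)] += float(seconds)

    def slowest(self) -> tuple[str, float]:
        """Return the (stage, total_seconds) pair with the largest total."""
        return max(self.totals.items(), key=lambda item: item[1])


# Hypothetical usage, assuming convert() takes the callback as a keyword argument:
#   timings = StageTimings()
#   html = reflow.convert("sample.pdf", diagnostics_callback=timings)
#   print(timings.slowest())
```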
Edge Case Handling
- Invalid page selections are validated against document page count.
- Scanned pages can trigger OCR automatically when no extractable text is found.
- Layout/table/OCR model failures degrade gracefully with warnings instead of a hard crash.
- Broken text encoding spans are rasterized to preserve visual fidelity.
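Validating a page selection against the page count, the first item above, can be sketched as follows. The function name and the 1-based numbering are assumptions made for illustration; the README does not state how Reflow represents page selections.

```python
def validate_pages(pages: list[int], page_count: int) -> list[int]:
    """Return the selection as sorted, unique page numbers or raise ValueError.

    Hypothetical helper; assumes 1-based page numbers.
    """
    if page_count < 1:
        raise ValueError("document has no pages")
    # Collect every out-of-range page so the error reports all of them at once.
    invalid = sorted({p for p in pages if p < 1 or p > page_count})
    if invalid:
        raise ValueError(f"pages {invalid} are outside 1..{page_count}")
    return sorted(set(pages))
```

Reporting every invalid page in a single error avoids making the caller fix one value per run.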
API
import reflow
html = reflow.convert(
    "sample.pdf",
    ocr=True,
    layout_detection=True,
    table_recognition=True,
    max_pages=100,
    max_output_bytes=20 * 1024 * 1024,
)
Testing
pytest -q
With coverage gate (same as CI):
pytest --cov=reflow --cov-report=term-missing --cov-fail-under=80
Golden snapshot tests:
pytest -m golden
Update golden snapshots intentionally:
REFLOW_UPDATE_GOLDENS=1 pytest -m golden
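For readers unfamiliar with the pattern, a golden snapshot check driven by an environment variable commonly looks like the sketch below. This shows the general technique only; the helper `check_golden` is hypothetical and is not taken from Reflow's test suite.

```python
import os
from pathlib import Path


def check_golden(actual: str, golden_path: Path) -> None:
    """Compare actual output with the stored snapshot, or rewrite the snapshot.

    Hypothetical helper: when REFLOW_UPDATE_GOLDENS=1 is set, the snapshot
    is overwritten instead of compared.
    """
    if os.environ.get("REFLOW_UPDATE_GOLDENS") == "1":
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return
    expected = golden_path.read_text(encoding="utf-8")
    assert actual == expected, f"output differs from {golden_path}"
```

Keeping the update path behind an explicit variable means a snapshot only changes when someone asks for it, so the resulting diff can be reviewed like any other change.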
Benchmarking
Single document:
reflow benchmark sample.pdf --repeat 3 --output-json bench.json
Directory benchmark:
reflow benchmark ./pdfs --glob "*.pdf" --repeat 2 --output-json bench.json
Benchmark output includes min/median/max latency, output-size stats, and per-run stage diagnostics.
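The latency summary is ordinary descriptive statistics over the repeated runs. The sketch below shows how min/median/max can be computed from a list of per-run timings. It does not read bench.json, whose schema is not documented here, and the function name is hypothetical.

```python
import statistics


def summarize_latencies(seconds: list[float]) -> dict[str, float]:
    """Return min, median, and max of per-run latencies in seconds."""
    if not seconds:
        raise ValueError("need at least one run")
    return {
        "min": min(seconds),
        "median": statistics.median(seconds),
        "max": max(seconds),
    }
```

With an even number of runs, such as `--repeat 2`, `statistics.median` returns the mean of the two middle values.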
CI
GitHub Actions CI is configured in .github/workflows/ci.yml and runs tests with coverage threshold enforcement.
Notes
- Python 3.12+
- Best results for complex layouts require optional ML dependencies.
File details
Details for the file pyreflow-0.4.1.tar.gz.
File metadata
- Download URL: pyreflow-0.4.1.tar.gz
- Upload date:
- Size: 170.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6258036cec9e7b9e40481b1a2f5eacd12d02f2486d0fce7c3f2ae98921cf70d9 |
| MD5 | 402f4c13b1af1a658e736f0328603063 |
| BLAKE2b-256 | e0df985b26f6f3a41f8b5a939ef0748d9aadd26092f87ae4fa4cb9b9f667af46 |
File details
Details for the file pyreflow-0.4.1-py3-none-any.whl.
File metadata
- Download URL: pyreflow-0.4.1-py3-none-any.whl
- Upload date:
- Size: 74.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1d3dd4cc9dab4dc4b9327aa9ee5906ce2ae9ab54d1b4838a91099a337f30e19 |
| MD5 | 0ea7ace1fd117361964235e8ae2417d2 |
| BLAKE2b-256 | 40524019f0d30a6527d83eb807927461fe5cebbabc03289d763ba76295e313cd |