Chop large PDFs into page windows, convert each with docling, and reassemble to Markdown
doctape pdf to markdown
Converts large PDFs to Markdown by chopping them into page windows, running each window through docling, and reassembling the results. Per-window processing keeps memory bounded, shows progress, and makes long jobs resumable after a crash.
Installation
pip install doctape
This installs the fast layout-based pipeline. For OCR on scanned or cover-art pages, install the extra:
pip install "doctape[ocr]"
Usage
Put PDFs in docs/ and name the one to convert:
doctape complan.pdf
The PDF is converted in 20-page windows. Per-window Markdown lands in out/chunks/<name>/, and the reassembled document in out/<name>.md with <!-- pages NNNN-NNNN --> markers between windows.
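The windowing described above can be sketched in a few lines. This is an illustration only, not doctape's actual code: the helper names `page_windows` and `window_marker` are hypothetical, and both the 1-based inclusive page numbering and the four-digit zero padding are assumptions read off the `NNNN-NNNN` marker format.

```python
def page_windows(total_pages: int, chunk_size: int = 20) -> list[tuple[int, int]]:
    """Return (first, last) page ranges that cover the whole document.

    Pages are assumed to be 1-based and ranges inclusive; the final
    window is shortened when the page count is not a multiple of chunk_size.
    """
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def window_marker(first: int, last: int) -> str:
    """Format a window marker; four-digit padding is assumed from NNNN-NNNN."""
    return f"<!-- pages {first:04d}-{last:04d} -->"


windows = page_windows(45)
print(windows)                      # [(1, 20), (21, 40), (41, 45)]
print(window_marker(*windows[-1]))  # <!-- pages 0041-0045 -->
```

With the default window size, a 45-page PDF yields three windows, the last one covering only five pages.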
Options
| Argument | Default | Meaning |
|---|---|---|
| `pdf` | required | PDF to convert (filename under `--docs-dir`, or a path) |
| `--docs-dir` | `docs` | Directory of source PDFs |
| `--out-dir` | `out` | Output directory |
| `--chunk-size` | `20` | Pages per window |
| `--force` | off | Re-convert chunks that already exist |
| `--ocr` | off | Force EasyOCR (requires the `ocr` extra, much slower) |
Resuming
A chunk with an existing non-empty .md is skipped, so re-running the same command after an interruption picks up where it stopped. Reassembly always reflects whatever chunks are on disk. Use --force to redo chunks.
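The skip rule and the reassembly step might look roughly like the sketch below. The function names `needs_conversion` and `reassemble` are hypothetical, and so are the assumptions that chunk files sort into page order by name and that chunks are joined with a blank line; the page markers are left out for brevity.

```python
from pathlib import Path


def needs_conversion(chunk_md: Path, force: bool = False) -> bool:
    """Decide whether a chunk must be (re)converted.

    A chunk is skipped only when its .md file exists and is non-empty,
    unless force is set, which redoes every chunk.
    """
    if force:
        return True
    return not (chunk_md.is_file() and chunk_md.stat().st_size > 0)


def reassemble(chunk_dir: Path, target: Path) -> int:
    """Join whatever non-empty chunks are on disk and return how many were used.

    Assumes chunk file names sort into page order (an assumption of this sketch).
    """
    parts = [
        path.read_text(encoding="utf-8")
        for path in sorted(chunk_dir.glob("*.md"))
        if path.stat().st_size > 0
    ]
    target.write_text("\n\n".join(parts), encoding="utf-8")
    return len(parts)
```

Treating an empty file as missing matters after a crash: a window that was interrupted mid-write is redone on the next run instead of being mistaken for finished work.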
OCR
The default pipeline uses layout and table-structure detection without OCR, which is fast and accurate on digital-native PDFs. Scanned pages and stylized cover art will need OCR: install doctape[ocr] and pass --ocr. OCR is much slower on CPU, so reserve it for documents that need it and write OCR output to a separate --out-dir when comparing against a non-OCR run.
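Once an OCR run and a non-OCR run have been written to separate output directories, the two reassembled documents can be compared with the standard library. The sketch below is not part of doctape: `diff_markdown` is a hypothetical helper, and `out-ocr` is only an example directory name used to label the diff.

```python
import difflib


def diff_markdown(plain: str, ocr: str, context: int = 1) -> list[str]:
    """Return a unified diff between the non-OCR and OCR Markdown text."""
    return list(
        difflib.unified_diff(
            plain.splitlines(),
            ocr.splitlines(),
            fromfile="out/complan.md",      # non-OCR run
            tofile="out-ocr/complan.md",    # OCR run (example directory name)
            n=context,
            lineterm="",
        )
    )


# Inline strings stand in for the two files' contents.
plain_text = "Introduction\nBody text\n"
ocr_text = "# Annual Report\nIntroduction\nBody text\n"
for line in diff_markdown(plain_text, ocr_text):
    print(line)
```

Lines prefixed with `+` appear only in the OCR output, which is typically where text recovered from scanned pages or cover art would show up.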
Python API
from pathlib import Path
from doctape import build_converter, convert_pdf
converter = build_converter(ocr=False)
convert_pdf(Path("docs/complan.pdf"), Path("out"), chunk_size=20,
            force=False, converter=converter)