Chop large PDFs into page windows, convert each with docling, and reassemble to Markdown

doctape: PDF to Markdown

Converts large PDFs to Markdown by chopping them into page windows, running each window through docling, and reassembling the results. Per-window processing keeps memory bounded, shows progress, and makes long jobs resumable after a crash.

Installation

pip install doctape

This installs the fast layout-based pipeline. For OCR on scanned or cover-art pages, install the extra:

pip install "doctape[ocr]"

Usage

Put PDFs in docs/ and name the one to convert:

doctape complan.pdf

The PDF is converted in 20-page windows. Per-window Markdown lands in out/chunks/<name>/, and the reassembled document in out/<name>.md with <!-- pages NNNN-NNNN --> markers between windows.
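The windowing described above can be sketched as follows. This is a minimal illustration, not doctape's actual code: the helper names `page_windows` and `window_marker` are hypothetical, and the four-digit zero padding is an assumption read off the `NNNN-NNNN` pattern in the marker.

```python
def page_windows(total_pages: int, chunk_size: int = 20) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def window_marker(first: int, last: int) -> str:
    """Format the HTML comment placed between windows (padding width is assumed)."""
    return f"<!-- pages {first:04d}-{last:04d} -->"


windows = page_windows(45, chunk_size=20)
print(windows)                      # [(1, 20), (21, 40), (41, 45)]
print(window_marker(*windows[-1]))  # <!-- pages 0041-0045 -->
```

The last window is simply shorter when the page count is not a multiple of the chunk size.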

Options

| Argument | Default | Meaning |
|---|---|---|
| pdf | required | PDF to convert (filename under --docs-dir, or a path) |
| --docs-dir | docs | Directory of source PDFs |
| --out-dir | out | Output directory |
| --chunk-size | 20 | Pages per window |
| --force | off | Re-convert chunks that already exist |
| --ocr | off | Force EasyOCR (requires the ocr extra, much slower) |

Resuming

A chunk with an existing non-empty .md is skipped, so re-running the same command after an interruption picks up where it stopped. Reassembly always reflects whatever chunks are on disk. Use --force to redo chunks.
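The skip rule and the reassembly from disk might look roughly like the sketch below. It is written under assumptions: `needs_conversion` and `reassemble` are hypothetical names, and the chunk files are assumed to sort in page order by filename (for example with zero-padded page numbers). doctape's real layout inside out/chunks/<name>/ may differ.

```python
from pathlib import Path


def needs_conversion(chunk_md: Path, force: bool = False) -> bool:
    """A chunk is redone if forced, missing, or empty; otherwise it is skipped."""
    if force:
        return True
    return not (chunk_md.is_file() and chunk_md.stat().st_size > 0)


def reassemble(chunk_dir: Path) -> str:
    """Join whatever non-empty chunk files are on disk, in filename order."""
    parts = [
        path.read_text(encoding="utf-8")
        for path in sorted(chunk_dir.glob("*.md"))
        if path.stat().st_size > 0
    ]
    return "\n\n".join(parts)
```

Treating an empty file as "not done" means a chunk that was created but never written to before a crash is converted again on the next run.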

OCR

The default pipeline uses layout and table-structure detection without OCR, which is fast and accurate on digital-native PDFs. Scanned pages and stylized cover art need OCR: install doctape[ocr] and pass --ocr. OCR is much slower on CPU, so reserve it for documents that need it. When comparing against a non-OCR run, write the OCR output to a separate --out-dir so the two sets of chunks do not overwrite each other.

Python API

from pathlib import Path
from doctape import build_converter, convert_pdf

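# Build the converter once; ocr=False selects the default pipeline without OCR.
converter = build_converter(ocr=False)
# Convert in 20-page windows; force=False leaves existing chunks untouched.
convert_pdf(Path("docs/complan.pdf"), Path("out"), chunk_size=20,
            force=False, converter=converter)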
