High-performance PDF processing: extract text, tables, images with a Rust + C core.

botl-pdf

High-performance PDF text extraction library with a custom Rust core and Python bindings. No dependency on poppler, pdfium, or pdfbox — the entire PDF parsing and text extraction pipeline is written from scratch.

Features

  • Fast text extraction with layout analysis
  • Character-level output with bounding boxes, fonts, colors, and styles
  • Layout-preserving text extraction (spatial whitespace)
  • Table of contents (TOC/outline) extraction with page numbers
  • Document metadata extraction (title, author, dates, etc.)
  • Geometric element extraction (lines, rectangles)
  • Configurable layout parameters (word spacing, line grouping, reading order)
  • Run-aware de-interleaving for correct reading order on complex PDFs
  • Pythonic API with type hints throughout
  • CLI for common operations
  • Zero external PDF library dependencies

Install

pip install botl-pdf

Build from source (requires Rust toolchain):

pip install maturin
git clone https://github.com/botl-pdf/botl-pdf.git
cd botl-pdf
maturin develop --release

Quick Start

import botl_pdf

doc = botl_pdf.open("report.pdf")
text = doc.pages[0].extract_text()
print(text)

Opening Documents

From a file path

import botl_pdf

doc = botl_pdf.open("report.pdf")
print(f"Pages: {doc.num_pages}")
print(f"Encrypted: {doc.is_encrypted}")

From bytes

with open("report.pdf", "rb") as f:
    data = f.read()

doc = botl_pdf.open(data)
print(f"Pages: {doc.num_pages}")

As a context manager

with botl_pdf.open("report.pdf") as doc:
    text = doc.pages[0].extract_text()

Text Extraction

Plain text (default)

Returns clean, readable text. Blocks are separated by double newlines, lines by single newlines, words by spaces.

doc = botl_pdf.open("report.pdf")

# Single page
text = doc.pages[0].extract_text()
print(text)

# All pages
for page in doc.pages:
    print(page.extract_text())

# Subscript access (0-based, supports negative)
text_last = doc.pages[-1].extract_text()
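
Because the plain-text output uses a fixed separator scheme (double newlines between blocks, single newlines between lines, spaces between words), it can be split back into a nested structure with ordinary string operations. The sketch below runs on a hard-coded sample string rather than a real page, and `split_blocks` is a hypothetical helper, not part of botl-pdf.

```python
from typing import List


def split_blocks(text: str) -> List[List[List[str]]]:
    """Split extracted text into blocks -> lines -> words."""
    blocks = []
    for block in text.split("\n\n"):
        # Drop empty lines so stray whitespace does not create empty entries
        lines = [line.split() for line in block.split("\n") if line.strip()]
        if lines:
            blocks.append(lines)
    return blocks


# Sample standing in for the return value of page.extract_text()
sample = "Annual Report\n2024 Edition\n\nRevenue grew\nin every region."
blocks = split_blocks(sample)

for i, block in enumerate(blocks):
    word_count = sum(len(line) for line in block)
    print(f"Block {i}: {len(block)} lines, {word_count} words")
```

On a real document, `sample` would be replaced by the string returned from `extract_text()`.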

Layout-preserving text

Maintains spatial positioning using proportional spaces between words. Useful when you need to preserve visual alignment of columns, tables, or indented text.

doc = botl_pdf.open("financial_report.pdf")
page = doc.pages[0]

# Layout mode preserves spatial whitespace
layout_text = page.extract_text(layout=True)
print(layout_text)
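
When layout mode keeps columns visually apart, one simple way to recover cells is to split each line at runs of two or more spaces. This is a minimal sketch on a made-up sample string; how much padding a given PDF produces is an assumption, and `split_columns` is a hypothetical helper rather than a botl-pdf function.

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split one layout-mode line into cells at runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


# Sample standing in for page.extract_text(layout=True)
sample = (
    "Item            Q1        Q2\n"
    "Revenue         120       135\n"
    "Operating cost  80        90\n"
)

rows = [split_columns(line) for line in sample.splitlines() if line.strip()]
for row in rows:
    print(row)
```

Single spaces inside a cell ("Operating cost") are preserved; only wider gaps are treated as column boundaries.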

Tuning extraction parameters

import botl_pdf

doc = botl_pdf.open("two_column.pdf")

# Tighter word grouping (merge chars closer together)
params = botl_pdf.LayoutParams(
    word_margin=1.5,   # max horizontal gap in same word (× font_size), default 2.0
    line_margin=0.5,   # max vertical gap in same block (× line height), default 0.5
    boxes_flow=0.5,    # reading order: 0.0=horizontal, 1.0=vertical, default 0.5
)

text = doc.pages[0].extract_text(layout=True, layout_params=params)
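
To make the effect of `word_margin` concrete, the sketch below applies the documented rule (a gap wider than `word_margin` × font size starts a new word) to a handful of stand-in characters. `FakeChar` and `group_words` are hypothetical names used only for illustration; the library's real layout engine is more involved and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeChar:
    """Stand-in for a character; only the fields this sketch needs."""
    text: str
    x0: float
    x1: float
    font_size: float


def group_words(chars: List[FakeChar], word_margin: float) -> List[str]:
    """Start a new word when the gap to the previous char exceeds
    word_margin * font_size."""
    words: List[str] = []
    prev = None
    for ch in sorted(chars, key=lambda c: c.x0):
        gap_too_big = (
            prev is not None
            and (ch.x0 - prev.x1) > word_margin * ch.font_size
        )
        if prev is None or gap_too_big:
            words.append(ch.text)
        else:
            words[-1] += ch.text
        prev = ch
    return words


# "ab", then a 4 pt gap, then "cd", all set at 10 pt
chars = [
    FakeChar("a", 0, 5, 10), FakeChar("b", 5, 10, 10),
    FakeChar("c", 14, 19, 10), FakeChar("d", 19, 24, 10),
]
loose = group_words(chars, word_margin=0.5)  # threshold: 5 pt
tight = group_words(chars, word_margin=0.3)  # threshold: 3 pt
print(loose, tight)
```

With the 5 pt threshold the 4 pt gap is absorbed and the result is one word; with the 3 pt threshold the same gap splits the text into two words.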

Exporting entire documents

from botl_pdf.export import to_text, to_markdown

# Plain text for all pages
full_text = to_text("report.pdf")

# Layout-preserved text
full_text_layout = to_text("report.pdf", layout=True)

# Markdown (pages separated by horizontal rules)
markdown = to_markdown("report.pdf")

# Specific page range only
markdown_subset = to_markdown("report.pdf", pages=range(0, 5))

Character-Level Access

Each page exposes individual characters with full style information: bounding box, font name, font size, bold/italic flags, fill and stroke colors, rotation, and run ID.

Inspecting individual characters

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

for char in page.chars[:5]:
    print(f"  char={char.text!r}  "
          f"pos=({char.bbox.x0:.1f}, {char.bbox.y0:.1f})  "
          f"size={char.font_size:.1f}  "
          f"font={char.font_name}")

Output:

  char='H'  pos=(100.0, 700.0)  size=12.0  font=F1
  char='e'  pos=(108.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(115.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(120.0, 700.0)  size=12.0  font=F1
  char='o'  pos=(125.0, 700.0)  size=12.0  font=F1

Finding text by style

# Find all bold characters on page 0
bold_chars = [c for c in doc.pages[0].chars if c.bold]
bold_text = "".join(c.text for c in bold_chars)

# Find characters in a specific color (e.g., red links)
red_chars = [
    c for c in doc.pages[0].chars
    if c.color and c.color[0] > 0.8 and c.color[1] < 0.2 and c.color[2] < 0.2
]

# Find large decorative initials (font size > 30)
initials = [c for c in doc.pages[0].chars if c.font_size > 30]
for c in initials:
    print(f"Decorative initial: {c.text!r} at size {c.font_size:.0f}")

Extracting text from a region

# Get all text in a specific rectangular area
x0, y0, x1, y1 = 100.0, 600.0, 400.0, 700.0

region_chars = [
    c for c in doc.pages[0].chars
    if c.bbox.x0 >= x0 and c.bbox.x1 <= x1
    and c.bbox.y0 >= y0 and c.bbox.y1 <= y1
]
region_text = "".join(c.text for c in region_chars)
print(region_text)
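
The strict containment test above drops characters that straddle the region boundary. A common alternative is to test the center of each bounding box instead. The sketch below uses stand-in `Box` and `FakeChar` classes (hypothetical, mirroring only the `bbox.x0/y0/x1/y1` and `text` attributes used above) so it runs without a PDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class FakeChar:
    text: str
    bbox: Box


def chars_in_region(chars: List[FakeChar], region: Box) -> List[FakeChar]:
    """Keep chars whose bbox center lies inside region (edges inclusive)."""
    # Sorting the bounds makes the test independent of y-axis direction
    lo_x, hi_x = sorted((region.x0, region.x1))
    lo_y, hi_y = sorted((region.y0, region.y1))
    picked = []
    for c in chars:
        cx = (c.bbox.x0 + c.bbox.x1) / 2
        cy = (c.bbox.y0 + c.bbox.y1) / 2
        if lo_x <= cx <= hi_x and lo_y <= cy <= hi_y:
            picked.append(c)
    return picked


chars = [
    FakeChar("A", Box(98, 650, 106, 662)),   # straddles left edge, center inside
    FakeChar("B", Box(110, 650, 118, 662)),  # fully inside
    FakeChar("C", Box(410, 650, 418, 662)),  # outside
]
region = Box(100, 600, 400, 700)
text = "".join(c.text for c in chars_in_region(chars, region))
print(text)
```

The same function should work on `page.chars` as long as each character exposes a `bbox` with these four attributes.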

Run ID tracking

Characters from the same text-showing operation (Tj/TJ) share a run_id. This lets you group characters by their PDF text operation — useful for debugging extraction issues or understanding the PDF's internal structure.

from collections import defaultdict

# Group characters by their source text operation
runs = defaultdict(str)
for c in doc.pages[0].chars:
    runs[c.run_id] += c.text

for run_id, text in sorted(runs.items()):
    print(f"  Run {run_id}: {text[:60]!r}")

Document Metadata

doc = botl_pdf.open("report.pdf")

meta = doc.metadata
print(f"Title:    {meta.get('title')}")
print(f"Author:   {meta.get('author')}")
print(f"Subject:  {meta.get('subject')}")
print(f"Creator:  {meta.get('creator')}")
print(f"Producer: {meta.get('producer')}")
print(f"Created:  {meta.get('creation_date')}")
print(f"Modified: {meta.get('mod_date')}")
print(f"Version:  {meta.get('version')}")
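
PDF files store dates as strings of the form `D:YYYYMMDDHHmmSS` followed by an optional UTC offset such as `+01'00'`. Whether botl-pdf returns `creation_date` and `mod_date` in this raw form or already parsed is not stated here, so the sketch below is only relevant under the assumption that raw strings come back. `parse_pdf_date` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in the PDF date format
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str):
    """Parse a raw PDF date string; return None if it does not match."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        return None
    # Missing fields default to January 1st, midnight
    year, month, day, hour, minute, second = (
        int(g) if g else default
        for g, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20240115093000+01'00'")
print(created.isoformat())
```

A value that does not follow the pattern yields `None`, which lets callers fall back to the raw string.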

Table of Contents

doc = botl_pdf.open("book.pdf")

toc = doc.toc
for entry in toc:
    indent = "  " * entry.level
    page = entry.page_number
    print(f"{indent}{entry.title}  →  page {page}")

Output:

Preface  →  page 5
  Acknowledgments  →  page 7
Part I. Foundations  →  page 11
  Chapter 1. Introduction  →  page 13
  Chapter 2. Methods  →  page 27
Part II. Results  →  page 45
  Chapter 3. Analysis  →  page 47
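
Since the outline is a flat list with a `level` per entry, it can be turned into a nested tree with a small stack. The sketch below takes plain `(level, title, page_number)` tuples as a stand-in for `TOCEntry` objects; `Node` and `build_tree` are hypothetical names, not part of the library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    title: str
    page_number: Optional[int]
    children: List["Node"] = field(default_factory=list)


def build_tree(entries: List[Tuple[int, str, Optional[int]]]) -> List[Node]:
    """Nest flat (level, title, page_number) tuples by level using a stack."""
    roots: List[Node] = []
    stack: List[Tuple[int, Node]] = []  # (level, node) path to current entry
    for level, title, page in entries:
        node = Node(title, page)
        # Pop siblings and deeper entries until the top is a parent
        while stack and stack[-1][0] >= level:
            stack.pop()
        (stack[-1][1].children if stack else roots).append(node)
        stack.append((level, node))
    return roots


entries = [
    (0, "Preface", 5),
    (1, "Acknowledgments", 7),
    (0, "Part I. Foundations", 11),
    (1, "Chapter 1. Introduction", 13),
    (1, "Chapter 2. Methods", 27),
]
tree = build_tree(entries)
for root in tree:
    print(root.title, [child.title for child in root.children])
```

With a real document, the tuples could be built as `[(e.level, e.title, e.page_number) for e in doc.toc]`.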

Building a page lookup from TOC

# Map page numbers to their chapter titles
chapters = {}
current_chapter = None
for entry in doc.toc:
    if entry.level == 0 and entry.page_number is not None:
        current_chapter = entry.title
    if current_chapter and entry.page_number is not None:
        chapters[entry.page_number] = current_chapter

# Find which chapter a page belongs to
from bisect import bisect_right

def chapter_for_page(page_idx):
    page_nums = sorted(chapters)
    # Index of the last TOC page at or before page_idx (-1 if there is none,
    # which also covers an empty TOC)
    i = bisect_right(page_nums, page_idx) - 1
    return chapters[page_nums[i]] if i >= 0 else None

print(f"Page 30 is in: {chapter_for_page(30)}")

Geometric Elements

Pages expose geometric lines and rectangles drawn on the PDF canvas — useful for detecting table borders, rules, decorative elements, and form fields.

Lines

page = doc.pages[0]

for line in page.lines:
    print(f"  Line ({line.x0:.1f},{line.y0:.1f}) → ({line.x1:.1f},{line.y1:.1f})  "
          f"width={line.line_width:.1f}")

Rectangles

for rect in page.rects:
    fill = rect.fill_color
    stroke = rect.stroke_color
    print(f"  Rect ({rect.bbox.x0:.1f},{rect.bbox.y0:.1f})-"
          f"({rect.bbox.x1:.1f},{rect.bbox.y1:.1f})  "
          f"stroke={stroke}  fill={fill}")

Detecting horizontal rules

# Find horizontal lines (useful for detecting separators/tables)
h_rules = [
    line for line in page.lines
    if abs(line.y1 - line.y0) < 1.0 and abs(line.x1 - line.x0) > 50.0
]

for rule in h_rules:
    print(f"Horizontal rule at y={rule.y0:.1f} from x={rule.x0:.1f} to x={rule.x1:.1f}")
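
Once horizontal rules are isolated, a typical next step for table detection is to merge rules that sit at nearly the same height and treat the gaps between them as row bands. The sketch below is one simple way to do that, using `(x0, y0, x1, y1)` tuples as a stand-in for line objects; `cluster_positions` and the 2 pt tolerance are illustrative assumptions.

```python
from typing import List, Sequence


def cluster_positions(values: Sequence[float], tolerance: float = 2.0) -> List[float]:
    """Merge nearby coordinates; return the mean of each cluster, ascending."""
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


# (x0, y0, x1, y1) tuples standing in for horizontal lines on a page
h_rules = [
    (72, 700.0, 540, 700.0),
    (72, 700.5, 540, 700.5),  # same rule drawn twice, half a point apart
    (72, 680.0, 540, 680.0),
    (72, 660.0, 540, 660.0),
]

row_edges = cluster_positions([y0 for _, y0, _, _ in h_rules])
row_bands = list(zip(row_edges, row_edges[1:]))
print(row_edges)
print(row_bands)
```

Applying the same clustering to the x-positions of vertical lines would give column boundaries.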

Page Properties

doc = botl_pdf.open("report.pdf")

for i, page in enumerate(doc.pages):
    print(f"Page {i}: {page.width:.0f}×{page.height:.0f}pt  "
          f"rotation={page.rotation}°  "
          f"label={page.label!r}")

Output:

Page 0: 612×792pt  rotation=0°  label='1'
Page 1: 612×792pt  rotation=0°  label='2'

Common page sizes:

  • Letter: 612 × 792 pt (8.5" × 11")
  • A4: 595 × 842 pt (210mm × 297mm)
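
Page dimensions are in PDF points, where 1 pt = 1/72 inch. The following sketch converts points to inches and millimetres and matches a page against the two sizes listed above; the helper names and the 2 pt tolerance are assumptions made for illustration.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

KNOWN_SIZES = {"Letter": (612, 792), "A4": (595, 842)}


def pt_to_inches(pt: float) -> float:
    return pt / POINTS_PER_INCH


def pt_to_mm(pt: float) -> float:
    return pt / POINTS_PER_INCH * MM_PER_INCH


def classify(width: float, height: float, tolerance: float = 2.0):
    """Return the name of a matching page size, ignoring orientation."""
    short, long_ = sorted((width, height))
    for name, (w, h) in KNOWN_SIZES.items():
        if abs(short - w) <= tolerance and abs(long_ - h) <= tolerance:
            return name
    return None


print(pt_to_inches(612), pt_to_inches(792))        # Letter in inches
print(round(pt_to_mm(595)), round(pt_to_mm(842)))  # A4 in millimetres
print(classify(792, 612))                          # landscape Letter
```

The tolerance absorbs the small rounding differences that PDF producers introduce, for example A4 stored as 595.28 × 841.89 pt.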

Visual Debugging

Requires Pillow. Draws bounding boxes and geometric elements on a rendered page image — useful for debugging extraction issues or understanding PDF layout.

pip install botl-pdf[debug]

import botl_pdf
from botl_pdf.debug import VisualDebugger

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

debugger = VisualDebugger(page)

# Draw character bounding boxes (red)
img = debugger.draw_chars(resolution=150)
img.save("debug_chars.png")

# Draw geometric lines (blue)
img = debugger.draw_lines(resolution=150)
img.save("debug_lines.png")

# Draw geometric rectangles (green)
img = debugger.draw_rects(resolution=150)
img.save("debug_rects.png")

# All elements layered together
img = debugger.draw_all(resolution=150)
img.save("debug_all.png")

CLI

pip install botl-pdf[cli]

Extract text

# To stdout
botl-pdf text report.pdf

# To file
botl-pdf text report.pdf --output text.txt

# Specific pages
botl-pdf text report.pdf --pages 1-5

# Layout-preserved
botl-pdf text report.pdf --layout

Show metadata

botl-pdf info report.pdf

Output:

{
  "version": "1.4",
  "page_count": 42,
  "encrypted": false,
  "title": "Annual Report 2024",
  "author": "Acme Corp",
  "creator": "LaTeX",
  "producer": "pdfTeX-1.40"
}

Export

# Markdown
botl-pdf export report.pdf --format markdown --output report.md

# Plain text
botl-pdf export report.pdf --format text --output report.txt

API Reference

botl_pdf.open(path_or_bytes, *, password=None, lazy=True) -> Document

Open a PDF from a file path (str) or raw bytes.

Document

Property / Method  Type            Description
.metadata          dict            Metadata fields: title, author, subject, keywords, creator, producer, creation_date, mod_date, version, page_count
.num_pages         int             Number of pages
.is_encrypted      bool            Whether the document is encrypted
.toc               list[TOCEntry]  Table of contents / outline bookmarks
.pages             PageCollection  Iterable, subscriptable page access
doc[i]             PyPage          Shortcut for doc.pages[i] (supports negative indices)
len(doc)           int             Same as .num_pages

Page (via doc.pages[i])

Property / Method                                Type            Description
.extract_text(layout=False, layout_params=None)  str             Extract text (plain or layout-preserved)
.chars                                           list[Char]      All characters with full style info
.lines                                           list[GeomLine]  Geometric lines on the page
.rects                                           list[GeomRect]  Geometric rectangles on the page
.width                                           float           Page width in points
.height                                          float           Page height in points
.rotation                                        int             Rotation in degrees (0, 90, 180, 270)
.page_number                                     int             Zero-based page index
.label                                           str             Page label string (e.g. "iii", "A-1")

Char

Property         Type                                Description
.text            str                                 Unicode character
.bbox            BBox                                Bounding box
.font_name       str                                 Font resource name (e.g. "F1")
.font_size       float                               Font size in points
.bold            bool                                Bold flag
.italic          bool                                Italic flag
.color           tuple[float, float, float] or None  Fill color (RGB, 0.0-1.0)
.stroking_color  tuple[float, float, float] or None  Stroke color (RGB, 0.0-1.0)
.rotation        float                               Rotation in degrees
.run_id          int                                 Text operation ID (chars from same Tj/TJ share this)

BBox

Property / Method  Type            Description
.x0, .y0           float           Top-left corner
.x1, .y1           float           Bottom-right corner
.width             float           Width (x1 - x0)
.height            float           Height (y1 - y0)
.center()          (float, float)  Center point
.area()            float           Area
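
The derived members above follow directly from the four corner coordinates. As a rough illustration, here is a stand-in `Box` class (hypothetical, not the library's `BBox`) that reproduces them in pure Python and adds a `union` helper, which is an assumption of this sketch and handy for computing the box of a word from its characters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def area(self) -> float:
        return self.width * self.height

    def union(self, other: "Box") -> "Box":
        """Smallest box covering both self and other (not a botl-pdf method)."""
        return Box(
            min(self.x0, other.x0), min(self.y0, other.y0),
            max(self.x1, other.x1), max(self.y1, other.y1),
        )


h = Box(100.0, 700.0, 108.0, 712.0)
e = Box(108.0, 700.0, 115.0, 712.0)
word_box = h.union(e)
print(h.width, h.height, h.center(), h.area())
print(word_box)
```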

TOCEntry

Property      Type         Description
.title        str          Outline entry title
.level        int          Nesting depth (0 = top-level)
.page_number  int or None  0-indexed destination page (None if unresolvable)
.dest         str or None  Raw destination string

GeomLine

Property     Type           Description
.x0, .y0     float          Start point
.x1, .y1     float          End point
.line_width  float          Stroke width
.color       tuple or None  RGB color (0.0-1.0)

GeomRect

Property       Type           Description
.bbox          BBox           Bounding box
.line_width    float          Stroke width
.stroke_color  tuple or None  Stroke RGB color
.fill_color    tuple or None  Fill RGB color

LayoutParams

Parameter    Type   Default  Description
word_margin  float  2.0      Max horizontal gap between chars in same word, as a multiple of font size
line_margin  float  0.5      Max vertical gap between lines in same block, as a multiple of line height
boxes_flow   float  0.5      Reading-order direction (0.0 = strict horizontal, 1.0 = strict vertical)

params = botl_pdf.LayoutParams(word_margin=1.5, line_margin=0.3, boxes_flow=0.0)
text = page.extract_text(layout=True, layout_params=params)

Architecture

PDF bytes
  → Parser (nom tokenizer + recursive-descent objects)
    → Content stream interpreter (Tj/TJ/q/Q/cm operators)
      → Character extraction (CMap, fonts, glyph widths)
        → Layout analysis (chars → words → lines → blocks)
          → Reading order (column detection, run de-interleaving)
            → Text output (plain or layout-preserved)

The pipeline is entirely custom Rust — no dependency on poppler, pdfium, pdfbox, or any other PDF library.

Key design decisions:

  • Run-aware de-interleaving — Each Tj/TJ text operation tags characters with a run_id. When PDF producers interleave characters from different operations at alternating x-positions, the layout engine detects this and groups by run, preserving correct reading order.
  • Font-band separation — Within a line, characters are grouped by font size to handle decorative initials and mixed-size text on the same visual line.
  • Lazy extraction — Page content is decoded on first access and cached. The parsed Document is shared across pages via Arc<Mutex>, so there's no per-page re-parsing.
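
To illustrate the idea behind run-aware de-interleaving, the sketch below reads one visual line given as `(x, run_id, text)` tuples. It is a simplified stand-in under stated assumptions, not the engine's actual algorithm: interleaving is detected by counting how often the run changes in left-to-right order, and interleaved runs are then read one at a time. `read_line` is a hypothetical name.

```python
from collections import defaultdict
from typing import List, Tuple


def read_line(chars: List[Tuple[float, int, str]]) -> str:
    """chars: (x, run_id, text) tuples that sit on one visual line."""
    by_x = sorted(chars)
    switches = sum(1 for a, b in zip(by_x, by_x[1:]) if a[1] != b[1])
    n_runs = len({run_id for _, run_id, _ in chars})
    if switches <= n_runs - 1:
        # Runs occupy disjoint x-ranges: plain left-to-right order is fine
        return "".join(t for _, _, t in by_x)
    # Runs are interleaved: read each run as a unit, ordered by leftmost x
    runs = defaultdict(list)
    for x, run_id, t in by_x:
        runs[run_id].append((x, t))
    ordered = sorted(runs.values(), key=lambda run: run[0][0])
    return "".join(t for run in ordered for _, t in run)


# Two runs whose characters alternate in x-position
interleaved = (
    [(i * 10.0, 1, ch) for i, ch in enumerate("Hello")]
    + [(i * 10.0 + 5.0, 2, ch) for i, ch in enumerate("World")]
)
naive = "".join(t for _, _, t in sorted(interleaved))
fixed = read_line(interleaved)
print(naive)
print(fixed)
```

Sorting purely by x scrambles the two runs together, while grouping by run keeps each text operation intact.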

Benchmarks

Tested against PyMuPDF on real-world PDFs (textbooks, novels, academic papers):

PDF                              Pages  botl-pdf words  PyMuPDF words  botl-pdf time  PyMuPDF time
Electrical engineering textbook  100    35,435          34,708         238 ms         174 ms
Discrete math textbook           200    89,291          89,968         526 ms         426 ms
French novel                     130    45,355          45,337         293 ms         214 ms
American Revolution history      293    100,954         99,897         591 ms         377 ms
Rust Programming Language 3E     560    200,177         196,748        1262 ms        873 ms
Mystery novel                    300    89,610          88,604         568 ms         445 ms
Total                            1583   560,822         555,262        3478 ms        2509 ms

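Total word counts are within ~1% of PyMuPDF's; per-document differences range from under 0.1% to about 2%. Overall, botl-pdf takes about 1.4x as long as PyMuPDF on this set.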
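
The summary figures can be rechecked from the per-document rows. The short script below does that arithmetic; the numbers are copied from the table above, and nothing here measures botl-pdf or PyMuPDF itself.

```python
# (name, botl-pdf words, PyMuPDF words, botl-pdf ms, PyMuPDF ms)
rows = [
    ("Electrical engineering textbook", 35_435, 34_708, 238, 174),
    ("Discrete math textbook", 89_291, 89_968, 526, 426),
    ("French novel", 45_355, 45_337, 293, 214),
    ("American Revolution history", 100_954, 99_897, 591, 377),
    ("Rust Programming Language 3E", 200_177, 196_748, 1262, 873),
    ("Mystery novel", 89_610, 88_604, 568, 445),
]

# Per-document word-count difference (percent) and time ratio
word_diffs = [100 * (bw - pw) / pw for _, bw, pw, _, _ in rows]
time_ratios = [bt / pt for _, _, _, bt, pt in rows]

for (name, *_), diff, ratio in zip(rows, word_diffs, time_ratios):
    print(f"{name:33s} words {diff:+.2f}%  time {ratio:.2f}x")

# Aggregate ratios over the whole set
word_ratio = sum(r[1] for r in rows) / sum(r[2] for r in rows)
time_ratio = sum(r[3] for r in rows) / sum(r[4] for r in rows)
print(f"Overall: words {word_ratio:.3f}x, time {time_ratio:.2f}x")
```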


Development

# Set up environment
python -m venv .venv && source .venv/bin/activate
pip install maturin pytest

# Build Rust extension in release mode
maturin develop --release

# Run Rust tests (198 tests)
cd rust && cargo test

# Run Python tests
pytest tests/python/

# Run benchmarks
pytest tests/python/benchmarks/ --benchmark-only

Project structure

botl-pdf/
├── rust/
│   ├── botl-pdf-core/        # Core engine (parser, text, layout, codecs)
│   ├── botl-pdf-python/      # PyO3 bindings → _core native module
│   └── botl-pdf-csys/        # Image codec FFI (JPEG, JPEG2000)
├── python/botl_pdf/          # High-level Python API
│   ├── document.py           # Document, PageCollection
│   ├── page.py               # Page wrapper
│   ├── export.py             # to_text(), to_markdown()
│   ├── debug.py              # VisualDebugger (Pillow overlays)
│   ├── tables.py             # Table/TableCell dataclasses
│   └── cli/main.py           # CLI: text, info, export
├── tests/
│   ├── rust/                 # Integration tests (parser, text, layout, geometry)
│   └── python/               # Unit + integration tests
└── docs/                     # Sphinx docs

License

Apache 2.0
