High-performance PDF processing: extract text, tables, images with a Rust + C core.

botl-pdf

High-performance PDF text extraction library with a custom Rust core and Python bindings. No dependency on poppler, pdfium, or pdfbox — the entire PDF parsing and text extraction pipeline is written from scratch.

Features

  • Fast text extraction with layout analysis
  • Character-level output with bounding boxes, fonts, colors, and styles
  • Layout-preserving text extraction (spatial whitespace)
  • Table of contents (TOC/outline) extraction with page numbers
  • Document metadata extraction (title, author, dates, etc.)
  • Geometric element extraction (lines, rectangles)
  • Configurable layout parameters (word spacing, line grouping, reading order)
  • Run-aware de-interleaving for correct reading order on complex PDFs
  • Pythonic API with type hints throughout
  • CLI for common operations
  • Zero external PDF library dependencies

Install

pip install botlpdf

Build from source (requires Rust toolchain):

pip install maturin
git clone https://github.com/Shivamjohri247/botl-pdf.git
cd botl-pdf
maturin develop --release

Quick Start

import botl_pdf

doc = botl_pdf.open("report.pdf")
text = doc.pages[0].extract_text()
print(text)

Opening Documents

From a file path

import botl_pdf

doc = botl_pdf.open("report.pdf")
print(f"Pages: {doc.num_pages}")
print(f"Encrypted: {doc.is_encrypted}")

From bytes

with open("report.pdf", "rb") as f:
    data = f.read()

doc = botl_pdf.open(data)
print(f"Pages: {doc.num_pages}")

As a context manager

with botl_pdf.open("report.pdf") as doc:
    text = doc.pages[0].extract_text()

Text Extraction

Plain text (default)

Returns clean, readable text. Blocks are separated by double newlines, lines by single newlines, words by spaces.

doc = botl_pdf.open("report.pdf")

# Single page
text = doc.pages[0].extract_text()
print(text)

# All pages
for page in doc.pages:
    print(page.extract_text())

# Subscript access (0-based, supports negative)
text_last = doc.pages[-1].extract_text()
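
Because the plain-text output separates blocks with double newlines, lines with single newlines, and words with spaces, it can be split back into a nested structure with ordinary string operations. A minimal sketch, using a made-up sample string in place of real extract_text() output:

```python
# Hypothetical sample standing in for page.extract_text() output.
sample = (
    "Annual Report\nFiscal Year 2024\n\n"
    "Revenue grew in every region.\nCosts stayed flat."
)

# blocks -> lines -> words, following the separators described above
blocks = [
    [line.split(" ") for line in block.split("\n")]
    for block in sample.split("\n\n")
]

for i, block in enumerate(blocks):
    word_count = sum(len(line) for line in block)
    print(f"Block {i}: {len(block)} lines, {word_count} words")
```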

Layout-preserving text

Maintains spatial positioning using proportional spaces between words. Useful when you need to preserve visual alignment of columns, tables, or indented text.

doc = botl_pdf.open("financial_report.pdf")
page = doc.pages[0]

# Layout mode preserves spatial whitespace
layout_text = page.extract_text(layout=True)
print(layout_text)
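
To make the idea of spatial whitespace concrete, here is a small sketch of one way word positions could be mapped to text columns. It is not botl-pdf's actual algorithm: render_line and the fixed char_width are hypothetical, and real layout engines derive the column width from the fonts on the page.

```python
def render_line(words, char_width=6.0):
    """Place (x0, text) words on one line, padding with spaces by x position.

    char_width is an assumed average glyph width in points.
    """
    out = ""
    for x0, text in sorted(words):
        col = round(x0 / char_width)
        # Pad up to the target column; keep at least one space between words.
        pad = max(col - len(out), 1 if out else 0)
        out += " " * pad + text
    return out


line = render_line([(0.0, "Item"), (120.0, "Qty"), (240.0, "Price")])
print(repr(line))
```

Words that start 120 pt apart end up 20 columns apart here, which is what keeps table columns visually aligned in the output.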

Tuning extraction parameters

import botl_pdf

doc = botl_pdf.open("two_column.pdf")

# Tighter word grouping: a smaller word_margin starts a new word at smaller gaps
params = botl_pdf.LayoutParams(
    word_margin=1.5,   # max horizontal gap in same word (× font_size), default 2.0
    line_margin=0.5,   # max vertical gap in same block (× line height), default 0.5
    boxes_flow=0.5,    # reading order: 0.0=horizontal, 1.0=vertical, default 0.5
)

text = doc.pages[0].extract_text(layout=True, layout_params=params)
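
The effect of word_margin can be illustrated without the library. The sketch below assumes the rule described in the comments above (a gap wider than word_margin × font size starts a new word); Glyph and group_words are hypothetical stand-ins, not part of botl-pdf.

```python
from typing import NamedTuple


class Glyph(NamedTuple):  # hypothetical stand-in for a Char
    text: str
    x0: float
    x1: float
    font_size: float


def group_words(glyphs, word_margin=2.0):
    """Split glyphs into words wherever the horizontal gap exceeds the threshold."""
    words, current, prev = [], [], None
    for g in sorted(glyphs, key=lambda g: g.x0):
        if prev is not None and g.x0 - prev.x1 > word_margin * g.font_size:
            words.append("".join(current))
            current = []
        current.append(g.text)
        prev = g
    if current:
        words.append("".join(current))
    return words


glyphs = [
    Glyph("a", 0.0, 5.0, 10.0),
    Glyph("b", 5.0, 10.0, 10.0),
    Glyph("c", 28.0, 33.0, 10.0),  # 18 pt gap after "b"
    Glyph("d", 33.0, 38.0, 10.0),
]
print(group_words(glyphs, word_margin=2.0))  # threshold 20 pt
print(group_words(glyphs, word_margin=1.5))  # threshold 15 pt
```

With the default threshold the 18 pt gap stays inside one word; lowering word_margin to 1.5 splits it into two.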

Exporting entire documents

from botl_pdf.export import to_text, to_markdown

# Plain text for all pages
full_text = to_text("report.pdf")

# Layout-preserved text
full_text_layout = to_text("report.pdf", layout=True)

# Markdown (pages separated by horizontal rules)
markdown = to_markdown("report.pdf")

# Specific page range only
markdown_subset = to_markdown("report.pdf", pages=range(0, 5))

Character-Level Access

Each page exposes individual characters with full style information: bounding box, font name, font size, bold/italic flags, fill and stroke colors, rotation, and run ID.

Inspecting individual characters

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

for char in page.chars[:5]:
    print(f"  char={char.text!r}  "
          f"pos=({char.bbox.x0:.1f}, {char.bbox.y0:.1f})  "
          f"size={char.font_size:.1f}  "
          f"font={char.font_name}")

Output:

  char='H'  pos=(100.0, 700.0)  size=12.0  font=F1
  char='e'  pos=(108.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(115.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(120.0, 700.0)  size=12.0  font=F1
  char='o'  pos=(125.0, 700.0)  size=12.0  font=F1

Finding text by style

# Find all bold characters on page 0
bold_chars = [c for c in doc.pages[0].chars if c.bold]
bold_text = "".join(c.text for c in bold_chars)

# Find characters in a specific color (e.g., red links)
red_chars = [
    c for c in doc.pages[0].chars
    if c.color and c.color[0] > 0.8 and c.color[1] < 0.2 and c.color[2] < 0.2
]

# Find large decorative initials (font size > 30)
initials = [c for c in doc.pages[0].chars if c.font_size > 30]
for c in initials:
    print(f"Decorative initial: {c.text!r} at size {c.font_size:.0f}")

Extracting text from a region

# Get all text in a specific rectangular area
x0, y0, x1, y1 = 100.0, 600.0, 400.0, 700.0

region_chars = [
    c for c in doc.pages[0].chars
    if c.bbox.x0 >= x0 and c.bbox.x1 <= x1
    and c.bbox.y0 >= y0 and c.bbox.y1 <= y1
]
region_text = "".join(c.text for c in region_chars)
print(region_text)
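
The containment test above drops characters that straddle the region's edge. A possible variant is to test each character's center point instead. The sketch below uses SimpleNamespace objects as stand-ins for Char and BBox; text_in_region and fake_char are hypothetical helpers.

```python
from types import SimpleNamespace


def text_in_region(chars, x0, y0, x1, y1):
    """Join chars whose bbox center lies inside the rectangle (edges inclusive)."""
    picked = []
    for c in chars:
        cx = (c.bbox.x0 + c.bbox.x1) / 2
        cy = (c.bbox.y0 + c.bbox.y1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            picked.append(c.text)
    return "".join(picked)


def fake_char(text, x0, y0, x1, y1):
    """Build a stand-in object with the same attributes the filter reads."""
    return SimpleNamespace(
        text=text, bbox=SimpleNamespace(x0=x0, y0=y0, x1=x1, y1=y1)
    )


chars = [
    fake_char("A", 98.0, 650.0, 106.0, 662.0),   # straddles the left edge
    fake_char("B", 106.0, 650.0, 114.0, 662.0),  # fully inside
    fake_char("C", 500.0, 650.0, 508.0, 662.0),  # outside
]
print(text_in_region(chars, 100.0, 600.0, 400.0, 700.0))  # AB
```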

Run ID tracking

Characters from the same text-showing operation (Tj/TJ) share a run_id. This lets you group characters by their PDF text operation — useful for debugging extraction issues or understanding the PDF's internal structure.

from collections import defaultdict

# Group characters by their source text operation
runs = defaultdict(str)
for c in doc.pages[0].chars:
    runs[c.run_id] += c.text

for run_id, text in sorted(runs.items()):
    print(f"  Run {run_id}: {text[:60]!r}")

Document Metadata

doc = botl_pdf.open("report.pdf")

meta = doc.metadata
print(f"Title:    {meta.get('title')}")
print(f"Author:   {meta.get('author')}")
print(f"Subject:  {meta.get('subject')}")
print(f"Creator:  {meta.get('creator')}")
print(f"Producer: {meta.get('producer')}")
print(f"Created:  {meta.get('creation_date')}")
print(f"Modified: {meta.get('mod_date')}")
print(f"Version:  {meta.get('version')}")

Table of Contents

doc = botl_pdf.open("book.pdf")

toc = doc.toc
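for entry in toc:
    indent = "  " * entry.level
    # page_number is None when the destination cannot be resolved
    page = entry.page_number if entry.page_number is not None else "?"
    print(f"{indent}{entry.title}  →  page {page}")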

Output:

Preface  →  page 5
  Acknowledgments  →  page 7
Part I. Foundations  →  page 11
  Chapter 1. Introduction  →  page 13
  Chapter 2. Methods  →  page 27
Part II. Results  →  page 45
  Chapter 3. Analysis  →  page 47

Building a page lookup from TOC

# Map page numbers to their chapter titles
chapters = {}
current_chapter = None
for entry in doc.toc:
    if entry.level == 0 and entry.page_number is not None:
        current_chapter = entry.title
    if current_chapter and entry.page_number is not None:
        chapters[entry.page_number] = current_chapter

# Find which chapter a page belongs to
import bisect

def chapter_for_page(page_idx):
    page_nums = sorted(chapters)
    # Index of the first mapped page after page_idx; the entry before it applies
    i = bisect.bisect_right(page_nums, page_idx)
    # None if the page precedes every TOC entry or the TOC is empty
    return chapters[page_nums[i - 1]] if i > 0 else None

print(f"Page 30 is in: {chapter_for_page(30)}")

Geometric Elements

Pages expose geometric lines and rectangles drawn on the PDF canvas — useful for detecting table borders, rules, decorative elements, and form fields.

Lines

page = doc.pages[0]

for line in page.lines:
    print(f"  Line ({line.x0:.1f},{line.y0:.1f}) → ({line.x1:.1f},{line.y1:.1f})  "
          f"width={line.line_width:.1f}")

Rectangles

for rect in page.rects:
    fill = rect.fill_color
    stroke = rect.stroke_color
    print(f"  Rect ({rect.bbox.x0:.1f},{rect.bbox.y0:.1f})-"
          f"({rect.bbox.x1:.1f},{rect.bbox.y1:.1f})  "
          f"stroke={stroke}  fill={fill}")

Detecting horizontal rules

# Find horizontal lines (useful for detecting separators/tables)
h_rules = [
    line for line in page.lines
    if abs(line.y1 - line.y0) < 1.0 and abs(line.x1 - line.x0) > 50.0
]

for rule in h_rules:
    print(f"Horizontal rule at y={rule.y0:.1f} from x={rule.x0:.1f} to x={rule.x1:.1f}")

Page Properties

doc = botl_pdf.open("report.pdf")

for i, page in enumerate(doc.pages):
    print(f"Page {i}: {page.width:.0f}×{page.height:.0f}pt  "
          f"rotation={page.rotation}°  "
          f"label={page.label!r}")

Output:

Page 0: 612×792pt  rotation=0°  label='1'
Page 1: 612×792pt  rotation=0°  label='2'

Common page sizes:

  • Letter: 612 × 792 pt (8.5" × 11")
  • A4: 595 × 842 pt (210mm × 297mm)
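
Page dimensions are in PDF points, where 1 pt is 1/72 inch, so the sizes above can be checked with a short conversion. The helper names are illustrative only:

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def pt_to_in(points):
    """Convert PDF points to inches."""
    return points / POINTS_PER_INCH


def pt_to_mm(points):
    """Convert PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


print(pt_to_in(612), pt_to_in(792))                # Letter: 8.5 11.0
print(round(pt_to_mm(595)), round(pt_to_mm(842)))  # A4: 210 297
```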

Visual Debugging

Requires Pillow. Draws bounding boxes and geometric elements on a rendered page image — useful for debugging extraction issues or understanding PDF layout.

pip install botlpdf[debug]

from botl_pdf.debug import VisualDebugger
import botl_pdf

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

debugger = VisualDebugger(page)

# Draw character bounding boxes (red)
img = debugger.draw_chars(resolution=150)
img.save("debug_chars.png")

# Draw geometric lines (blue)
img = debugger.draw_lines(resolution=150)
img.save("debug_lines.png")

# Draw geometric rectangles (green)
img = debugger.draw_rects(resolution=150)
img.save("debug_rects.png")

# All elements layered together
img = debugger.draw_all(resolution=150)
img.save("debug_all.png")

CLI

pip install botlpdf[cli]

Extract text

# To stdout
botl-pdf text report.pdf

# To file
botl-pdf text report.pdf --output text.txt

# Specific pages
botl-pdf text report.pdf --pages 1-5

# Layout-preserved
botl-pdf text report.pdf --layout

Show metadata

botl-pdf info report.pdf

Output:

{
  "version": "1.4",
  "page_count": 42,
  "encrypted": false,
  "title": "Annual Report 2024",
  "author": "Acme Corp",
  "creator": "LaTeX",
  "producer": "pdfTeX-1.40"
}
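
If the info command prints only a JSON object like the one above (an assumption based on this sample), its output can be consumed from a script with the standard json module. The sketch parses the sample text directly; in practice the string would come from the command's stdout.

```python
import json

# Sample output copied from above, standing in for the command's stdout.
raw = """
{
  "version": "1.4",
  "page_count": 42,
  "encrypted": false,
  "title": "Annual Report 2024",
  "author": "Acme Corp",
  "creator": "LaTeX",
  "producer": "pdfTeX-1.40"
}
"""

info = json.loads(raw)
print(f"{info['title']}: {info['page_count']} pages, PDF {info['version']}")
```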

Export

# Markdown
botl-pdf export report.pdf --format markdown --output report.md

# Plain text
botl-pdf export report.pdf --format text --output report.txt

API Reference

botl_pdf.open(path_or_bytes, *, password=None, lazy=True) -> Document

Open a PDF from a file path (str) or raw bytes.

Document

Property / Method | Type | Description
.metadata | dict | Metadata fields: title, author, subject, keywords, creator, producer, creation_date, mod_date, version, page_count
.num_pages | int | Number of pages
.is_encrypted | bool | Whether the document is encrypted
.toc | list[TOCEntry] | Table of contents / outline bookmarks
.pages | PageCollection | Iterable, subscriptable page access
doc[i] | PyPage | Shortcut for doc.pages[i] (supports negative indices)
len(doc) | int | Same as .num_pages

Page (via doc.pages[i])

Property / Method | Type | Description
.extract_text(layout=False, layout_params=None) | str | Extract text (plain or layout-preserved)
.chars | list[Char] | All characters with full style info
.lines | list[GeomLine] | Geometric lines on the page
.rects | list[GeomRect] | Geometric rectangles on the page
.width | float | Page width in points
.height | float | Page height in points
.rotation | int | Rotation in degrees (0, 90, 180, 270)
.page_number | int | Zero-based page index
.label | str | Page label string (e.g. "iii", "A-1")

Char

Property | Type | Description
.text | str | Unicode character
.bbox | BBox | Bounding box
.font_name | str | Font resource name (e.g. "F1")
.font_size | float | Font size in points
.bold | bool | Bold flag
.italic | bool | Italic flag
.color | tuple[float, float, float] or None | Fill color (RGB, 0.0-1.0)
.stroking_color | tuple[float, float, float] or None | Stroke color (RGB, 0.0-1.0)
.rotation | float | Rotation in degrees
.run_id | int | Text operation ID (chars from same Tj/TJ share this)

BBox

Property / Method | Type | Description
.x0, .y0 | float | Top-left corner
.x1, .y1 | float | Bottom-right corner
.width | float | Width (x1 - x0)
.height | float | Height (y1 - y0)
.center() | (float, float) | Center point
.area() | float | Area
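
The derived BBox values follow directly from the four coordinates. The dataclass below is a hypothetical pure-Python stand-in that mirrors the documented properties; it is not the library's own class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBoxSketch:
    """Illustrative stand-in with the same derived values as the table above."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def area(self):
        return self.width * self.height


box = BBoxSketch(100.0, 600.0, 400.0, 700.0)
print(box.width, box.height, box.center(), box.area())
```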

TOCEntry

Property | Type | Description
.title | str | Outline entry title
.level | int | Nesting depth (0 = top-level)
.page_number | int or None | 0-indexed destination page (None if unresolvable)
.dest | str or None | Raw destination string

GeomLine

Property | Type | Description
.x0, .y0 | float | Start point
.x1, .y1 | float | End point
.line_width | float | Stroke width
.color | tuple or None | RGB color (0.0-1.0)

GeomRect

Property | Type | Description
.bbox | BBox | Bounding box
.line_width | float | Stroke width
.stroke_color | tuple or None | Stroke RGB color
.fill_color | tuple or None | Fill RGB color

LayoutParams

Parameter | Type | Default | Description
word_margin | float | 2.0 | Max horizontal gap between chars in same word, as a multiple of font size
line_margin | float | 0.5 | Max vertical gap between lines in same block, as a multiple of line height
boxes_flow | float | 0.5 | Reading-order direction (0.0 = strict horizontal, 1.0 = strict vertical)

params = botl_pdf.LayoutParams(word_margin=1.5, line_margin=0.3, boxes_flow=0.0)
text = page.extract_text(layout=True, layout_params=params)

Architecture

PDF bytes
  → Parser (nom tokenizer + recursive-descent objects)
    → Content stream interpreter (Tj/TJ/q/Q/cm operators)
      → Character extraction (CMap, fonts, glyph widths)
        → Layout analysis (chars → words → lines → blocks)
          → Reading order (column detection, run de-interleaving)
            → Text output (plain or layout-preserved)

The pipeline is entirely custom Rust — no dependency on poppler, pdfium, pdfbox, or any other PDF library.
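
The lines → blocks stage can be pictured with a small Python sketch. It assumes the line_margin rule from the LayoutParams table (a vertical gap larger than line_margin × line height starts a new block) and a coordinate system where y grows upward. Line and group_blocks are hypothetical; the real implementation is in Rust and may differ.

```python
from typing import NamedTuple


class Line(NamedTuple):  # hypothetical stand-in for a laid-out text line
    text: str
    y0: float  # bottom edge
    y1: float  # top edge (y grows upward in this sketch)


def group_blocks(lines, line_margin=0.5):
    """Group lines into blocks and join them the way plain-text output is described."""
    blocks, prev = [], None
    for line in sorted(lines, key=lambda ln: ln.y1, reverse=True):  # top of page first
        height = line.y1 - line.y0
        if prev is None or prev.y0 - line.y1 > line_margin * height:
            blocks.append([])  # gap too large: start a new block
        blocks[-1].append(line.text)
        prev = line
    return "\n\n".join("\n".join(block) for block in blocks)


lines = [
    Line("Heading", 700.0, 712.0),
    Line("Body one", 680.0, 692.0),  # 8 pt below the heading
    Line("Body two", 666.0, 678.0),  # 2 pt below "Body one"
]
print(group_blocks(lines))
```

With 12 pt lines and the default margin the threshold is 6 pt, so the heading becomes its own block while the two body lines stay together.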

Key design decisions:

  • Run-aware de-interleaving — Each Tj/TJ text operation tags characters with a run_id. When PDF producers interleave characters from different operations at alternating x-positions, the layout engine detects this and groups by run, preserving correct reading order.
  • Font-band separation — Within a line, characters are grouped by font size to handle decorative initials and mixed-size text on the same visual line.
  • Lazy extraction — Page content is decoded on first access and cached. The parsed Document is shared across pages via Arc<Mutex>, so there's no per-page re-parsing.
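
The run-aware de-interleaving idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the engine's actual heuristic: RunChar and deinterleave are hypothetical, and interleaving is detected by counting how often the run changes when characters are read strictly left to right.

```python
from collections import defaultdict
from typing import NamedTuple


class RunChar(NamedTuple):  # hypothetical stand-in for a Char
    text: str
    x0: float
    run_id: int


def deinterleave(chars):
    """Return line text, grouping by run when runs alternate along x."""
    by_x = sorted(chars, key=lambda c: c.x0)
    switches = sum(1 for a, b in zip(by_x, by_x[1:]) if a.run_id != b.run_id)
    n_runs = len({c.run_id for c in by_x})
    # Runs laid side by side change at most n_runs - 1 times; more means interleaving.
    if switches <= max(n_runs - 1, 0):
        return "".join(c.text for c in by_x)
    runs = defaultdict(list)
    for c in by_x:
        runs[c.run_id].append(c)
    ordered = sorted(runs.values(), key=lambda run: run[0].x0)
    return "".join(c.text for run in ordered for c in run)


interleaved = [
    RunChar("A", 0.0, 1), RunChar("B", 10.0, 1),
    RunChar("C", 5.0, 2), RunChar("D", 15.0, 2),
]
print(deinterleave(interleaved))  # ABCD; a plain x-sort would give ACBD
```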

Benchmarks

Tested against PyMuPDF on real-world PDFs (textbooks, novels, academic papers). v0.2.0 includes performance optimizations and improved word boundary detection.

Text Extraction Quality

PDF | Pages | botl-pdf words | PyMuPDF words | Word coverage
Acrimonious (novel) | 408 | 118,767 | 110,314 | 107.7%
Agentic Mesh (tech) | 558 | 136,669 | 132,386 | 103.2%
Azure Fundamentals | 576 | 89,490 | 87,183 | 102.6%
Data Science (textbook) | 438 | 100,594 | 93,286 | 107.8%
Discrete Math (textbook) | 565 | 93,691 | 89,968 | 104.1%
Mastering AI System Design | 1,038 | 85,854 | 82,608 | 103.9%
System Design Interview | 341 | 47,769 | 46,523 | 102.7%
American Revolution | 293 | 107,411 | 99,897 | 107.5%
Rust Programming 3E | 806 | 203,941 | 196,748 | 103.7%
Total (all 17 PDFs) | 6,663 | 1,399,763 | 1,331,742 | 105.1%

Character-level coverage is 99.7% of PyMuPDF's. botl-pdf extracts about 5% more words overall (105.1% word coverage).
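
Word coverage here appears to be the botl-pdf word count divided by the PyMuPDF word count. Under that assumption, the totals reported above reproduce the overall figure:

```python
# Totals from the table above.
botl_words = 1_399_763
pymupdf_words = 1_331_742

coverage = botl_words / pymupdf_words * 100
print(f"Word coverage: {coverage:.1f}%")  # Word coverage: 105.1%
```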

Performance

PDF | Pages | botl-pdf | PyMuPDF | Ratio (botl-pdf / PyMuPDF)
Mastering AI System Design | 1,038 | 0.56s | 0.72s | 0.78x (faster)
System Design Interview | 341 | 0.21s | 0.31s | 0.66x (faster)
Discrete Math | 565 | 0.45s | 0.45s | 1.00x (equal)
Faking Fore-Ever (novel) | 196 | 0.21s | 0.21s | 0.98x (faster)
American Revolution | 293 | 0.49s | 0.39s | 1.27x
Rust Programming 3E | 806 | 0.91s | 0.73s | 1.23x
Overall (17 PDFs) | 6,663 | 6.40s | 5.86s | 1.09x

Overall, botl-pdf is about 9% slower than PyMuPDF. It is faster on 5 of the 17 PDFs and competitive on the rest.
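
The ratio column reads as botl-pdf time divided by PyMuPDF time, so values below 1.0 mean botl-pdf was faster. A quick check of the overall row under that reading:

```python
def time_ratio(botl_seconds, pymupdf_seconds):
    """Ratio of botl-pdf time to PyMuPDF time; below 1.0 means botl-pdf is faster."""
    return botl_seconds / pymupdf_seconds


overall = time_ratio(6.40, 5.86)
print(f"{overall:.2f}x, about {(overall - 1) * 100:.0f}% slower")
```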

What changed in v0.2.0

  • ~2x faster than v0.1.x through Arc-based caching, cross-page font cache, zlib-ng backend, and reduced cloning
  • Fixed word boundary detection for PDFs that encode spaces as position gaps instead of literal space characters
  • Character coverage improved from partial to 99.7% of PyMuPDF across diverse PDF types

Development

# Set up environment
python -m venv .venv && source .venv/bin/activate
pip install maturin pytest

# Build Rust extension in release mode
maturin develop --release

# Run Rust tests (198 tests)
cd rust && cargo test

# Run Python tests
pytest tests/python/

# Run benchmarks
pytest tests/python/benchmarks/ --benchmark-only

Project structure

botl-pdf/
├── rust/
│   ├── botl-pdf-core/        # Core engine (parser, text, layout, codecs)
│   ├── botl-pdf-python/      # PyO3 bindings → _core native module
│   └── botl-pdf-csys/        # Image codec FFI (JPEG, JPEG2000)
├── python/botl_pdf/          # High-level Python API
│   ├── document.py           # Document, PageCollection
│   ├── page.py               # Page wrapper
│   ├── export.py             # to_text(), to_markdown()
│   ├── debug.py              # VisualDebugger (Pillow overlays)
│   ├── tables.py             # Table/TableCell dataclasses
│   └── cli/main.py           # CLI: text, info, export
├── tests/
│   ├── rust/                 # Integration tests (parser, text, layout, geometry)
│   └── python/               # Unit + integration tests
└── docs/                     # Sphinx docs

License

Apache 2.0
