High-performance PDF processing: extract text, tables, images with a Rust + C core.

botl-pdf

High-performance PDF text extraction library with a custom Rust core and Python bindings. No dependency on poppler, pdfium, or pdfbox — the entire PDF parsing and text extraction pipeline is written from scratch.

Features

Fast text extraction with layout analysis
Character-level output with bounding boxes, fonts, colors, and styles
Layout-preserving text extraction (spatial whitespace)
Table of contents (TOC/outline) extraction with page numbers
Document metadata extraction (title, author, dates, etc.)
Geometric element extraction (lines, rectangles)
Configurable layout parameters (word spacing, line grouping, reading order)
Run-aware de-interleaving for correct reading order on complex PDFs
Pythonic API with type hints throughout
CLI for common operations
Zero external PDF library dependencies

Install

pip install botlpdf

Build from source (requires Rust toolchain):

pip install maturin
git clone https://github.com/Shivamjohri247/botl-pdf.git
cd botl-pdf
maturin develop --release

Quick Start

import botl_pdf

doc = botl_pdf.open("report.pdf")
text = doc.pages[0].extract_text()
print(text)

Opening Documents

From a file path

import botl_pdf

doc = botl_pdf.open("report.pdf")
print(f"Pages: {doc.num_pages}")
print(f"Encrypted: {doc.is_encrypted}")

From bytes

with open("report.pdf", "rb") as f:
    data = f.read()

doc = botl_pdf.open(data)
print(f"Pages: {doc.num_pages}")

As a context manager

with botl_pdf.open("report.pdf") as doc:
    text = doc.pages[0].extract_text()

Text Extraction

Plain text (default)

Returns clean, readable text. Blocks are separated by double newlines, lines by single newlines, words by spaces.

doc = botl_pdf.open("report.pdf")

# Single page
text = doc.pages[0].extract_text()
print(text)

# All pages
for page in doc.pages:
    print(page.extract_text())

# Subscript access (0-based, supports negative)
text_last = doc.pages[-1].extract_text()
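
Because blocks are separated by blank lines and lines by single newlines, the returned string can be split back into a nested structure with plain string operations. The helper below is a minimal sketch: `split_blocks` is a hypothetical name, not part of botl-pdf, and the sample string stands in for real `extract_text()` output.

```python
from __future__ import annotations


def split_blocks(text: str) -> list[list[str]]:
    """Split plain extracted text into blocks, each a list of lines."""
    blocks = []
    for raw_block in text.split("\n\n"):
        # Drop empty or whitespace-only lines inside a block
        lines = [ln for ln in raw_block.split("\n") if ln.strip()]
        if lines:
            blocks.append(lines)
    return blocks


# Sample string standing in for page.extract_text() output
sample = "Annual Report\n2024\n\nRevenue grew this year.\nCosts fell."
blocks = split_blocks(sample)
for i, block in enumerate(blocks):
    print(i, block)
```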

Layout-preserving text

Maintains spatial positioning using proportional spaces between words. Useful when you need to preserve visual alignment of columns, tables, or indented text.

doc = botl_pdf.open("financial_report.pdf")
page = doc.pages[0]

# Layout mode preserves spatial whitespace
layout_text = page.extract_text(layout=True)
print(layout_text)
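
One way to picture "proportional spaces" is to map each word's x-coordinate to a character column. The sketch below only illustrates that idea and is not botl-pdf's actual algorithm; `render_line` and the `char_width` value are assumptions chosen for the example.

```python
def render_line(words, char_width=6.0):
    """Place (x0, text) words on one line, padding with spaces by x position."""
    out = ""
    for x0, text in sorted(words):
        col = int(round(x0 / char_width))
        # Always leave at least one space between consecutive words
        if out and col <= len(out):
            col = len(out) + 1
        out = out.ljust(col) + text
    return out


# Two table-like rows: the second column starts at x=120pt in both
row1 = render_line([(0.0, "Revenue"), (120.0, "1,200")])
row2 = render_line([(0.0, "Costs"), (120.0, "800")])
print(row1)
print(row2)
```

Both rows place their second word at the same column, which is what keeps columns visually aligned in layout mode.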

Tuning extraction parameters

import botl_pdf

doc = botl_pdf.open("two_column.pdf")

# Tighter word grouping (merge chars closer together)
params = botl_pdf.LayoutParams(
    word_margin=1.5,   # max horizontal gap in same word (× font_size), default 2.0
    line_margin=0.5,   # max vertical gap in same block (× line height), default 0.5
    boxes_flow=0.5,    # reading order: 0.0=horizontal, 1.0=vertical, default 0.5
)

text = doc.pages[0].extract_text(layout=True, layout_params=params)
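
To make the `word_margin` comment concrete, here is a minimal sketch of gap-based word grouping: two neighbouring characters stay in the same word when the horizontal gap between them is at most `word_margin × font_size`. It illustrates the parameter's documented meaning only. The function name `group_words` and the tuple layout are assumptions, and the real layout engine is likely more involved.

```python
def group_words(chars, word_margin=2.0):
    """Group (x0, x1, font_size, text) tuples on one line into words."""
    words, current, prev_x1 = [], "", None
    for x0, x1, font_size, text in sorted(chars):
        gap = 0.0 if prev_x1 is None else x0 - prev_x1
        # Start a new word when the gap exceeds the scaled threshold
        if current and gap > word_margin * font_size:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Font size 10pt, with an 18pt gap between "Hi" and "all"
line = [(0, 6, 10, "H"), (6, 9, 10, "i"), (27, 33, 10, "a"), (33, 39, 10, "l"), (39, 42, 10, "l")]
print(group_words(line, word_margin=2.0))  # threshold 20pt: gap stays inside one word
print(group_words(line, word_margin=1.5))  # threshold 15pt: gap splits the word
```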

Exporting entire documents

from botl_pdf.export import to_text, to_markdown

# Plain text for all pages
full_text = to_text("report.pdf")

# Layout-preserved text
full_text_layout = to_text("report.pdf", layout=True)

# Markdown (pages separated by horizontal rules)
markdown = to_markdown("report.pdf")

# Specific page range only
markdown_subset = to_markdown("report.pdf", pages=range(0, 5))
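
The page-separator convention can be pictured as joining per-page text with a Markdown horizontal rule. The sketch below is not the implementation of `to_markdown`; `join_pages` is a hypothetical helper and the exact `---` separator is an assumption.

```python
def join_pages(page_texts, separator="\n\n---\n\n"):
    """Join per-page text into one Markdown string, skipping empty pages."""
    return separator.join(t.strip() for t in page_texts if t.strip())


# Strings standing in for page.extract_text() results
pages = ["Chapter 1\n\nIt begins.", "", "Chapter 2\n\nIt continues."]
markdown = join_pages(pages)
print(markdown)
```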

Character-Level Access

Each page exposes individual characters with full style information: bounding box, font name, font size, bold/italic flags, fill and stroke colors, rotation, and run ID.

Inspecting individual characters

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

for char in page.chars[:5]:
    print(f"  char={char.text!r}  "
          f"pos=({char.bbox.x0:.1f}, {char.bbox.y0:.1f})  "
          f"size={char.font_size:.1f}  "
          f"font={char.font_name}")

Output:

  char='H'  pos=(100.0, 700.0)  size=12.0  font=F1
  char='e'  pos=(108.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(115.0, 700.0)  size=12.0  font=F1
  char='l'  pos=(120.0, 700.0)  size=12.0  font=F1
  char='o'  pos=(125.0, 700.0)  size=12.0  font=F1

Finding text by style

# Find all bold characters on page 0
bold_chars = [c for c in doc.pages[0].chars if c.bold]
bold_text = "".join(c.text for c in bold_chars)

# Find characters in a specific color (e.g., red links)
red_chars = [
    c for c in doc.pages[0].chars
    if c.color and c.color[0] > 0.8 and c.color[1] < 0.2 and c.color[2] < 0.2
]

# Find large decorative initials (font size > 30)
initials = [c for c in doc.pages[0].chars if c.font_size > 30]
for c in initials:
    print(f"Decorative initial: {c.text!r} at size {c.font_size:.0f}")

Extracting text from a region

# Get all text in a specific rectangular area
x0, y0, x1, y1 = 100.0, 600.0, 400.0, 700.0

region_chars = [
    c for c in doc.pages[0].chars
    if c.bbox.x0 >= x0 and c.bbox.x1 <= x1
    and c.bbox.y0 >= y0 and c.bbox.y1 <= y1
]
region_text = "".join(c.text for c in region_chars)
print(region_text)
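
The comprehension above keeps only characters whose boxes lie fully inside the rectangle and joins them in list order. A common variation is to test each character's center point and sort by position before joining. The following is a minimal sketch that uses stand-in `Char` and `BBox` tuples rather than real botl-pdf objects. `text_in_region` is a hypothetical helper, and it assumes y grows downward, as the BBox reference table suggests by calling `(x0, y0)` the top-left corner.

```python
from collections import namedtuple

# Stand-ins for botl-pdf's Char and BBox objects (illustration only)
BBox = namedtuple("BBox", "x0 y0 x1 y1")
Char = namedtuple("Char", "text bbox")


def text_in_region(chars, x0, y0, x1, y1):
    """Join characters whose center lies inside the rectangle, in reading order."""
    hits = []
    for c in chars:
        cx = (c.bbox.x0 + c.bbox.x1) / 2
        cy = (c.bbox.y0 + c.bbox.y1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            hits.append(c)
    # Assumption: y grows downward, so smaller y0 means an earlier line
    hits.sort(key=lambda c: (round(c.bbox.y0), c.bbox.x0))
    return "".join(c.text for c in hits)


chars = [
    Char("b", BBox(110, 600, 118, 612)),
    Char("a", BBox(100, 600, 108, 612)),
    Char("z", BBox(500, 600, 508, 612)),  # outside the region
]
print(text_in_region(chars, 100.0, 590.0, 400.0, 700.0))
```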

Run ID tracking

Characters from the same text-showing operation (Tj/TJ) share a run_id. This lets you group characters by their PDF text operation — useful for debugging extraction issues or understanding the PDF's internal structure.

from collections import defaultdict

# Group characters by their source text operation
runs = defaultdict(str)
for c in doc.pages[0].chars:
    runs[c.run_id] += c.text

for run_id, text in sorted(runs.items()):
    print(f"  Run {run_id}: {text[:60]!r}")

Document Metadata

doc = botl_pdf.open("report.pdf")

meta = doc.metadata
print(f"Title:    {meta.get('title')}")
print(f"Author:   {meta.get('author')}")
print(f"Subject:  {meta.get('subject')}")
print(f"Creator:  {meta.get('creator')}")
print(f"Producer: {meta.get('producer')}")
print(f"Created:  {meta.get('creation_date')}")
print(f"Modified: {meta.get('mod_date')}")
print(f"Version:  {meta.get('version')}")

Table of Contents

doc = botl_pdf.open("book.pdf")

toc = doc.toc
for entry in toc:
    indent = "  " * entry.level
    page = entry.page_number
    print(f"{indent}{entry.title}  →  page {page}")

Output:

Preface  →  page 5
  Acknowledgments  →  page 7
Part I. Foundations  →  page 11
  Chapter 1. Introduction  →  page 13
  Chapter 2. Methods  →  page 27
Part II. Results  →  page 45
  Chapter 3. Analysis  →  page 47

Building a page lookup from TOC

# Map page numbers to their chapter titles
chapters = {}
current_chapter = None
for entry in doc.toc:
    if entry.level == 0 and entry.page_number is not None:
        current_chapter = entry.title
    if current_chapter and entry.page_number is not None:
        chapters[entry.page_number] = current_chapter

import bisect

# Find which chapter a page belongs to (None if it precedes every TOC entry)
def chapter_for_page(page_idx):
    page_nums = sorted(chapters)
    i = bisect.bisect_right(page_nums, page_idx)
    return chapters[page_nums[i - 1]] if i > 0 else None

print(f"Page 30 is in: {chapter_for_page(30)}")

Geometric Elements

Pages expose geometric lines and rectangles drawn on the PDF canvas — useful for detecting table borders, rules, decorative elements, and form fields.

Lines

page = doc.pages[0]

for line in page.lines:
    print(f"  Line ({line.x0:.1f},{line.y0:.1f}) → ({line.x1:.1f},{line.y1:.1f})  "
          f"width={line.line_width:.1f}")

Rectangles

for rect in page.rects:
    fill = rect.fill_color
    stroke = rect.stroke_color
    print(f"  Rect ({rect.bbox.x0:.1f},{rect.bbox.y0:.1f})-"
          f"({rect.bbox.x1:.1f},{rect.bbox.y1:.1f})  "
          f"stroke={stroke}  fill={fill}")

Detecting horizontal rules

# Find horizontal lines (useful for detecting separators/tables)
h_rules = [
    line for line in page.lines
    if abs(line.y1 - line.y0) < 1.0 and abs(line.x1 - line.x0) > 50.0
]

for rule in h_rules:
    print(f"Horizontal rule at y={rule.y0:.1f} from x={rule.x0:.1f} to x={rule.x1:.1f}")
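
The same tolerance idea extends to vertical rules, and comparing absolute differences keeps the test independent of the direction in which a line was drawn. This is a minimal sketch with plain tuples standing in for `GeomLine` objects; `classify_line` and its thresholds are assumptions for illustration.

```python
def classify_line(x0, y0, x1, y1, tol=1.0, min_len=50.0):
    """Return 'horizontal', 'vertical', or 'other' for a line segment."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    if dy < tol and dx > min_len:
        return "horizontal"
    if dx < tol and dy > min_len:
        return "vertical"
    return "other"


segments = [
    (72.0, 700.0, 540.0, 700.2),   # left-to-right rule
    (540.0, 650.0, 72.0, 650.0),   # same kind of rule drawn right-to-left
    (72.0, 700.0, 72.0, 500.0),    # vertical border
    (72.0, 700.0, 90.0, 720.0),    # short diagonal stroke
]
for seg in segments:
    print(classify_line(*seg))
```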

Page Properties

doc = botl_pdf.open("report.pdf")

for i, page in enumerate(doc.pages):
    print(f"Page {i}: {page.width:.0f}×{page.height:.0f}pt  "
          f"rotation={page.rotation}°  "
          f"label={page.label!r}")

Output:

Page 0: 612×792pt  rotation=0°  label='1'
Page 1: 612×792pt  rotation=0°  label='2'

Common page sizes:

Letter: 612 × 792 pt (8.5" × 11")
A4: 595 × 842 pt (210mm × 297mm)
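
Page dimensions are reported in PDF points, where 1 pt = 1/72 inch. The conversions below reproduce the sizes listed above; the helper names are illustrative and not part of botl-pdf.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def pt_to_inch(points: float) -> float:
    """Convert PDF points to inches."""
    return points / POINTS_PER_INCH


def pt_to_mm(points: float) -> float:
    """Convert PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


print(pt_to_inch(612), pt_to_inch(792))            # Letter in inches
print(round(pt_to_mm(595)), round(pt_to_mm(842)))  # A4 in mm, rounded
```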

Visual Debugging

Requires Pillow. Draws bounding boxes and geometric elements on a rendered page image — useful for debugging extraction issues or understanding PDF layout.

pip install botlpdf[debug]

from botl_pdf.debug import VisualDebugger
import botl_pdf

doc = botl_pdf.open("report.pdf")
page = doc.pages[0]

debugger = VisualDebugger(page)

# Draw character bounding boxes (red)
img = debugger.draw_chars(resolution=150)
img.save("debug_chars.png")

# Draw geometric lines (blue)
img = debugger.draw_lines(resolution=150)
img.save("debug_lines.png")

# Draw geometric rectangles (green)
img = debugger.draw_rects(resolution=150)
img.save("debug_rects.png")

# All elements layered together
img = debugger.draw_all(resolution=150)
img.save("debug_all.png")

CLI

pip install botlpdf[cli]

Extract text

# To stdout
botl-pdf text report.pdf

# To file
botl-pdf text report.pdf --output text.txt

# Specific pages
botl-pdf text report.pdf --pages 1-5

# Layout-preserved
botl-pdf text report.pdf --layout

Show metadata

botl-pdf info report.pdf

Output:

{
  "version": "1.4",
  "page_count": 42,
  "encrypted": false,
  "title": "Annual Report 2024",
  "author": "Acme Corp",
  "creator": "LaTeX",
  "producer": "pdfTeX-1.40"
}

Export

# Markdown
botl-pdf export report.pdf --format markdown --output report.md

# Plain text
botl-pdf export report.pdf --format text --output report.txt

API Reference

`botl_pdf.open(path_or_bytes, *, password=None, lazy=True) -> Document`

Open a PDF from a file path (str) or raw bytes.

`Document`

Property / Method	Type	Description
`.metadata`	`dict`	Metadata fields: title, author, subject, keywords, creator, producer, creation_date, mod_date, version, page_count
`.num_pages`	`int`	Number of pages
`.is_encrypted`	`bool`	Whether the document is encrypted
`.toc`	`list[TOCEntry]`	Table of contents / outline bookmarks
`.pages`	`PageCollection`	Iterable, subscriptable page access
`doc[i]`	`Page`	Shortcut for `doc.pages[i]` (supports negative indices)
`len(doc)`	`int`	Same as `.num_pages`

`Page` (via `doc.pages[i]`)

Property / Method	Type	Description
`.extract_text(layout=False, layout_params=None)`	`str`	Extract text (plain or layout-preserved)
`.chars`	`list[Char]`	All characters with full style info
`.lines`	`list[GeomLine]`	Geometric lines on the page
`.rects`	`list[GeomRect]`	Geometric rectangles on the page
`.width`	`float`	Page width in points
`.height`	`float`	Page height in points
`.rotation`	`int`	Rotation in degrees (0, 90, 180, 270)
`.page_number`	`int`	Zero-based page index
`.label`	`str`	Page label string (e.g. "iii", "A-1")

`Char`

Property	Type	Description
`.text`	`str`	Unicode character
`.bbox`	`BBox`	Bounding box
`.font_name`	`str`	Font resource name (e.g. "F1")
`.font_size`	`float`	Font size in points
`.bold`	`bool`	Bold flag
`.italic`	`bool`	Italic flag
`.color`	`tuple[float, float, float] or None`	Fill color (RGB, 0.0-1.0)
`.stroking_color`	`tuple[float, float, float] or None`	Stroke color (RGB, 0.0-1.0)
`.rotation`	`float`	Rotation in degrees
`.run_id`	`int`	Text operation ID (chars from same Tj/TJ share this)

`BBox`

Property / Method	Type	Description
`.x0`, `.y0`	`float`	Top-left corner
`.x1`, `.y1`	`float`	Bottom-right corner
`.width`	`float`	Width (x1 - x0)
`.height`	`float`	Height (y1 - y0)
`.center()`	`(float, float)`	Center point
`.area()`	`float`	Area

`TOCEntry`

Property	Type	Description
`.title`	`str`	Outline entry title
`.level`	`int`	Nesting depth (0 = top-level)
`.page_number`	`int or None`	0-indexed destination page (None if unresolvable)
`.dest`	`str or None`	Raw destination string

`GeomLine`

Property	Type	Description
`.x0`, `.y0`	`float`	Start point
`.x1`, `.y1`	`float`	End point
`.line_width`	`float`	Stroke width
`.color`	`tuple or None`	RGB color (0.0-1.0)

`GeomRect`

Property	Type	Description
`.bbox`	`BBox`	Bounding box
`.line_width`	`float`	Stroke width
`.stroke_color`	`tuple or None`	Stroke RGB color
`.fill_color`	`tuple or None`	Fill RGB color

`LayoutParams`

Parameter	Type	Default	Description
`word_margin`	`float`	`2.0`	Max horizontal gap between chars in same word, as a multiple of font size
`line_margin`	`float`	`0.5`	Max vertical gap between lines in same block, as a multiple of line height
`boxes_flow`	`float`	`0.5`	Reading-order direction (0.0 = strict horizontal, 1.0 = strict vertical)

params = botl_pdf.LayoutParams(word_margin=1.5, line_margin=0.3, boxes_flow=0.0)
text = page.extract_text(layout=True, layout_params=params)

Architecture

PDF bytes
  → Parser (nom tokenizer + recursive-descent objects)
    → Content stream interpreter (Tj/TJ/q/Q/cm operators)
      → Character extraction (CMap, fonts, glyph widths)
        → Layout analysis (chars → words → lines → blocks)
          → Reading order (column detection, run de-interleaving)
            → Text output (plain or layout-preserved)

The pipeline is entirely custom Rust — no dependency on poppler, pdfium, pdfbox, or any other PDF library.

Key design decisions:

Run-aware de-interleaving — Each Tj/TJ text operation tags characters with a run_id. When PDF producers interleave characters from different operations at alternating x-positions, the layout engine detects this and groups by run, preserving correct reading order.
Font-band separation — Within a line, characters are grouped by font size to handle decorative initials and mixed-size text on the same visual line.
Lazy extraction — Page content is decoded on first access and cached. The parsed Document is shared across pages via Arc<Mutex>, so there's no per-page re-parsing.
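
The run-aware de-interleaving idea can be sketched in a few lines. When characters are sorted purely by x-position and the run IDs keep switching back and forth, position order scrambles the text, so the characters are regrouped by run instead. This is an illustrative approximation in Python, not the Rust implementation. The function names, the tuple layout, the switch-count heuristic, and the space inserted between runs are all assumptions.

```python
def count_run_switches(chars):
    """Count how often run_id changes between x-sorted neighbours."""
    ordered = sorted(chars, key=lambda c: c[0])
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a[1] != b[1])


def line_text(chars):
    """Join (x0, run_id, text) chars, grouping by run if runs interleave."""
    run_ids = {c[1] for c in chars}
    # Cleanly separated runs switch at most len(run_ids) - 1 times
    if count_run_switches(chars) > len(run_ids) - 1:
        # Interleaved: keep each run together, ordered by its leftmost char
        runs = {}
        for c in sorted(chars, key=lambda c: c[0]):
            runs.setdefault(c[1], []).append(c)
        ordered_runs = sorted(runs.values(), key=lambda r: r[0][0])
        return " ".join("".join(c[2] for c in run) for run in ordered_runs)
    return "".join(c[2] for c in sorted(chars, key=lambda c: c[0]))


# Two runs whose characters alternate in x: naive x-sorting gives "AaBbCc"
interleaved = [(0, 1, "A"), (2, 1, "B"), (4, 1, "C"), (1, 2, "a"), (3, 2, "b"), (5, 2, "c")]
print(line_text(interleaved))
```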

Benchmarks

Tested against PyMuPDF on real-world PDFs (textbooks, novels, academic papers). v0.2.0 includes performance optimizations and improved word boundary detection.

Text Extraction Quality

PDF	Pages	botl-pdf words	PyMuPDF words	Word coverage
Acrimonious (novel)	408	118,767	110,314	107.7%
Agentic Mesh (tech)	558	136,669	132,386	103.2%
Azure Fundamentals	576	89,490	87,183	102.6%
Data Science (textbook)	438	100,594	93,286	107.8%
Discrete Math (textbook)	565	93,691	89,968	104.1%
Mastering AI System Design	1,038	85,854	82,608	103.9%
System Design Interview	341	47,769	46,523	102.7%
American Revolution	293	107,411	99,897	107.5%
Rust Programming 3E	806	203,941	196,748	103.7%
Total (all 17 PDFs)	6,663	1,399,763	1,331,742	105.1%

Character-level coverage is 99.7% of PyMuPDF's, and botl-pdf extracts about 5% more words overall.

Performance

PDF	Pages	botl-pdf time	PyMuPDF time	Ratio (botl-pdf / PyMuPDF)
Mastering AI System Design	1,038	0.56s	0.72s	0.78x (faster)
System Design Interview	341	0.21s	0.31s	0.66x (faster)
Discrete Math	565	0.45s	0.45s	1.00x (equal)
Faking Fore-Ever (novel)	196	0.21s	0.21s	0.98x (faster)
American Revolution	293	0.49s	0.39s	1.27x (slower)
Rust Programming 3E	806	0.91s	0.73s	1.23x (slower)
Overall (17 PDFs)	6,663	6.40s	5.86s	1.09x (slower)

Overall, botl-pdf is about 9% slower than PyMuPDF across the 17 PDFs. It is faster on 5 of them and competitive on the rest.

What changed in v0.2.0

~2x faster than v0.1.x through Arc-based caching, cross-page font cache, zlib-ng backend, and reduced cloning
Fixed word boundary detection for PDFs that encode spaces as position gaps instead of literal space characters
Character coverage improved from partial to 99.7% of PyMuPDF across diverse PDF types

Development

# Set up environment
python -m venv .venv && source .venv/bin/activate
pip install maturin pytest

# Build Rust extension in release mode
maturin develop --release

# Run Rust tests (198 tests)
cd rust && cargo test

# Run Python tests
pytest tests/python/

# Run benchmarks
pytest tests/python/benchmarks/ --benchmark-only

Project structure

botl-pdf/
├── rust/
│   ├── botl-pdf-core/        # Core engine (parser, text, layout, codecs)
│   ├── botl-pdf-python/      # PyO3 bindings → _core native module
│   └── botl-pdf-csys/        # Image codec FFI (JPEG, JPEG2000)
├── python/botl_pdf/          # High-level Python API
│   ├── document.py           # Document, PageCollection
│   ├── page.py               # Page wrapper
│   ├── export.py             # to_text(), to_markdown()
│   ├── debug.py              # VisualDebugger (Pillow overlays)
│   ├── tables.py             # Table/TableCell dataclasses
│   └── cli/main.py           # CLI: text, info, export
├── tests/
│   ├── rust/                 # Integration tests (parser, text, layout, geometry)
│   └── python/               # Unit + integration tests
└── docs/                     # Sphinx docs

License

Apache 2.0

Release history

0.3.0 (Apr 7, 2026)
0.2.0 (Apr 5, 2026), this version
0.1.2 (Apr 5, 2026)

Download files

Source Distribution

botlpdf-0.2.0.tar.gz (81.5 kB)

Uploaded Apr 5, 2026 Source

Built Distribution

botlpdf-0.2.0-cp38-abi3-manylinux_2_34_x86_64.whl (620.4 kB)

Uploaded Apr 5, 2026 CPython 3.8+ manylinux: glibc 2.34+ x86-64

File details

Details for the file botlpdf-0.2.0.tar.gz.

File metadata

Download URL: botlpdf-0.2.0.tar.gz
Upload date: Apr 5, 2026
Size: 81.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for botlpdf-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`de6a5d7a1e8a88b4d6f0163efb241a27c4eb27dab304e6711c404fced490b38e`
MD5	`17b8ac9d8d2f33f3d90aa5876799d538`
BLAKE2b-256	`cec2864518e0da9efdc788745a542c4a2a7a76cb70b5269aa92d97e40d82b2c7`

File details

Details for the file botlpdf-0.2.0-cp38-abi3-manylinux_2_34_x86_64.whl.

File metadata

Download URL: botlpdf-0.2.0-cp38-abi3-manylinux_2_34_x86_64.whl
Upload date: Apr 5, 2026
Size: 620.4 kB
Tags: CPython 3.8+, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for botlpdf-0.2.0-cp38-abi3-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`80dd8136038300750156eadc08fcbecfccab26d4bd8e98136b4305570dcd27c5`
MD5	`36ee4c3148fde93abb5839be7a0f3a6c`
BLAKE2b-256	`8794e55cafe31fb66e2e1c6d128eb827b6abc803dc9de877580ed5547b207165`
