High-performance PDF processing: extract text, tables, images with a Rust + C core.
botl-pdf
High-performance PDF text extraction library with a custom Rust core and Python bindings. No dependency on poppler, pdfium, or pdfbox — the entire PDF parsing and text extraction pipeline is written from scratch.
Features
- Fast text extraction with layout analysis
- Character-level output with bounding boxes, fonts, colors, and styles
- Layout-preserving text extraction (spatial whitespace)
- Table of contents (TOC/outline) extraction with page numbers
- Document metadata extraction (title, author, dates, etc.)
- Geometric element extraction (lines, rectangles)
- Configurable layout parameters (word spacing, line grouping, reading order)
- Run-aware de-interleaving for correct reading order on complex PDFs
- Pythonic API with type hints throughout
- CLI for common operations
- Zero external PDF library dependencies
Install
pip install botlpdf
Build from source (requires Rust toolchain):
pip install maturin
git clone https://github.com/Shivamjohri247/botl-pdf.git
cd botl-pdf
maturin develop --release
Quick Start
import botl_pdf
doc = botl_pdf.open("report.pdf")
text = doc.pages[0].extract_text()
print(text)
Opening Documents
From a file path
import botl_pdf
doc = botl_pdf.open("report.pdf")
print(f"Pages: {doc.num_pages}")
print(f"Encrypted: {doc.is_encrypted}")
From bytes
with open("report.pdf", "rb") as f:
    data = f.read()
doc = botl_pdf.open(data)
print(f"Pages: {doc.num_pages}")
As a context manager
with botl_pdf.open("report.pdf") as doc:
    text = doc.pages[0].extract_text()
Text Extraction
Plain text (default)
Returns clean, readable text. Blocks are separated by double newlines, lines by single newlines, words by spaces.
doc = botl_pdf.open("report.pdf")
# Single page
text = doc.pages[0].extract_text()
print(text)
# All pages
for page in doc.pages:
    print(page.extract_text())
# Subscript access (0-based, supports negative)
text_last = doc.pages[-1].extract_text()
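Because plain-text output separates blocks with blank lines, lines with single newlines, and words with spaces, it can be split back into a nested structure with ordinary string operations. The helper below is a minimal sketch of that idea; `split_blocks` is a hypothetical name, not part of botl_pdf, and it works on any string in that shape.

```python
def split_blocks(text):
    """Split plain-text output into blocks -> lines -> words.

    Assumes the separators described above: a blank line between
    blocks, a newline between lines, and spaces between words.
    """
    blocks = []
    for raw_block in text.split("\n\n"):
        # Skip empty lines so stray newlines do not create empty entries.
        lines = [line.split() for line in raw_block.split("\n") if line.strip()]
        if lines:
            blocks.append(lines)
    return blocks


# Illustrative string, not real extraction output.
sample = "Annual Report\n2024 Edition\n\nRevenue grew in every region."
for block in split_blocks(sample):
    print(block)
```

Each block comes back as a list of lines, and each line as a list of words, which is usually enough for simple per-paragraph processing.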
Layout-preserving text
Maintains spatial positioning using proportional spaces between words. Useful when you need to preserve visual alignment of columns, tables, or indented text.
doc = botl_pdf.open("financial_report.pdf")
page = doc.pages[0]
# Layout mode preserves spatial whitespace
layout_text = page.extract_text(layout=True)
print(layout_text)
Tuning extraction parameters
import botl_pdf
doc = botl_pdf.open("two_column.pdf")
# Tighter word grouping: characters must sit closer together to be merged into one word
params = botl_pdf.LayoutParams(
    word_margin=1.5,  # max horizontal gap in same word (× font_size), default 2.0
    line_margin=0.5,  # max vertical gap in same block (× line height), default 0.5
    boxes_flow=0.5,   # reading order: 0.0=horizontal, 1.0=vertical, default 0.5
)
text = doc.pages[0].extract_text(layout=True, layout_params=params)
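To build intuition for `word_margin`, the sketch below applies the rule stated in the comment above: adjacent characters stay in the same word while the horizontal gap between them is at most `word_margin` × font size. It uses plain tuples rather than botl_pdf objects, and `group_words` is a hypothetical helper that only illustrates the rule; it is not the library's layout engine.

```python
def group_words(chars, word_margin=2.0):
    """Group (text, x0, x1, font_size) tuples, sorted by x0, into words.

    A new word starts when the gap to the previous character exceeds
    word_margin * font_size of the previous character.
    """
    words, current, prev = [], "", None
    for text, x0, x1, size in chars:
        if prev is not None and x0 - prev[2] > word_margin * prev[3]:
            words.append(current)
            current = ""
        current += text
        prev = (text, x0, x1, size)
    if current:
        words.append(current)
    return words


# Illustrative data: 10 pt glyphs with a 12 pt gap between "b" and "c".
line = [("a", 0, 5, 10), ("b", 6, 11, 10), ("c", 23, 28, 10), ("d", 29, 34, 10)]
print(group_words(line, word_margin=2.0))  # gap 12 <= 20, so one word
print(group_words(line, word_margin=1.0))  # gap 12 > 10, so two words
```

Lowering the margin splits words more eagerly, which can help on pages where columns sit close together.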
Exporting entire documents
from botl_pdf.export import to_text, to_markdown
# Plain text for all pages
full_text = to_text("report.pdf")
# Layout-preserved text
full_text_layout = to_text("report.pdf", layout=True)
# Markdown (pages separated by horizontal rules)
markdown = to_markdown("report.pdf")
# Specific page range only
markdown_subset = to_markdown("report.pdf", pages=range(0, 5))
Character-Level Access
Each page exposes individual characters with full style information: bounding box, font name, font size, bold/italic flags, fill and stroke colors, rotation, and run ID.
Inspecting individual characters
doc = botl_pdf.open("report.pdf")
page = doc.pages[0]
for char in page.chars[:5]:
    print(f" char={char.text!r} "
          f"pos=({char.bbox.x0:.1f}, {char.bbox.y0:.1f}) "
          f"size={char.font_size:.1f} "
          f"font={char.font_name}")
Output:
char='H' pos=(100.0, 700.0) size=12.0 font=F1
char='e' pos=(108.0, 700.0) size=12.0 font=F1
char='l' pos=(115.0, 700.0) size=12.0 font=F1
char='l' pos=(120.0, 700.0) size=12.0 font=F1
char='o' pos=(125.0, 700.0) size=12.0 font=F1
Finding text by style
# Find all bold characters on page 0
bold_chars = [c for c in doc.pages[0].chars if c.bold]
bold_text = "".join(c.text for c in bold_chars)
# Find characters in a specific color (e.g., red links)
red_chars = [
    c for c in doc.pages[0].chars
    if c.color and c.color[0] > 0.8 and c.color[1] < 0.2 and c.color[2] < 0.2
]
# Find large decorative initials (font size > 30)
initials = [c for c in doc.pages[0].chars if c.font_size > 30]
for c in initials:
    print(f"Decorative initial: {c.text!r} at size {c.font_size:.0f}")
Extracting text from a region
# Get all text in a specific rectangular area
x0, y0, x1, y1 = 100.0, 600.0, 400.0, 700.0
region_chars = [
    c for c in doc.pages[0].chars
    if c.bbox.x0 >= x0 and c.bbox.x1 <= x1
    and c.bbox.y0 >= y0 and c.bbox.y1 <= y1
]
region_text = "".join(c.text for c in region_chars)
print(region_text)
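The filter above keeps only characters whose bounding box lies fully inside the region, so a glyph that straddles the edge is dropped. The sketch below wraps the same test in a reusable function and adds a more forgiving center-point mode. `Box` and `Glyph` are stand-ins that mirror the documented `bbox` and `text` fields so the example runs without a PDF; `text_in_region` and its `mode` parameter are hypothetical, not part of botl_pdf.

```python
from dataclasses import dataclass


@dataclass
class Box:  # stand-in exposing the same x0/y0/x1/y1 fields as BBox
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Glyph:  # stand-in exposing the same text/bbox fields as Char
    text: str
    bbox: Box


def text_in_region(chars, region, mode="contained"):
    """Join the text of chars inside region (x0, y0, x1, y1).

    mode="contained": the whole bounding box must lie inside the region.
    mode="center": only the box's center point must lie inside.
    """
    rx0, ry0, rx1, ry1 = region
    picked = []
    for c in chars:
        b = c.bbox
        if mode == "center":
            cx, cy = (b.x0 + b.x1) / 2, (b.y0 + b.y1) / 2
            inside = rx0 <= cx <= rx1 and ry0 <= cy <= ry1
        else:
            inside = b.x0 >= rx0 and b.x1 <= rx1 and b.y0 >= ry0 and b.y1 <= ry1
        if inside:
            picked.append(c.text)
    return "".join(picked)


glyphs = [
    Glyph("H", Box(98.0, 650.0, 106.0, 662.0)),   # straddles the left edge
    Glyph("i", Box(106.0, 650.0, 110.0, 662.0)),
    Glyph("!", Box(500.0, 650.0, 504.0, 662.0)),  # outside the region
]
region = (100.0, 600.0, 400.0, 700.0)
print(text_in_region(glyphs, region))                 # strict containment
print(text_in_region(glyphs, region, mode="center"))  # center-point test
```

With real pages, the same function should accept `doc.pages[0].chars` directly, since it only reads `text` and `bbox`.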
Run ID tracking
Characters from the same text-showing operation (Tj/TJ) share a run_id. This lets you group characters by their PDF text operation — useful for debugging extraction issues or understanding the PDF's internal structure.
from collections import defaultdict
# Group characters by their source text operation
runs = defaultdict(str)
for c in doc.pages[0].chars:
    runs[c.run_id] += c.text
for run_id, text in sorted(runs.items()):
    print(f" Run {run_id}: {text[:60]!r}")
Document Metadata
doc = botl_pdf.open("report.pdf")
meta = doc.metadata
print(f"Title: {meta.get('title')}")
print(f"Author: {meta.get('author')}")
print(f"Subject: {meta.get('subject')}")
print(f"Creator: {meta.get('creator')}")
print(f"Producer: {meta.get('producer')}")
print(f"Created: {meta.get('creation_date')}")
print(f"Modified: {meta.get('mod_date')}")
print(f"Version: {meta.get('version')}")
Table of Contents
doc = botl_pdf.open("book.pdf")
toc = doc.toc
for entry in toc:
    indent = " " * entry.level
    page = entry.page_number
    print(f"{indent}{entry.title} → page {page}")
Output:
Preface → page 5
Acknowledgments → page 7
Part I. Foundations → page 11
Chapter 1. Introduction → page 13
Chapter 2. Methods → page 27
Part II. Results → page 45
Chapter 3. Analysis → page 47
Building a page lookup from TOC
# Map page numbers to their chapter titles
chapters = {}
current_chapter = None
for entry in doc.toc:
    if entry.level == 0 and entry.page_number is not None:
        current_chapter = entry.title
    if current_chapter and entry.page_number is not None:
        chapters[entry.page_number] = current_chapter
# Find which chapter a page belongs to
def chapter_for_page(page_idx):
    page_nums = sorted(chapters.keys())
    for i, p in enumerate(page_nums):
        if page_idx < p:
            # Pages before the first TOC entry belong to no chapter
            return chapters[page_nums[i - 1]] if i > 0 else None
    # Past the last entry; also handles a document with no usable TOC
    return chapters[page_nums[-1]] if page_nums else None
print(f"Page 30 is in: {chapter_for_page(30)}")
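The same lookup can be written with the standard-library `bisect` module, which finds the last start page at or before the requested page without a manual loop. This is a sketch over a plain dict shaped like `chapters`; `chapter_for` and the sample mapping are illustrative only.

```python
from bisect import bisect_right


def chapter_for(page_idx, chapters):
    """Return the chapter whose start page is the latest one <= page_idx."""
    starts = sorted(chapters)
    pos = bisect_right(starts, page_idx)
    # pos == 0 means the page comes before every known start page.
    return chapters[starts[pos - 1]] if pos else None


# Illustrative mapping of start page -> chapter title.
toc_pages = {11: "Part I. Foundations", 45: "Part II. Results"}
print(chapter_for(30, toc_pages))
print(chapter_for(45, toc_pages))
print(chapter_for(3, toc_pages))
```

For repeated lookups on a large outline, sorting the start pages once and reusing the list avoids re-sorting on every call.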
Geometric Elements
Pages expose geometric lines and rectangles drawn on the PDF canvas — useful for detecting table borders, rules, decorative elements, and form fields.
Lines
page = doc.pages[0]
for line in page.lines:
    print(f" Line ({line.x0:.1f},{line.y0:.1f}) → ({line.x1:.1f},{line.y1:.1f}) "
          f"width={line.line_width:.1f}")
Rectangles
for rect in page.rects:
    fill = rect.fill_color
    stroke = rect.stroke_color
    print(f" Rect ({rect.bbox.x0:.1f},{rect.bbox.y0:.1f})-"
          f"({rect.bbox.x1:.1f},{rect.bbox.y1:.1f}) "
          f"stroke={stroke} fill={fill}")
Detecting horizontal rules
# Find horizontal lines (useful for detecting separators/tables)
h_rules = [
    line for line in page.lines
    # abs() on the length also catches lines drawn right to left
    if abs(line.y1 - line.y0) < 1.0 and abs(line.x1 - line.x0) > 50.0
]
for rule in h_rules:
    print(f"Horizontal rule at y={rule.y0:.1f} from x={rule.x0:.1f} to x={rule.x1:.1f}")
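Horizontal rules found this way are often the row borders of a table. As a rough sketch of a possible next step, the code below merges rules whose y positions fall within a tolerance and returns the distinct row boundaries. `Segment` is a stand-in with the same x0/y0/x1/y1 fields as the line objects above; `row_boundaries` and its `min_length` and `tol` parameters are hypothetical and not part of botl_pdf.

```python
from dataclasses import dataclass


@dataclass
class Segment:  # stand-in mirroring the coordinate fields of a page line
    x0: float
    y0: float
    x1: float
    y1: float


def row_boundaries(lines, min_length=50.0, tol=1.0):
    """Return sorted y positions of horizontal rules, merging near-duplicates."""
    ys = sorted(
        (ln.y0 + ln.y1) / 2
        for ln in lines
        if abs(ln.y1 - ln.y0) < tol and abs(ln.x1 - ln.x0) > min_length
    )
    merged = []
    for y in ys:
        if merged and y - merged[-1] <= tol:
            continue  # same rule drawn twice, or split into pieces
        merged.append(y)
    return merged


segments = [
    Segment(72, 700.0, 540, 700.0),
    Segment(72, 700.4, 300, 700.4),  # near-duplicate of the rule above
    Segment(72, 680.0, 540, 680.0),
    Segment(72, 660.0, 72, 700.0),   # vertical, ignored
    Segment(72, 640.0, 90, 640.0),   # too short, ignored
]
print(row_boundaries(segments))
```

Consecutive boundaries then bracket one table row each; the same approach applied to vertical lines would give column boundaries.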
Page Properties
doc = botl_pdf.open("report.pdf")
for i, page in enumerate(doc.pages):
    print(f"Page {i}: {page.width:.0f}×{page.height:.0f}pt "
          f"rotation={page.rotation}° "
          f"label={page.label!r}")
Output:
Page 0: 612×792pt rotation=0° label='1'
Page 1: 612×792pt rotation=0° label='2'
Common page sizes:
- Letter: 612 × 792 pt (8.5" × 11")
- A4: 595 × 842 pt (210mm × 297mm)
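Page dimensions are reported in PDF points, where 72 points equal one inch. The short sketch below converts points to inches and millimetres to confirm the sizes listed above; the helper names are illustrative and not part of botl_pdf.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def pt_to_inches(points):
    """Convert PDF points to inches."""
    return points / POINTS_PER_INCH


def pt_to_mm(points):
    """Convert PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


print(pt_to_inches(612), pt_to_inches(792))        # Letter
print(round(pt_to_mm(595)), round(pt_to_mm(842)))  # A4, rounded to whole mm
```

The A4 values are rounded because 595 × 842 pt is itself a whole-point approximation of 210 mm × 297 mm.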
Visual Debugging
Requires Pillow. Draws bounding boxes and geometric elements on a rendered page image — useful for debugging extraction issues or understanding PDF layout.
pip install "botlpdf[debug]"
from botl_pdf.debug import VisualDebugger
import botl_pdf
doc = botl_pdf.open("report.pdf")
page = doc.pages[0]
debugger = VisualDebugger(page)
# Draw character bounding boxes (red)
img = debugger.draw_chars(resolution=150)
img.save("debug_chars.png")
# Draw geometric lines (blue)
img = debugger.draw_lines(resolution=150)
img.save("debug_lines.png")
# Draw geometric rectangles (green)
img = debugger.draw_rects(resolution=150)
img.save("debug_rects.png")
# All elements layered together
img = debugger.draw_all(resolution=150)
img.save("debug_all.png")
CLI
pip install "botlpdf[cli]"
Extract text
# To stdout
botl-pdf text report.pdf
# To file
botl-pdf text report.pdf --output text.txt
# Specific pages
botl-pdf text report.pdf --pages 1-5
# Layout-preserved
botl-pdf text report.pdf --layout
Show metadata
botl-pdf info report.pdf
Output:
{
"version": "1.4",
"page_count": 42,
"encrypted": false,
"title": "Annual Report 2024",
"author": "Acme Corp",
"creator": "LaTeX",
"producer": "pdfTeX-1.40"
}
Export
# Markdown
botl-pdf export report.pdf --format markdown --output report.md
# Plain text
botl-pdf export report.pdf --format text --output report.txt
API Reference
botl_pdf.open(path_or_bytes, *, password=None, lazy=True) -> Document
Open a PDF from a file path (str) or raw bytes.
Document
| Property / Method | Type | Description |
|---|---|---|
| .metadata | dict | Metadata fields: title, author, subject, keywords, creator, producer, creation_date, mod_date, version, page_count |
| .num_pages | int | Number of pages |
| .is_encrypted | bool | Whether the document is encrypted |
| .toc | list[TOCEntry] | Table of contents / outline bookmarks |
| .pages | PageCollection | Iterable, subscriptable page access |
| doc[i] | PyPage | Shortcut for doc.pages[i] (supports negative indices) |
| len(doc) | int | Same as .num_pages |
Page (via doc.pages[i])
| Property / Method | Type | Description |
|---|---|---|
| .extract_text(layout=False, layout_params=None) | str | Extract text (plain or layout-preserved) |
| .chars | list[Char] | All characters with full style info |
| .lines | list[GeomLine] | Geometric lines on the page |
| .rects | list[GeomRect] | Geometric rectangles on the page |
| .width | float | Page width in points |
| .height | float | Page height in points |
| .rotation | int | Rotation in degrees (0, 90, 180, 270) |
| .page_number | int | Zero-based page index |
| .label | str | Page label string (e.g. "iii", "A-1") |
Char
| Property | Type | Description |
|---|---|---|
| .text | str | Unicode character |
| .bbox | BBox | Bounding box |
| .font_name | str | Font resource name (e.g. "F1") |
| .font_size | float | Font size in points |
| .bold | bool | Bold flag |
| .italic | bool | Italic flag |
| .color | tuple[float, float, float] or None | Fill color (RGB, 0.0-1.0) |
| .stroking_color | tuple[float, float, float] or None | Stroke color (RGB, 0.0-1.0) |
| .rotation | float | Rotation in degrees |
| .run_id | int | Text operation ID (chars from same Tj/TJ share this) |
BBox
| Property / Method | Type | Description |
|---|---|---|
| .x0, .y0 | float | Top-left corner |
| .x1, .y1 | float | Bottom-right corner |
| .width | float | Width (x1 - x0) |
| .height | float | Height (y1 - y0) |
| .center() | (float, float) | Center point |
| .area() | float | Area |
TOCEntry
| Property | Type | Description |
|---|---|---|
| .title | str | Outline entry title |
| .level | int | Nesting depth (0 = top-level) |
| .page_number | int or None | 0-indexed destination page (None if unresolvable) |
| .dest | str or None | Raw destination string |
GeomLine
| Property | Type | Description |
|---|---|---|
| .x0, .y0 | float | Start point |
| .x1, .y1 | float | End point |
| .line_width | float | Stroke width |
| .color | tuple or None | RGB color (0.0-1.0) |
GeomRect
| Property | Type | Description |
|---|---|---|
| .bbox | BBox | Bounding box |
| .line_width | float | Stroke width |
| .stroke_color | tuple or None | Stroke RGB color |
| .fill_color | tuple or None | Fill RGB color |
LayoutParams
| Parameter | Type | Default | Description |
|---|---|---|---|
| word_margin | float | 2.0 | Max horizontal gap between chars in same word, as a multiple of font size |
| line_margin | float | 0.5 | Max vertical gap between lines in same block, as a multiple of line height |
| boxes_flow | float | 0.5 | Reading-order direction (0.0 = strict horizontal, 1.0 = strict vertical) |
params = botl_pdf.LayoutParams(word_margin=1.5, line_margin=0.3, boxes_flow=0.0)
text = page.extract_text(layout=True, layout_params=params)
Architecture
PDF bytes
→ Parser (nom tokenizer + recursive-descent objects)
→ Content stream interpreter (Tj/TJ/q/Q/cm operators)
→ Character extraction (CMap, fonts, glyph widths)
→ Layout analysis (chars → words → lines → blocks)
→ Reading order (column detection, run de-interleaving)
→ Text output (plain or layout-preserved)
The pipeline is entirely custom Rust — no dependency on poppler, pdfium, pdfbox, or any other PDF library.
Key design decisions:
- Run-aware de-interleaving: Each Tj/TJ text operation tags characters with a run_id. When PDF producers interleave characters from different operations at alternating x-positions, the layout engine detects this and groups by run, preserving correct reading order.
- Font-band separation: Within a line, characters are grouped by font size to handle decorative initials and mixed-size text on the same visual line.
- Lazy extraction: Page content is decoded on first access and cached. The parsed Document is shared across pages via Arc<Mutex>, so there's no per-page re-parsing.
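To make the run-aware de-interleaving idea concrete, here is a small Python sketch of one way such a step could work. It is a guess at the general approach, not the Rust implementation: the switch-counting heuristic, the tuple layout, and the `deinterleave` name are all assumptions made for illustration.

```python
from collections import defaultdict


def deinterleave(chars):
    """chars: (text, x0, run_id) tuples from one visual line.

    If sorting by x makes run IDs alternate more often than a clean
    left-to-right layout would, emit each run as a unit instead.
    """
    by_x = sorted(chars, key=lambda c: c[1])
    switches = sum(1 for a, b in zip(by_x, by_x[1:]) if a[2] != b[2])
    run_count = len({c[2] for c in by_x})
    if switches <= max(run_count - 1, 0):
        return "".join(c[0] for c in by_x)  # runs do not interleave

    runs = defaultdict(list)
    for c in by_x:
        runs[c[2]].append(c)
    # Order runs by the x position of their first character.
    ordered = sorted(runs.values(), key=lambda run: run[0][1])
    return "".join(c[0] for run in ordered for c in run)


# Two runs whose characters alternate along the x axis.
interleaved = [("A", 0, 1), ("B", 10, 1), ("C", 20, 1),
               ("x", 5, 2), ("y", 15, 2), ("z", 25, 2)]
print(deinterleave(interleaved))
```

A plain sort by x would produce "AxByCz" for this input; grouping by run keeps each text operation's characters together.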
Benchmarks
Tested against PyMuPDF on real-world PDFs (textbooks, novels, academic papers). v0.2.0 includes performance optimizations and improved word boundary detection.
Text Extraction Quality
| PDF | Pages | botl-pdf words | PyMuPDF words | Word coverage |
|---|---|---|---|---|
| Acrimonious (novel) | 408 | 118,767 | 110,314 | 107.7% |
| Agentic Mesh (tech) | 558 | 136,669 | 132,386 | 103.2% |
| Azure Fundamentals | 576 | 89,490 | 87,183 | 102.6% |
| Data Science (textbook) | 438 | 100,594 | 93,286 | 107.8% |
| Discrete Math (textbook) | 565 | 93,691 | 89,968 | 104.1% |
| Mastering AI System Design | 1,038 | 85,854 | 82,608 | 103.9% |
| System Design Interview | 341 | 47,769 | 46,523 | 102.7% |
| American Revolution | 293 | 107,411 | 99,897 | 107.5% |
| Rust Programming 3E | 806 | 203,941 | 196,748 | 103.7% |
| Total (17 PDFs) | 6,663 | 1,399,763 | 1,331,742 | 105.1% |
Character-level coverage: 99.7% of PyMuPDF. botl-pdf extracts 5% more words overall.
Performance
| PDF | Pages | botl-pdf | PyMuPDF | Ratio |
|---|---|---|---|---|
| Mastering AI System Design | 1,038 | 0.56s | 0.72s | 0.78x (faster) |
| System Design Interview | 341 | 0.21s | 0.31s | 0.66x (faster) |
| Discrete Math | 565 | 0.45s | 0.45s | 1.00x (equal) |
| Faking Fore-Ever (novel) | 196 | 0.21s | 0.21s | 0.98x (faster) |
| American Revolution | 293 | 0.49s | 0.39s | 1.27x |
| Rust Programming 3E | 806 | 0.91s | 0.73s | 1.23x |
| Overall (17 PDFs) | 6,663 | 6.40s | 5.86s | 1.09x |
Overall ~9% slower than PyMuPDF, faster on 5 of 17 PDFs. Competitive on the rest.
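The Ratio column appears to be botl-pdf time divided by PyMuPDF time, so values below 1.0 mean botl-pdf was faster, and word coverage is the botl-pdf word count as a percentage of PyMuPDF's. The sketch below reproduces the overall figures from the totals in the two tables; the function names are illustrative.

```python
def ratio(botl_seconds, pymupdf_seconds):
    """Time ratio; below 1.0 means botl-pdf was faster."""
    return botl_seconds / pymupdf_seconds


def coverage(botl_words, pymupdf_words):
    """Word coverage as a percentage of the PyMuPDF word count."""
    return 100.0 * botl_words / pymupdf_words


overall = ratio(6.40, 5.86)
print(f"{overall:.2f}x, {100 * (overall - 1):.0f}% slower")
print(f"{coverage(1_399_763, 1_331_742):.1f}%")
```

Per-file ratios in the table may differ slightly from ratios computed on the rounded times shown, presumably because they were computed from unrounded measurements.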
What changed in v0.2.0
- ~2x faster than v0.1.x through Arc-based caching, cross-page font cache, zlib-ng backend, and reduced cloning
- Fixed word boundary detection for PDFs that encode spaces as position gaps instead of literal space characters
- Character coverage improved from partial to 99.7% of PyMuPDF across diverse PDF types
Development
# Set up environment
python -m venv .venv && source .venv/bin/activate
pip install maturin pytest
# Build Rust extension in release mode
maturin develop --release
# Run Rust tests (198 tests)
cd rust && cargo test
# Run Python tests
pytest tests/python/
# Run benchmarks
pytest tests/python/benchmarks/ --benchmark-only
Project structure
botl-pdf/
├── rust/
│ ├── botl-pdf-core/ # Core engine (parser, text, layout, codecs)
│ ├── botl-pdf-python/ # PyO3 bindings → _core native module
│ └── botl-pdf-csys/ # Image codec FFI (JPEG, JPEG2000)
├── python/botl_pdf/ # High-level Python API
│ ├── document.py # Document, PageCollection
│ ├── page.py # Page wrapper
│ ├── export.py # to_text(), to_markdown()
│ ├── debug.py # VisualDebugger (Pillow overlays)
│ ├── tables.py # Table/TableCell dataclasses
│ └── cli/main.py # CLI: text, info, export
├── tests/
│ ├── rust/ # Integration tests (parser, text, layout, geometry)
│ └── python/ # Unit + integration tests
└── docs/ # Sphinx docs
License
Apache 2.0