A more intuitive interface for working with PDFs
Natural PDF
A friendly library for working with PDFs, built on top of pdfplumber.
Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
Installation
pip install natural-pdf
Need OCR, semantic search, export, or AI-powered extraction? Install what you need:
pip install "natural-pdf[all]" # Recommended feature-complete install
pip install "natural-pdf[export]" # Export helpers only
pip install easyocr # Extra OCR backend
pip install "natural-pdf[paddle]" # PaddleOCR stack
pip install "surya-ocr<0.15" # Surya OCR engine
pip install python-doctr # Doctr OCR engine
More details in the installation guide.
natural-pdf[all] is the recommended feature-complete install for core features: the default RapidOCR engine, sentence-transformers-based semantic search, QA/extraction dependencies, and export support. It does not install every optional backend. Extra engines such as PaddleOCR, Surya, and Doctr stay opt-in, and Natural PDF will tell you what to install when you try to use something that is missing.
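Before choosing extras, it can help to see which optional backends are already importable in the current environment. The sketch below uses only the standard library and is not part of Natural PDF. The import names in `OPTIONAL_MODULES` are assumptions about each project's top-level module, so adjust them if your installed versions differ.

```python
from importlib.util import find_spec

# Assumed import names for the optional backends; not confirmed by Natural PDF.
OPTIONAL_MODULES = {
    "EasyOCR": "easyocr",
    "PaddleOCR": "paddleocr",
    "Surya": "surya",
    "Doctr": "doctr",
    "sentence-transformers": "sentence_transformers",
}


def available_backends(modules=OPTIONAL_MODULES):
    """Map each label to True if its top-level module can be located."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


# Print a small report for the current environment.
for label, present in available_backends().items():
    print(f"{label:<24}{'installed' if present else 'missing'}")
```

Anything reported as missing can then be installed with the matching command from the list above.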
Quick Start
from natural_pdf import PDF
# Open a PDF
pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]
# Extract all of the text on the page
page.extract_text()
# Find elements using CSS-like selectors
heading = page.find('text:contains("Summary"):bold')
# Extract content below the heading
content = heading.below().extract_text()
# Examine all the bold text on the page
page.find_all('text:bold').show()
# Exclude parts of the page from selectors/extractors
header = page.find('text:contains("CONFIDENTIAL")').above()
footer = page.find_all('line')[-1].below()
page.add_exclusion(header)
page.add_exclusion(footer)
# Extract clean text from the page ignoring exclusions
clean_text = page.extract_text()
As a fun bonus, page.viewer() provides an interactive way to explore the PDF.
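The heading-then-below pattern from the Quick Start can be wrapped in a small helper. This is a sketch, not part of Natural PDF: `text_below_heading` is a hypothetical name, it relies only on the `find`, `below`, and `extract_text` calls shown above, and it assumes `find` returns `None` when nothing matches.

```python
def text_below_heading(page, heading_text, bold=True):
    """Return the text below the first element containing `heading_text`.

    `page` is expected to behave like a Natural PDF page. Returns None when
    no heading is found (assumes `page.find` yields None on no match).
    `heading_text` should not contain double quotes, since it is placed
    inside a quoted selector string without escaping.
    """
    selector = f'text:contains("{heading_text}")'
    if bold:
        selector += ":bold"
    heading = page.find(selector)
    if heading is None:
        return None
    return heading.below().extract_text()


# Usage with a real page (requires natural-pdf to be installed):
#   from natural_pdf import PDF
#   page = PDF("document.pdf").pages[0]
#   print(text_below_heading(page, "Summary"))
```

Returning `None` for a missing heading keeps the caller from hitting an attribute error on `.below()` when a page lacks the expected section.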
Key Features
Natural PDF offers a range of features for working with PDFs:
- CSS-like Selectors: Find elements using intuitive query strings (page.find('text:bold')).
- Spatial Navigation: Select content relative to other elements (heading.below(), element.select_until(...)).
- Text & Table Extraction: Get clean text or structured table data, automatically handling exclusions.
- OCR Integration: Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
- Layout Analysis: Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
- Document QA: Ask natural language questions about your document's content.
- Semantic Search: Rank pages within a PDF by semantic similarity using sentence-transformer embeddings.
- Visual Debugging: Highlight elements and use an interactive viewer or save images to understand your selections.
Learn More
Dive deeper into the features and explore advanced usage in the Complete Documentation.
Extending Natural PDF
Natural PDF now exposes its pluggable engines through small helper functions so you rarely have to touch the core registry directly. Two handy entry points:
from natural_pdf.tables import register_table_function

def table_delim(region, *, context=None, **kwargs):
    # return a TableResult or list-of-lists
    ...

register_table_function("table_delim", table_delim)

from natural_pdf.selectors import register_selector_engine

class DebugSelectorEngine:
    def query(self, *, context, selector, options):
        ...

register_selector_engine("debug", lambda **_: DebugSelectorEngine())
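As an illustration of what a table function body might look like, the sketch below splits a region's text into rows and cells on a delimiter. It assumes the region offers `extract_text()` as in the Quick Start and that a list-of-lists return value is accepted, as the comment in the `table_delim` stub indicates. The `delimiter` keyword is a hypothetical option introduced here, not a documented Natural PDF parameter.

```python
def table_delim(region, *, context=None, delimiter="|", **kwargs):
    """Parse delimiter-separated text in `region` into a list of rows.

    `delimiter` is a hypothetical keyword for this sketch; `context` and
    any extra keyword arguments are accepted but ignored.
    """
    text = region.extract_text() or ""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows


# Registration, as in the snippet above (requires natural-pdf):
#   from natural_pdf.tables import register_table_function
#   register_table_function("table_delim", table_delim)
```

Because the function only touches `region.extract_text()`, it can be exercised with a simple stand-in object before it is registered.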
Best friends
Natural PDF sits on top of a lot of fantastic tools and models, some of which are:
- pdfplumber
- EasyOCR
- PaddleOCR
- Surya
- A specific YOLO
- doctr
- docling
- Hugging Face
Download files
File details
Details for the file natural_pdf-0.6.1.tar.gz.
File metadata
- Download URL: natural_pdf-0.6.1.tar.gz
- Upload date:
- Size: 716.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 271a1fd27a17ae8618fa3294e1d6e3970e18a93ae1daf123fb50ae04430c36d7 |
| MD5 | 5696b340233abee866abf6e61b6be3d9 |
| BLAKE2b-256 | b82f961e9e8d435e096fbb757c8f9197e759672c6faff38ffdd40caf7ac9ecbc |
File details
Details for the file natural_pdf-0.6.1-py3-none-any.whl.
File metadata
- Download URL: natural_pdf-0.6.1-py3-none-any.whl
- Upload date:
- Size: 739.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 310c501f9fa7fdb8ae9a76cbeb3b073548b4e451ef7713c9a95399e70c701340 |
| MD5 | dfbaffaa2ab05626861da06483a1ab5a |
| BLAKE2b-256 | 3e2072ce3999aae3f96819efa06a02e5994720761700f832c610e0450dfbfc09 |
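A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. The sketch below is one way to do that. The helper names `sha256_of` and `matches_published` are hypothetical, and the file path in the usage comment assumes the source distribution was saved to the working directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 digest with a published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the sdist has been downloaded to the working directory):
#   matches_published(
#       "natural_pdf-0.6.1.tar.gz",
#       "271a1fd27a17ae8618fa3294e1d6e3970e18a93ae1daf123fb50ae04430c36d7",
#   )
```

Reading in chunks keeps memory use flat regardless of file size.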