justpdf - PDF Reader

A lightweight, high-performance PDF text extraction library with a pandas-style API.

Features

Pandas-style API - Simple one-liners like justpdf.read_pdf('file.pdf')
Zero dependencies - Pure Python with stdlib only
High performance - LRU caching, lazy loading, optimized regex
PyPDF2-compatible - Easy migration from PyPDF2
Google Docs PDFs - Full support for Google Docs exported PDFs with ToUnicode CMap parsing
Clean API - Intuitive and simple to use

Installation

Install from the TestPyPI index (the command below pins version 0.0.1):

pip install -i https://test.pypi.org/simple/ justpdf==0.0.1

Or clone this repo:

git clone https://github.com/Pujan-Dev/justpdf.git
cd justpdf

Quick Start

Pandas-style API (Recommended)

import justpdf

# Read entire PDF (like pd.read_csv)
text = justpdf.read_pdf('document.pdf')

# Read specific pages (0-indexed)
text = justpdf.read_pdf('document.pdf', pages=[0, 1, 2])

# Get PDF info
info = justpdf.read_pdf_info('document.pdf')
print(f"Pages: {info['page_count']}")

# Search for text
results = justpdf.search_pdf('document.pdf', 'keyword')
for r in results:
    print(f"Page {r['page']}: {r['text']}")

PyPDF2-style API

import justpdf

# Create reader
reader = justpdf.PdfReader('document.pdf')

# Get page count
print(f"Pages: {reader.page_count}")

# Extract text
text = reader.extract_text()

# Access individual pages
page_text = reader.pages[0].text

# Search
results = reader.search('keyword')

# Get metadata
print(reader.metadata)

API Reference

Pandas-style Functions

`read_pdf(file_path, pages=None)`

Extract text from PDF.

Args:

file_path (str): Path to PDF file
pages (list, optional): List of page numbers (0-indexed)

Returns: Extracted text as string

Example:

# All pages
text = justpdf.read_pdf('doc.pdf')

# Specific pages
text = justpdf.read_pdf('doc.pdf', pages=[0, 2, 4])
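
Since `pages` is 0-indexed while page ranges are usually quoted 1-indexed, a small conversion helper can avoid off-by-one mistakes. The sketch below is not part of justpdf: `parse_page_ranges` is a hypothetical name, and it only builds the list that would then be passed as `pages=`.

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Convert a 1-indexed range string such as "1-3,5" to 0-indexed page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-indexed (human) to 0-indexed (the `pages` argument).
        pages.extend(range(start - 1, end))
    return pages


pages = parse_page_ranges("1-3,5")
print(pages)  # [0, 1, 2, 4]
# text = justpdf.read_pdf('doc.pdf', pages=pages)
```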

`read_pdf_info(file_path)`

Get PDF information.

Returns: Dict with file_path, page_count, metadata

`search_pdf(file_path, query, case_sensitive=False)`

Search for text in PDF.

Returns: List of dicts with page, line, text keys
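
Because the result is a plain list of dicts, ordinary Python is enough to post-process it. The sketch below groups hits by page. It runs on hand-written sample data in the documented shape (`page`, `line`, `text`); `group_by_page` is a hypothetical helper, and treating `page` as an integer is an assumption.

```python
from collections import defaultdict


def group_by_page(results: list[dict]) -> dict[int, list[str]]:
    """Group search hits by page, keeping the matched text in original order."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for hit in results:
        grouped[hit["page"]].append(hit["text"])
    return dict(grouped)


# Hand-written sample; real values would come from
# justpdf.search_pdf('document.pdf', 'keyword').
sample = [
    {"page": 0, "line": 3, "text": "keyword appears here"},
    {"page": 2, "line": 1, "text": "another keyword"},
    {"page": 0, "line": 9, "text": "keyword again"},
]
for page, lines in sorted(group_by_page(sample).items()):
    print(f"Page {page}: {len(lines)} hit(s)")
```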

PdfReader Class

`PdfReader(file_path)`

Initialize PDF reader.

Properties:

pages - List of PDFPage objects
page_count - Number of pages
metadata - PDF metadata dict

Methods:

extract_text(pages=None) - Extract text
search(query, case_sensitive=False) - Search text
get_info() - Get PDF info
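
The feature list says pages load only when accessed. How `PdfReader.pages` achieves that is not shown in this README; the sketch below illustrates one common way to build such a container. `LazyPages` and `fake_loader` are hypothetical names, not justpdf's actual classes.

```python
from collections.abc import Callable, Sequence


class LazyPages(Sequence):
    """Sequence whose items are produced on first access and then remembered."""

    def __init__(self, count: int, loader: Callable[[int], str]) -> None:
        self._count = count
        self._loader = loader
        self._loaded: dict[int, str] = {}

    def __len__(self) -> int:
        return self._count

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(self._count))]
        if index < 0:
            index += self._count
        if not 0 <= index < self._count:
            raise IndexError("page index out of range")
        if index not in self._loaded:
            # The expensive work happens only here, once per page.
            self._loaded[index] = self._loader(index)
        return self._loaded[index]


calls: list[int] = []


def fake_loader(index: int) -> str:
    """Stand-in for real page extraction; records which pages were loaded."""
    calls.append(index)
    return f"text of page {index}"


pages = LazyPages(3, fake_loader)
print(len(pages), calls)  # 3 []
print(pages[1])           # text of page 1
print(pages[1], calls)    # text of page 1 [1]
```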

Performance

justpdf is optimized for speed with multiple techniques:

Object indexing - O(1) object lookup instead of O(n) regex searches
Pre-compiled patterns - Regex patterns compiled once at class level
LRU caching - Decompressed streams cached (256 entry cache)
Lazy loading - Pages load only when accessed
Quick checks - Fast byte searches before expensive regex

Benchmark: 704-page PDF (598K characters):

Initial load: 1.4ms
Page extraction: 21ms/page
Total extraction: ~15s (all 704 pages)
Cached access: ~0ms (too fast to register at this precision)
Re-extract: 0.2ms (all pages cached)
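
Figures like these depend on the machine and the file, so it can be worth measuring cold versus cached access yourself. The sketch below shows one way to do that with `time.perf_counter`. `slow_extract` is a stand-in that sleeps to simulate work; to time justpdf, swap in a real call such as `justpdf.read_pdf`.

```python
import time
from functools import lru_cache


def time_call(func, *args) -> tuple[object, float]:
    """Run func(*args) once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1000.0


@lru_cache(maxsize=256)
def slow_extract(page: int) -> str:
    """Stand-in for real page extraction; sleeps to simulate work."""
    time.sleep(0.02)
    return f"page {page}"


_, cold_ms = time_call(slow_extract, 0)  # first call does the work
_, warm_ms = time_call(slow_extract, 0)  # second call is answered by the cache
print(f"cold: {cold_ms:.2f} ms, cached: {warm_ms:.4f} ms")
```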

Comparison: Small PDF (1 page):

justpdf: 3ms
PyPDF2: 10ms
Result: justpdf is roughly 3x faster on this file.

Examples

See demo.py for comprehensive examples:

python demo.py

Or try the quick demo:

python main.py

Supported Features

✅ Text extraction (Tj, TJ, ', " operators)
✅ Hex-encoded strings (<hex> format, e.g. <48656C6C6F>)
✅ Google Docs PDFs (ToUnicode CMap parsing)
✅ Multiple encodings (UTF-8, UTF-16, Latin-1, CP1252)
✅ FlateDecode compression
✅ Metadata extraction
✅ Text search
✅ Page-by-page access

❌ Encrypted PDFs (not supported)
❌ Image extraction (metadata only)
❌ PDF writing/modification
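
For readers unfamiliar with the text operators named above, the sketch below pulls strings out of a decompressed content stream. It is deliberately simplified and is not justpdf's implementation: escaped or nested parentheses are not handled, and the names `TOKEN`, `decode_pdf_bytes` and `extract_text` are hypothetical.

```python
import re

TOKEN = re.compile(
    rb"\((?P<lit>[^()\\]*)\)\s*(?:Tj|'|\")"  # (text) Tj   (text) '   (text) "
    rb"|<(?P<hex>[0-9A-Fa-f\s]*)>\s*Tj"      # <48656C6C6F> Tj
    rb"|\[(?P<arr>[^\]]*)\]\s*TJ"            # [(Hel) -20 (lo)] TJ
)
ARRAY_ITEM = re.compile(rb"\((?P<lit>[^()\\]*)\)|<(?P<hex>[0-9A-Fa-f\s]*)>")


def decode_pdf_bytes(raw: bytes) -> str:
    """Try UTF-16 (when a BOM is present), then UTF-8, CP1252, and Latin-1."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # never fails: every byte maps to a character


def unhex(raw: bytes) -> bytes:
    """Decode a hex string body; an odd trailing digit is padded with 0."""
    digits = b"".join(raw.split())
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def piece(match: re.Match) -> str:
    """Turn one literal-string or hex-string match into text."""
    if match.group("lit") is not None:
        return decode_pdf_bytes(match.group("lit"))
    return decode_pdf_bytes(unhex(match.group("hex")))


def extract_text(stream: bytes) -> str:
    """Concatenate every string shown by Tj, ', " or TJ, in stream order."""
    parts: list[str] = []
    for m in TOKEN.finditer(stream):
        if m.group("arr") is not None:
            # TJ arrays mix strings with kerning numbers; keep only the strings.
            parts.extend(piece(item) for item in ARRAY_ITEM.finditer(m.group("arr")))
        else:
            parts.append(piece(m))
    return "".join(parts)


sample = b"BT /F1 12 Tf (Hello ) Tj [(Wor) -20 (ld)] TJ <21> Tj ET"
print(extract_text(sample))  # Hello World!
```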

Note: Fully supports Google Docs exported PDFs with custom font encodings via ToUnicode CMap parsing.
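
A ToUnicode CMap maps a font's character codes to Unicode values, which is what makes text recoverable when a font uses a custom encoding. The sketch below parses the two most common entry types (`bfchar` and the simple form of `bfrange`) and applies them to a hex string. It is an illustration only: the function names are hypothetical, 2-byte codes are assumed, and the array form of `bfrange` is not handled.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.S)
BFRANGE_BLOCK = re.compile(rb"beginbfrange(.*?)endbfrange", re.S)
HEX_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
HEX_TRIPLE = re.compile(
    rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_tounicode(cmap: bytes) -> dict[int, str]:
    """Build a {character code: Unicode text} table from a ToUnicode CMap."""
    mapping: dict[int, str] = {}
    for block in BFCHAR_BLOCK.findall(cmap):
        for src, dst in HEX_PAIR.findall(block):
            # Destinations are UTF-16BE code units written as hex digits.
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[int(src, 16)] = text
    for block in BFRANGE_BLOCK.findall(cmap):
        for lo, hi, dst in HEX_TRIPLE.findall(block):
            base = int(dst, 16)
            # Consecutive codes map to consecutive code points from `base`.
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(base + offset)
    return mapping


def decode_with_cmap(hex_string: str, mapping: dict[int, str], code_bytes: int = 2) -> str:
    """Split a hex string into fixed-width codes and look each one up."""
    raw = bytes.fromhex(hex_string)
    codes = (
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    )
    # Unmapped codes become U+FFFD so gaps stay visible.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


cmap = b"""
2 beginbfchar
<0003> <0020>
<0029> <0048>
endbfchar
1 beginbfrange
<0050> <0052> <0061>
endbfrange
"""
mapping = parse_tounicode(cmap)
print(decode_with_cmap("00290003005000510052", mapping))  # H abc
```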

Requirements

Python 3.11+
No external dependencies!

License

Open source - free to use and modify under the Apache License 2.0.

Why justpdf?

Simple: Clean API like pandas - justpdf.read_pdf('file.pdf')
Fast: Optimized with caching and lazy loading
Lightweight: Zero dependencies, pure Python
Compatible: PyPDF2-style API for easy migration

Perfect for:

Text extraction from PDFs
PDF content analysis
Document processing pipelines
Data extraction workflows

Made with ❤️ by Pujan Neupane — Fast, simple, and powerful PDF reading for Python.

Release history

0.0.2 (this version): Mar 17, 2026
0.0.1: Mar 17, 2026

Download files

Source Distribution

justpdf-0.0.2.tar.gz (11.3 kB)
Uploaded Mar 17, 2026 Source

Built Distribution

justpdf-0.0.2-py3-none-any.whl (12.7 kB)
Uploaded Mar 17, 2026 Python 3

File details

Details for the file justpdf-0.0.2.tar.gz.

File metadata

Download URL: justpdf-0.0.2.tar.gz
Upload date: Mar 17, 2026
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for justpdf-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`065af153a25396455d75616cb622de2b455e39b02b5972ff9177be7cc5db7212`
MD5	`9a81360c20dbbedcace0d3f48830b9b8`
BLAKE2b-256	`a920ffb425b339593cdc7ad2ff4d5d1f25bf7e4be6adc88aa4af2c3e455f46ef`

File details

Details for the file justpdf-0.0.2-py3-none-any.whl.

File metadata

Download URL: justpdf-0.0.2-py3-none-any.whl
Upload date: Mar 17, 2026
Size: 12.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for justpdf-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c310d208da212f7e80c38f23235553f152e894bbd3979be0a6d34a577b052e64`
MD5	`36fd55a792ad448c6cb7ed32fbc60990`
BLAKE2b-256	`455b6983dca0a5552dcb7740da3dff719a7bf534aced2f9f8e4a181fa4d6f205`
