Extract text from PDFs with position data

These details have not been verified by PyPI

Project description

pdf-strings

Extract text from PDFs with position data.

Installation

pip install pdf-strings

Quick Start

from pdf_strings import from_path

# Extract text from a PDF
output = from_path("document.pdf")
print(output)  # Plain text

API Reference

Functions

`from_path(path: str, *, password: str | None = None) -> TextOutput`

Extract text from a PDF file.

Parameters:

path (str): Path to the PDF file
password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_path

# Basic usage
output = from_path("document.pdf")

# With password
output = from_path("encrypted.pdf", password="secret")

`from_bytes(data: bytes, *, password: str | None = None) -> TextOutput`

Extract text from PDF bytes.

Parameters:

data (bytes): PDF file contents as bytes
password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_bytes

with open("document.pdf", "rb") as f:
    data = f.read()

output = from_bytes(data)

Classes

`TextOutput`

Container for extracted text with structured data.

Attributes:

lines (List[List[TextSpan]]): Lines of text, each containing multiple spans

Methods:

`to_string() -> str`

Get plain text output (concatenates all text with spaces).

output = from_path("document.pdf")
plain_text = output.to_string()
# or simply:
plain_text = str(output)

`to_string_pretty() -> str`

Get formatted text that preserves spatial layout using a character grid.

output = from_path("document.pdf")
formatted_text = output.to_string_pretty()
# or using format spec:
formatted_text = f"{output:#}"
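To build intuition for what a character grid does, here is a toy sketch. It is not the library's implementation: `render_grid`, `char_width` and `line_height` are hypothetical names, and the function takes plain `(x, y, text)` tuples, with y measured downward from the top, rather than real `TextSpan` objects.

```python
def render_grid(items, char_width=6.0, line_height=12.0):
    """Place (x, y, text) items on a grid of characters.

    Toy illustration only. x and y are in points, y measured downward.
    """
    cells = {}  # row index -> {column index: character}
    for x, y, text in items:
        row = int(y // line_height)
        col = int(x // char_width)
        for offset, ch in enumerate(text):
            cells.setdefault(row, {})[col + offset] = ch

    if not cells:
        return ""

    lines = []
    for row in range(max(cells) + 1):
        columns = cells.get(row, {})
        width = max(columns) + 1 if columns else 0
        # Columns that received no character are filled with spaces
        lines.append("".join(columns.get(c, " ") for c in range(width)))
    return "\n".join(lines)


demo = render_grid([(0, 0, "Item"), (60, 0, "Price"), (0, 12, "Tea"), (60, 12, "3.50")])
print(demo)
```

Because each span is mapped to a column from its x position, text that starts at the same x lines up in the output, which is what keeps table columns readable.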

Magic Methods:

__str__(): Returns plain text (same as to_string())
__format__(format_spec): Use # for pretty formatting: f"{output:#}"
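The `#` format spec relies on Python's standard `__format__` hook. The sketch below shows one way such a dispatch could be written. `DemoOutput` is a hypothetical stand-in, not the library's `TextOutput`, and its two rendering methods simply return strings passed in at construction.

```python
class DemoOutput:
    """Hypothetical stand-in that mimics the documented str/format behavior."""

    def __init__(self, plain, pretty):
        self._plain = plain
        self._pretty = pretty

    def to_string(self):
        return self._plain

    def to_string_pretty(self):
        return self._pretty

    def __str__(self):
        return self.to_string()

    def __format__(self, format_spec):
        # "#" selects the layout-preserving form
        if format_spec == "#":
            return self.to_string_pretty()
        # Any other spec is applied to the plain text
        return format(self.to_string(), format_spec)


demo = DemoOutput("Name Qty", "Name      Qty")
print(f"{demo}")
print(f"{demo:#}")
```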

`TextSpan`

A span of text with position and metadata.

Attributes:

text (str): The text content
bbox (BoundingBox): Bounding box coordinates
font_size (float): Font size in points
page (int): Page number (0-indexed)

Example:

output = from_path("document.pdf")
for line in output.lines:
    for span in line:
        print(f"'{span.text}' at size {span.font_size}pt on page {span.page}")
        print(f"  Position: {span.bbox}")
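Since every span carries its page number, spans can be regrouped per page. A minimal sketch, assuming only the `text` and `page` attributes documented above; `text_by_page` is a hypothetical helper and the sample data uses `SimpleNamespace` stand-ins rather than real `TextSpan` objects.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(lines):
    """Collect span text per page, keeping the order in which lines appear."""
    pages = defaultdict(list)
    for line in lines:
        for span in line:
            pages[span.page].append(span.text)
    return {page: " ".join(texts) for page, texts in sorted(pages.items())}


# Stand-in spans carrying only the attributes used here (text, page)
sample_lines = [
    [SimpleNamespace(text="Invoice", page=0), SimpleNamespace(text="#42", page=0)],
    [SimpleNamespace(text="Terms", page=1)],
]
pages = text_by_page(sample_lines)
print(pages)
```

With real data, `output.lines` would be passed in place of `sample_lines`.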

`BoundingBox`

Bounding box coordinates for a text span.

Attributes:

top (float): Top coordinate
right (float): Right coordinate
bottom (float): Bottom coordinate
left (float): Left coordinate

String representation: (top, right, bottom, left), the same order as the CSS margin shorthand.

Example:

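# `span` is a TextSpan taken from output.lines
bbox = span.bbox
print(f"Top-left: ({bbox.left}, {bbox.top})")
print(f"Width: {bbox.right - bbox.left}")
# abs() keeps the height positive whichever way the y-axis points
print(f"Height: {abs(bbox.top - bbox.bottom)}")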
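Size and midpoint follow directly from the four attributes. The sketch below uses a stand-in `Box` dataclass that only mirrors the documented attribute names; `Box`, `box_size` and `box_center` are hypothetical and not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Stand-in with the four documented BoundingBox attributes."""
    top: float
    right: float
    bottom: float
    left: float


def box_size(box):
    """Return (width, height), both non-negative."""
    return abs(box.right - box.left), abs(box.top - box.bottom)


def box_center(box):
    """Return the (x, y) midpoint of the box."""
    return (box.left + box.right) / 2, (box.top + box.bottom) / 2


sample = Box(top=700.0, right=172.0, bottom=688.0, left=72.0)
print(box_size(sample))
print(box_center(sample))
```

The helpers only read `top`, `right`, `bottom` and `left`, so they should accept any object exposing those attributes.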

Usage Examples

Extract all text

from pdf_strings import from_path

output = from_path("document.pdf")
print(output.to_string())

Preserve layout

from pdf_strings import from_path

output = from_path("invoice.pdf")
# Character grid rendering preserves columns and spacing
print(output.to_string_pretty())

Access structured data

from pdf_strings import from_path

output = from_path("document.pdf")

for line_idx, line in enumerate(output.lines):
    print(f"Line {line_idx}:")
    for span in line:
        print(f"  {span.text}")
        print(f"    Font size: {span.font_size}")
        print(f"    Position: ({span.bbox.left}, {span.bbox.top})")
        print(f"    Page: {span.page}")
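The nested lines can also be flattened into plain dictionaries, for example to write them out as JSON. A minimal sketch: `spans_to_records` is a hypothetical helper, and the sample data uses `SimpleNamespace` stand-ins shaped like the documented attributes rather than real library objects.

```python
import json
from types import SimpleNamespace


def spans_to_records(lines):
    """Flatten nested lines of spans into a list of plain dictionaries."""
    records = []
    for line_idx, line in enumerate(lines):
        for span in line:
            box = span.bbox
            records.append({
                "line": line_idx,
                "page": span.page,
                "text": span.text,
                "font_size": span.font_size,
                # Same (top, right, bottom, left) order as the string form
                "bbox": [box.top, box.right, box.bottom, box.left],
            })
    return records


# Stand-in data, not real library objects
sample_lines = [[
    SimpleNamespace(
        text="Total",
        font_size=11.0,
        page=0,
        bbox=SimpleNamespace(top=120.0, right=110.0, bottom=108.0, left=72.0),
    )
]]
records = spans_to_records(sample_lines)
print(json.dumps(records, indent=2))
```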

Find text in specific regions

from pdf_strings import from_path

output = from_path("document.pdf")

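# Find text in the top-right corner.
# These thresholds assume y is measured downward from the top of the page.
# If y instead grows upward from the bottom (the PDF default), compare
# `top` against a value near the page height.
for line in output.lines:
    for span in line:
        if span.bbox.top < 100 and span.bbox.left > 400:
            print(f"Top-right text: {span.text}")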
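A region filter can be made independent of the y-axis direction by passing an explicit rectangle and sorting each box's two y values before comparing. This is a sketch under assumptions: `spans_in_region` and `make_span` are hypothetical helpers, the sample spans are stand-ins, and the 612 x 792 point page size (US Letter) is chosen only for the demonstration.

```python
from types import SimpleNamespace


def spans_in_region(lines, x_min, x_max, y_min, y_max):
    """Return spans whose bounding box lies entirely inside the rectangle."""
    found = []
    for line in lines:
        for span in line:
            box = span.bbox
            # Sort the two y values so the check works whichever way y points
            y_low, y_high = sorted((box.top, box.bottom))
            inside_x = x_min <= box.left and box.right <= x_max
            inside_y = y_min <= y_low and y_high <= y_max
            if inside_x and inside_y:
                found.append(span)
    return found


def make_span(text, top, right, bottom, left):
    """Build a stand-in span with only the attributes used above."""
    bbox = SimpleNamespace(top=top, right=right, bottom=bottom, left=left)
    return SimpleNamespace(text=text, bbox=bbox)


sample_lines = [
    [make_span("Invoice #42", top=760, right=560, bottom=748, left=450)],
    [make_span("Dear customer", top=700, right=160, bottom=688, left=72)],
]
hits = spans_in_region(sample_lines, x_min=400, x_max=612, y_min=700, y_max=792)
print([span.text for span in hits])
```

The caller still has to choose `y_min` and `y_max` to match the coordinate origin of the data.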

Extract tables by position

from collections import defaultdict

from pdf_strings import from_path

output = from_path("table.pdf")

# Group spans into rows, keyed by page and by ~10pt vertical band.
# Including the page keeps rows from different pages apart.
rows = defaultdict(list)
for line in output.lines:
    for span in line:
        band = round(span.bbox.top / 10) * 10
        rows[(span.page, band)].append((span.bbox.left, span.text))

# Print rows page by page; within a page, larger y comes first
# (this assumes y grows upward from the bottom of the page)
for key in sorted(rows, key=lambda k: (k[0], -k[1])):
    # Sort spans in each row by horizontal position
    row_spans = sorted(rows[key], key=lambda x: x[0])
    print(" | ".join(text for _, text in row_spans))
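Fixed 10pt bands, as used above, can split one visual row when its spans fall on either side of a band boundary (for example, 705.2 rounds to 710 while 704.8 rounds to 700). An alternative is to cluster by a tolerance instead. The sketch below is one such approach, not something the library provides: `cluster_rows` and `tolerance` are hypothetical names, input is plain `(x, y, text)` tuples from a single page, and y is assumed to grow upward.

```python
def cluster_rows(items, tolerance=3.0):
    """Group (x, y, text) tuples into rows of similar y.

    Items are visited from largest y to smallest. An item joins the current
    row when its y is within `tolerance` of that row's first item.
    """
    rows = []
    anchor_y = None
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if anchor_y is None or anchor_y - y > tolerance:
            rows.append([])
            anchor_y = y
        rows[-1].append((x, text))
    # Within each row, order cells left to right and keep only the text
    return [[text for _, text in sorted(row, key=lambda cell: cell[0])] for row in rows]


items = [
    (200, 704.8, "Qty"),
    (72, 705.2, "Name"),
    (72, 690.1, "Tea"),
    (200, 689.9, "2"),
]
table = cluster_rows(items)
for row in table:
    print(" | ".join(row))
```

A suitable tolerance depends on the document's font size and line spacing, so it is worth tuning per input.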

Features

Plain text extraction
Spatial layout preservation via character grid
Bounding box coordinates for every text span
Font size and page information
Password-protected PDF support
Handles complex fonts, rotated text, and multi-column layouts
Supports Python 3.11 and later

License

MIT

Project details


Development Status
- 3 - Alpha
Intended Audience
- Developers
Programming Language

Release history

0.1.2 (Nov 2, 2025), this version
0.1.1 (Oct 27, 2025)
0.1.0 (Oct 27, 2025)

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

pdf_strings-0.1.2-py3-none-win_amd64.whl (944.1 kB), uploaded Nov 2, 2025, Python 3, Windows x86-64

pdf_strings-0.1.2-py3-none-manylinux_2_17_x86_64.whl (9.0 MB), uploaded Nov 2, 2025, Python 3, manylinux: glibc 2.17+ x86-64

pdf_strings-0.1.2-py3-none-macosx_11_0_arm64.whl (1.0 MB), uploaded Nov 2, 2025, Python 3, macOS 11.0+ ARM64

File details

Details for the file pdf_strings-0.1.2-py3-none-win_amd64.whl.

File metadata

Download URL: pdf_strings-0.1.2-py3-none-win_amd64.whl
Upload date: Nov 2, 2025
Size: 944.1 kB
Tags: Python 3, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for pdf_strings-0.1.2-py3-none-win_amd64.whl
Algorithm	Hash digest
SHA256	`4d74a7ad96760f4bca4b4e61fa2dfbb64269e1fa5f4c2e05d1239b7e90496796`
MD5	`1f4f6cadfee0144b05e530edf80c26c8`
BLAKE2b-256	`c967495028f54cb90bc363ce1ac2d8750935c70dac69fcbe508dd5cd6da7d961`


File details

Details for the file pdf_strings-0.1.2-py3-none-manylinux_2_17_x86_64.whl.

File metadata

Download URL: pdf_strings-0.1.2-py3-none-manylinux_2_17_x86_64.whl
Upload date: Nov 2, 2025
Size: 9.0 MB
Tags: Python 3, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for pdf_strings-0.1.2-py3-none-manylinux_2_17_x86_64.whl
Algorithm	Hash digest
SHA256	`7e024592240dbce9a8ca61cbdd0a565c2383030b583a830866f792e7b94f61cd`
MD5	`72053c451894914069b213595c316063`
BLAKE2b-256	`48d184347688f07b766ba1eb00a7f4bac2b8b58cc1e4accfb24538bf4abf8e24`


File details

Details for the file pdf_strings-0.1.2-py3-none-macosx_11_0_arm64.whl.

File metadata

Download URL: pdf_strings-0.1.2-py3-none-macosx_11_0_arm64.whl
Upload date: Nov 2, 2025
Size: 1.0 MB
Tags: Python 3, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for pdf_strings-0.1.2-py3-none-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`943014022619f99df05a8cbaf6ebfb59b475b73f029ac578fc0753a2e22289ee`
MD5	`c09441a3ea189ea6e2d51396ad7beff3`
BLAKE2b-256	`1cc2d42c84748ac64d6dd5b4cc08478c0f70b507cdd9a2f440f557c5f0120366`

