pdf-strings

Extract text from PDFs with position data.

Installation

pip install pdf-strings

Quick Start

from pdf_strings import from_path

# Extract text from a PDF
output = from_path("document.pdf")
print(output)  # Plain text

API Reference

Functions

from_path(path: str, *, password: str | None = None) -> TextOutput

Extract text from a PDF file.

Parameters:

  • path (str): Path to the PDF file
  • password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_path

# Basic usage
output = from_path("document.pdf")

# With password
output = from_path("encrypted.pdf", password="secret")

from_bytes(data: bytes, *, password: str | None = None) -> TextOutput

Extract text from PDF bytes.

Parameters:

  • data (bytes): PDF file contents as bytes
  • password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_bytes

with open("document.pdf", "rb") as f:
    data = f.read()

output = from_bytes(data)
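When the bytes come from an upload or a network response rather than a local file, a cheap sanity check before calling from_bytes can help. The sketch below relies only on the fact that PDF files carry a `%PDF-` header near the start; `looks_like_pdf` is a hypothetical helper and not part of pdf-strings.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if a PDF header appears near the start of the data."""
    # Some producers prepend a few stray bytes, so search a short prefix
    # instead of requiring the header at offset 0.
    return b"%PDF-" in data[:1024]


sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(looks_like_pdf(sample))            # True
print(looks_like_pdf(b"<html></html>"))  # False

# Possible use (requires pdf-strings to be installed):
# if looks_like_pdf(data):
#     output = from_bytes(data)
```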

Classes

TextOutput

Container for extracted text with structured data.

Attributes:

  • lines (List[List[TextSpan]]): Lines of text, each containing multiple spans

Methods:

to_string() -> str

Get plain text output (concatenates all text with spaces).

output = from_path("document.pdf")
plain_text = output.to_string()
# or simply:
plain_text = str(output)

to_string_pretty() -> str

Get formatted text that preserves spatial layout using a character grid.

output = from_path("document.pdf")
formatted_text = output.to_string_pretty()
# or using format spec:
formatted_text = f"{output:#}"
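To illustrate the general idea of a character grid, here is a minimal sketch that maps page coordinates to rows and columns of fixed-size cells. It is not the library's implementation: `render_grid`, the `(x, y, text)` input shape, the cell sizes, and the assumption that y grows downward are all chosen for illustration.

```python
def render_grid(items, char_width=6.0, line_height=12.0):
    """Place (x, y, text) items on a grid of fixed-size character cells."""
    grid = {}  # row index -> {column index: character}
    for x, y, text in items:
        row = int(y // line_height)
        col = int(x // char_width)
        cells = grid.setdefault(row, {})
        for offset, ch in enumerate(text):
            cells[col + offset] = ch  # later items overwrite earlier ones

    out = []
    for row in range(max(grid) + 1 if grid else 0):
        cells = grid.get(row, {})
        width = max(cells) + 1 if cells else 0
        out.append("".join(cells.get(c, " ") for c in range(width)))
    return "\n".join(out)


items = [(0, 0, "Item"), (60, 0, "Price"), (0, 12, "Pen"), (60, 12, "1.50")]
print(render_grid(items))
```

Because both "Price" and "1.50" start at x = 60, they land in the same column, which is how a grid keeps table columns aligned in plain text.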

Magic Methods:

  • __str__(): Returns plain text (same as to_string())
  • __format__(format_spec): Use # for pretty formatting: f"{output:#}"
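For readers unfamiliar with custom format specs, the pattern can be sketched with a stand-in class. `Report` is hypothetical and only mirrors the documented behaviour (`str()` gives plain text, `#` gives the pretty form); how pdf-strings handles other format specs is not documented here, so the fallback below is an assumption.

```python
class Report:
    """Hypothetical stand-in showing the `#` format-spec pattern."""

    def __init__(self, words):
        self.words = words

    def to_string(self):
        return " ".join(self.words)

    def to_string_pretty(self):
        return "\n".join(self.words)

    def __str__(self):
        return self.to_string()

    def __format__(self, format_spec):
        if format_spec == "#":
            return self.to_string_pretty()
        # Assumption: any other spec is applied to the plain string.
        return format(str(self), format_spec)


report = Report(["a", "b"])
print(f"{report}")    # plain form
print(f"{report:#}")  # pretty form
```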

TextSpan

A span of text with position and metadata.

Attributes:

  • text (str): The text content
  • bbox (BoundingBox): Bounding box coordinates
  • font_size (float): Font size in points
  • page (int): Page number (0-indexed)

Example:

output = from_path("document.pdf")
for line in output.lines:
    for span in line:
        print(f"'{span.text}' at size {span.font_size}pt on page {span.page}")
        print(f"  Position: {span.bbox}")
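One use of font_size is to pick out likely headings. The sketch below treats the most common font size as body text and returns spans that are noticeably larger. `Span` is a hypothetical stand-in carrying only the attributes used here, and `heading_candidates` and its `min_ratio` threshold are illustrative, not part of pdf-strings.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for TextSpan (only the fields used below)."""
    text: str
    font_size: float
    page: int


def heading_candidates(lines, min_ratio=1.2):
    """Return spans at least min_ratio times larger than the most common size."""
    spans = [span for line in lines for span in line]
    if not spans:
        return []
    # The most frequent (rounded) size is taken to be the body text size.
    sizes = Counter(round(span.font_size, 1) for span in spans)
    body_size, _ = sizes.most_common(1)[0]
    return [span for span in spans if span.font_size >= body_size * min_ratio]


lines = [
    [Span("Annual Report", 18.0, 0)],
    [Span("First paragraph.", 10.0, 0)],
    [Span("Second paragraph.", 10.0, 0)],
]
print([span.text for span in heading_candidates(lines)])
```

With real output, the same function should work on `output.lines` directly, since it only reads `font_size`.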

BoundingBox

Bounding box coordinates for a text span.

Attributes:

  • top (float): Top coordinate
  • right (float): Right coordinate
  • bottom (float): Bottom coordinate
  • left (float): Left coordinate

String representation: (top, right, bottom, left), following the order of the CSS margin shorthand.

Example:

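# `span` is a TextSpan taken from output.lines
bbox = span.bbox
print(f"Top-left: ({bbox.left}, {bbox.top})")
print(f"Width: {bbox.right - bbox.left}")
# abs() keeps the height positive whichever way the vertical axis runs
print(f"Height: {abs(bbox.top - bbox.bottom)}")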
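The arithmetic above can be wrapped in small helpers that do not depend on the direction of the vertical axis. `Box` is a hypothetical stand-in with the four documented attributes; `size` and `center` are illustrative names, not library functions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for BoundingBox."""
    top: float
    right: float
    bottom: float
    left: float


def size(box):
    """Return (width, height) as non-negative numbers."""
    return abs(box.right - box.left), abs(box.top - box.bottom)


def center(box):
    """Return the (x, y) midpoint of the box."""
    return (box.left + box.right) / 2, (box.top + box.bottom) / 2


box = Box(top=700.0, right=172.0, bottom=688.0, left=72.0)
print(size(box))    # (100.0, 12.0)
print(center(box))  # (122.0, 694.0)
```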

Usage Examples

Extract all text

from pdf_strings import from_path

output = from_path("document.pdf")
print(output.to_string())

Preserve layout

from pdf_strings import from_path

output = from_path("invoice.pdf")
# Character grid rendering preserves columns and spacing
print(output.to_string_pretty())

Access structured data

from pdf_strings import from_path

output = from_path("document.pdf")

for line_idx, line in enumerate(output.lines):
    print(f"Line {line_idx}:")
    for span in line:
        print(f"  {span.text}")
        print(f"    Font size: {span.font_size}")
        print(f"    Position: ({span.bbox.left}, {span.bbox.top})")
        print(f"    Page: {span.page}")
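Since every span carries a page number, the structured data can be regrouped per page. The sketch below assumes that all spans in a line share the page of the first span, which the documentation does not state explicitly; `text_by_page` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real TextSpan values.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(lines):
    """Return {page: text}, joining spans with spaces and lines with newlines."""
    pages = defaultdict(list)
    for line in lines:
        if not line:
            continue  # skip empty lines defensively
        # Assumption: a line never spans more than one page.
        pages[line[0].page].append(" ".join(span.text for span in line))
    return {page: "\n".join(texts) for page, texts in sorted(pages.items())}


lines = [
    [SimpleNamespace(text="Hello", page=0), SimpleNamespace(text="world", page=0)],
    [SimpleNamespace(text="Next", page=0)],
    [SimpleNamespace(text="Second page", page=1)],
]
print(text_by_page(lines))
```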

Find text in specific regions

from pdf_strings import from_path

output = from_path("document.pdf")

# Find text in the top-right corner. The thresholds are in points and
# assume y is measured down from the top of the page.
for line in output.lines:
    for span in line:
        if span.bbox.top < 100 and span.bbox.left > 400:
            print(f"Top-right text: {span.text}")
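The same idea generalizes to an arbitrary rectangle. In this sketch, `spans_in_region` and `fake_span` are hypothetical helpers; the function only reads the documented `bbox` attributes, and it compares against the smaller and larger of top and bottom so that it does not depend on the direction of the vertical axis.

```python
from types import SimpleNamespace


def spans_in_region(lines, *, x_min, x_max, y_min, y_max):
    """Return spans whose bounding box lies entirely inside the rectangle."""
    hits = []
    for line in lines:
        for span in line:
            box = span.bbox
            y_lo = min(box.top, box.bottom)
            y_hi = max(box.top, box.bottom)
            inside_x = x_min <= box.left and box.right <= x_max
            inside_y = y_min <= y_lo and y_hi <= y_max
            if inside_x and inside_y:
                hits.append(span)
    return hits


def fake_span(text, top, right, bottom, left):
    """Build a stand-in span with a nested bbox, for demonstration only."""
    bbox = SimpleNamespace(top=top, right=right, bottom=bottom, left=left)
    return SimpleNamespace(text=text, bbox=bbox)


lines = [
    [fake_span("Invoice #42", 40, 560, 52, 450)],
    [fake_span("Dear customer", 140, 200, 152, 72)],
]
corner = spans_in_region(lines, x_min=400, x_max=612, y_min=0, y_max=100)
print([span.text for span in corner])
```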

Extract tables by position

from pdf_strings import from_path

output = from_path("table.pdf")

# Group spans by their vertical position (rows)
rows = {}
for line in output.lines:
    for span in line:
        row_key = round(span.bbox.top / 10) * 10  # Group by ~10pt vertical bands
        # setdefault creates the row list on first use
        rows.setdefault(row_key, []).append((span.bbox.left, span.text))

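# Print rows sorted by vertical position; reverse=True lists larger y values first,
# which is top-to-bottom only if y is measured up from the bottom of the page
for y_pos in sorted(rows.keys(), reverse=True):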
    # Sort spans in each row by horizontal position
    row_spans = sorted(rows[y_pos], key=lambda x: x[0])
    print(" | ".join(text for _, text in row_spans))
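Fixed bands can split one visual row when its spans straddle a band boundary (for example, 104.9 rounds to band 100 and 105.1 to band 110). An alternative is to sort by vertical position and start a new row only when the gap exceeds a tolerance. The sketch below works on plain `(y, x, text)` tuples; `cluster_rows` and the 3-point tolerance are illustrative assumptions, not part of pdf-strings.

```python
def cluster_rows(cells, tolerance=3.0):
    """Group (y, x, text) cells into rows of text, ordered by ascending y."""
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for y, x, text in sorted(cells, key=lambda cell: cell[0]):
        if rows and abs(y - rows[-1][0]) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order cells left to right.
    return [
        [text for _, text in sorted(items, key=lambda item: item[0])]
        for _, items in rows
    ]


cells = [
    (105.1, 300, "1.50"),
    (104.9, 72, "Pen"),
    (120.2, 300, "4.00"),
    (120.0, 72, "Ink"),
]
for row in cluster_rows(cells):
    print(" | ".join(row))
```

With real output, the cells could be built as `(span.bbox.top, span.bbox.left, span.text)` for each span; reverse the returned list if the page's vertical axis runs the other way.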

Features

  • Plain text extraction
  • Spatial layout preservation via character grid
  • Bounding box coordinates for every text span
  • Font size and page information
  • Password-protected PDF support
  • Handles complex fonts, rotated text, and multi-column layouts
  • Requires Python 3.11 or later

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • pdf_strings-0.1.2-py3-none-win_amd64.whl (944.1 kB): Python 3, Windows x86-64
  • pdf_strings-0.1.2-py3-none-manylinux_2_17_x86_64.whl (9.0 MB): Python 3, manylinux (glibc 2.17+), x86-64
  • pdf_strings-0.1.2-py3-none-macosx_11_0_arm64.whl (1.0 MB): Python 3, macOS 11.0+, ARM64
