pdf-strings
Extract text from PDFs with position data.
Installation
pip install pdf-strings
Quick Start
from pdf_strings import from_path
# Extract text from a PDF
output = from_path("document.pdf")
print(output) # Plain text
API Reference
Functions
from_path(path: str, *, password: str | None = None) -> TextOutput
Extract text from a PDF file.
Parameters:
- path (str): Path to the PDF file
- password (str, optional): Password for encrypted PDFs
Returns: TextOutput object containing structured lines and spans
Example:
from pdf_strings import from_path
# Basic usage
output = from_path("document.pdf")
# With password
output = from_path("encrypted.pdf", password="secret")
from_bytes(data: bytes, *, password: str | None = None) -> TextOutput
Extract text from PDF bytes.
Parameters:
- data (bytes): PDF file contents as bytes
- password (str, optional): Password for encrypted PDFs
Returns: TextOutput object containing structured lines and spans
Example:
from pdf_strings import from_bytes
with open("document.pdf", "rb") as f:
    data = f.read()
output = from_bytes(data)
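When the PDF arrives as a stream rather than a file on disk (an upload, a response body), the bytes can be collected first and then handed to from_bytes. The sketch below is a hypothetical helper, read_pdf_bytes, that is not part of pdf-strings. It only normalizes a path or a binary file-like object to bytes and checks for the %PDF- header, assuming the marker appears within the first 1024 bytes as most readers tolerate.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union


def read_pdf_bytes(source: Union[str, Path, BinaryIO]) -> bytes:
    """Return raw PDF bytes from a path or a binary file-like object.

    Hypothetical helper for illustration; not part of pdf-strings.
    """
    if isinstance(source, (str, Path)):
        data = Path(source).read_bytes()
    else:
        data = source.read()
    # Conforming PDFs begin with "%PDF-"; searching the first 1024 bytes
    # is a lenient check that fails early on clearly non-PDF input.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("input does not look like a PDF (no %PDF- header)")
    return data


# In-memory example; a real call would pass the result to from_bytes().
sample = read_pdf_bytes(io.BytesIO(b"%PDF-1.7\n"))
```

With the package installed, the result would be passed on as from_bytes(read_pdf_bytes(stream)).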
Classes
TextOutput
Container for extracted text with structured data.
Attributes:
- lines (List[List[TextSpan]]): Lines of text, each containing multiple spans
Methods:
to_string() -> str
Get plain text output (concatenates all text with spaces).
output = from_path("document.pdf")
plain_text = output.to_string()
# or simply:
plain_text = str(output)
to_string_pretty() -> str
Get formatted text that preserves spatial layout using a character grid.
output = from_path("document.pdf")
formatted_text = output.to_string_pretty()
# or using format spec:
formatted_text = f"{output:#}"
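To make the idea of a character grid concrete, here is a minimal sketch of how positioned spans could be placed on one. It is an illustration only and not the algorithm pdf-strings uses: the Span tuple, the render_grid function, and the cell sizes (char_width, line_height) are all assumptions, and top is taken to increase down the page.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Stand-in for a positioned span (hypothetical, not the library's TextSpan)."""
    text: str
    left: float
    top: float


def render_grid(spans: List[Span], char_width: float = 6.0, line_height: float = 12.0) -> str:
    """Place each span at grid cell (top // line_height, left // char_width)."""
    rows = {}
    for span in spans:
        row = int(span.top // line_height)
        col = int(span.left // char_width)
        cells = rows.setdefault(row, [])
        # Pad the row with spaces so the span starts at its column.
        if len(cells) < col:
            cells.extend(" " * (col - len(cells)))
        # Write the characters, overwriting existing cells or growing the row.
        for offset, ch in enumerate(span.text):
            idx = col + offset
            if idx < len(cells):
                cells[idx] = ch
            else:
                cells.append(ch)
    if not rows:
        return ""
    # Emit every grid row between the first and last used, keeping blank rows.
    return "\n".join(
        "".join(rows.get(r, [])).rstrip() for r in range(min(rows), max(rows) + 1)
    )


grid = render_grid([
    Span("Item", 0, 0), Span("Price", 60, 0),
    Span("Tea", 0, 12), Span("3.50", 60, 12),
])
```

Both rows place their second cell at column 10, which is how a grid keeps columns aligned without any table detection.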
Magic Methods:
- __str__(): Returns plain text (same as to_string())
- __format__(format_spec): Use # for pretty formatting: f"{output:#}"
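For readers unfamiliar with format specs, the sketch below shows how a Python class can route the # spec to a pretty renderer through __format__. Report is a hypothetical stand-in; this demonstrates the language protocol, not the internals of TextOutput.

```python
class Report:
    """Hypothetical stand-in showing the __str__ / __format__ protocol."""

    def __init__(self, words):
        self.words = list(words)

    def to_string(self) -> str:
        return " ".join(self.words)

    def to_string_pretty(self) -> str:
        return "\n".join(self.words)

    def __str__(self) -> str:
        return self.to_string()

    def __format__(self, format_spec: str) -> str:
        # "#" selects the pretty form; an empty spec falls back to str().
        if format_spec == "#":
            return self.to_string_pretty()
        if format_spec == "":
            return str(self)
        raise ValueError(f"unsupported format spec: {format_spec!r}")


report = Report(["total", "42"])
plain = f"{report}"     # goes through __format__ with an empty spec
pretty = f"{report:#}"  # goes through __format__ with spec "#"
```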
TextSpan
A span of text with position and metadata.
Attributes:
- text (str): The text content
- bbox (BoundingBox): Bounding box coordinates
- font_size (float): Font size in points
- page (int): Page number (0-indexed)
Example:
output = from_path("document.pdf")
for line in output.lines:
    for span in line:
        print(f"'{span.text}' at size {span.font_size}pt on page {span.page}")
        print(f" Position: {span.bbox}")
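Because every span carries page and font_size, the nested lines structure can be flattened and regrouped. The following is a minimal sketch that relies only on the attributes documented above. The Span dataclass is a stand-in so the code runs without a PDF, and text_by_page and largest_spans are hypothetical helpers, not library functions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in with the documented TextSpan attributes used here."""
    text: str
    font_size: float
    page: int


def text_by_page(lines):
    """Map page number -> list of line strings (spans joined with spaces)."""
    pages = defaultdict(list)
    for line in lines:
        if line:  # skip empty lines
            # Assumes all spans of a line share a page; the first span decides.
            pages[line[0].page].append(" ".join(span.text for span in line))
    return dict(pages)


def largest_spans(lines):
    """Return the spans set in the largest font size, e.g. candidate titles."""
    spans = [span for line in lines for span in line]
    if not spans:
        return []
    biggest = max(span.font_size for span in spans)
    return [span for span in spans if span.font_size == biggest]


demo_lines = [
    [Span("Title", 24.0, 0)],
    [Span("Hello", 10.0, 0), Span("world", 10.0, 0)],
    [Span("Next", 10.0, 1)],
]
pages = text_by_page(demo_lines)
titles = [span.text for span in largest_spans(demo_lines)]
```

With the real library, output.lines from from_path(...) would take the place of demo_lines.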
BoundingBox
Bounding box coordinates for a text span.
Attributes:
- top (float): Top coordinate
- right (float): Right coordinate
- bottom (float): Bottom coordinate
- left (float): Left coordinate
String representation: (top, right, bottom, left), the same order as the CSS margin shorthand.
Example:
bbox = span.bbox
print(f"Top-left: ({bbox.left}, {bbox.top})")
print(f"Width: {bbox.right - bbox.left}")
# abs() keeps the height positive whether y grows up or down the page
print(f"Height: {abs(bbox.top - bbox.bottom)}")
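This README does not state whether top is numerically larger or smaller than bottom, so geometry helpers are safest when they do not depend on the axis direction. The sketch below assumes only the four documented attributes. Box is a stand-in for BoundingBox, and width, height, center, and contains_point are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Stand-in with the documented BoundingBox attributes."""
    top: float
    right: float
    bottom: float
    left: float


def width(box) -> float:
    return abs(box.right - box.left)


def height(box) -> float:
    # abs() keeps the result positive whether y grows up or down the page.
    return abs(box.top - box.bottom)


def center(box):
    """Return the (x, y) midpoint of the box."""
    return ((box.left + box.right) / 2, (box.top + box.bottom) / 2)


def contains_point(box, x: float, y: float) -> bool:
    """True if (x, y) lies inside the box, for either axis direction."""
    x_lo, x_hi = sorted((box.left, box.right))
    y_lo, y_hi = sorted((box.top, box.bottom))
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi


y_up = Box(top=700.0, right=200.0, bottom=688.0, left=100.0)    # y grows upward
y_down = Box(top=92.0, right=200.0, bottom=104.0, left=100.0)   # y grows downward
```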
Usage Examples
Extract all text
from pdf_strings import from_path
output = from_path("document.pdf")
print(output.to_string())
Preserve layout
from pdf_strings import from_path
output = from_path("invoice.pdf")
# Character grid rendering preserves columns and spacing
print(output.to_string_pretty())
Access structured data
from pdf_strings import from_path
output = from_path("document.pdf")
for line_idx, line in enumerate(output.lines):
    print(f"Line {line_idx}:")
    for span in line:
        print(f" {span.text}")
        print(f" Font size: {span.font_size}")
        print(f" Position: ({span.bbox.left}, {span.bbox.top})")
        print(f" Page: {span.page}")
Find text in specific regions
from pdf_strings import from_path
output = from_path("document.pdf")
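# Find text in the top-right corner
# (the thresholds assume y is measured down from the top of the page
# and a page roughly 600pt wide; adjust them for your coordinates)
for line in output.lines:
    for span in line:
        if span.bbox.top < 100 and span.bbox.left > 400:
            print(f"Top-right text: {span.text}")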
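The same filter can be wrapped in a reusable function that takes an explicit rectangle. This is a sketch under stated assumptions: spans_in_region is a hypothetical helper, Box and Span are stand-ins carrying only the documented attributes, and the rectangle is given in whatever coordinate system the spans use.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Stand-in for BoundingBox."""
    top: float
    right: float
    bottom: float
    left: float


@dataclass
class Span:
    """Stand-in for TextSpan (only the attributes used here)."""
    text: str
    bbox: Box


def spans_in_region(lines, x_min, x_max, y_min, y_max):
    """Return spans whose bounding box lies entirely inside the rectangle."""
    hits = []
    for line in lines:
        for span in line:
            box = span.bbox
            # Sort the vertical edges so the test works for either y direction.
            y_lo, y_hi = sorted((box.top, box.bottom))
            if x_min <= box.left and box.right <= x_max and y_min <= y_lo and y_hi <= y_max:
                hits.append(span)
    return hits


demo_lines = [[
    Span("Invoice #17", Box(top=40, right=560, bottom=52, left=450)),
    Span("Dear customer", Box(top=140, right=200, bottom=152, left=72)),
]]
found = [span.text for span in spans_in_region(demo_lines, 400, 612, 0, 100)]
```

Requiring the whole box to be inside the rectangle avoids picking up spans that merely start in the region. Relax the comparison to an overlap test if partial hits are wanted.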
Extract tables by position
from pdf_strings import from_path
output = from_path("table.pdf")
# Group spans by their vertical position (rows)
rows = {}
for line in output.lines:
    for span in line:
        row_key = round(span.bbox.top / 10) * 10  # Group by ~10pt vertical bands
        if row_key not in rows:
            rows[row_key] = []
        rows[row_key].append((span.bbox.left, span.text))
# Print rows sorted by vertical position
# (reverse=True lists larger y first; drop it if y grows downward in your data)
for y_pos in sorted(rows.keys(), reverse=True):
    # Sort spans in each row by horizontal position
    row_spans = sorted(rows[y_pos], key=lambda x: x[0])
    print(" | ".join(text for _, text in row_spans))
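Fixed 10pt bands can split one visual row when its spans straddle a band boundary: tops of 104.9 and 105.1 round to 100 and 110. A tolerance-based grouping avoids that. The sketch below is an alternative technique rather than anything provided by pdf-strings. cluster_rows is a hypothetical helper, Box and Span are stand-ins, and the 3pt tolerance is an arbitrary starting value. Like the example above, it ignores page numbers, so it should be run on one page at a time.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Stand-in for BoundingBox (only the attributes used here)."""
    top: float
    left: float


@dataclass
class Span:
    """Stand-in for TextSpan (only the attributes used here)."""
    text: str
    bbox: Box


def cluster_rows(lines, tolerance=3.0):
    """Group spans into rows by vertical proximity.

    A span joins the current row while its top is within `tolerance`
    of that row's first span; otherwise it starts a new row.
    Rows are returned in ascending order of top.
    """
    spans = sorted((s for line in lines for s in line), key=lambda s: s.bbox.top)
    rows = []
    for span in spans:
        if rows and abs(span.bbox.top - rows[-1][0].bbox.top) <= tolerance:
            rows[-1].append(span)
        else:
            rows.append([span])
    # Order the cells left to right within each row.
    return [[s.text for s in sorted(row, key=lambda s: s.bbox.left)] for row in rows]


demo_lines = [
    [Span("Qty", Box(104.9, 200)), Span("Item", Box(105.1, 50))],
    [Span("2", Box(120.0, 200)), Span("Tea", Box(119.6, 50))],
]
table = cluster_rows(demo_lines)
```

Each row is a list of cell strings, so it can be joined with " | " as above or passed to csv.writer. Reverse the outer list if larger y values are higher on the page.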
Features
- Plain text extraction
- Spatial layout preservation via character grid
- Bounding box coordinates for every text span
- Font size and page information
- Password-protected PDF support
- Handles complex fonts, rotated text, and multi-column layouts
- Works with Python 3.11 and later
License
MIT