A Python library for content analysis of business PDFs

Smart PDF for Business: Extract Text, Headings, and Signatures from PDFs

Smart PDF for Business is a Python library for structured PDF content analysis, built on top of spaCy-layout. Using spaCy-layout’s page, block, and layout-aware text extraction capabilities, the library adds higher-level features for business workflows and document automation. Smart PDF for Business provides a unified interface to extract, search, and analyze PDF documents with structure-aware intelligence:

- Load PDFs from files, folders, or raw bytes.
- Extract headings, body text, sections, and layout-aware content.
- Search text using keywords or semantic similarity powered by SentenceTransformers.
- Detect handwritten or scanned signatures with bounding boxes, pages, and optional cropped images.
- Export results as plain text, Markdown, CSV, Excel, or JSON.
- Process multiple PDFs at once with batch utilities.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install smart_pdf_for_business

After initialising a `PDFDoc` object with one of the factory methods (`from_file`, `from_folder`, `from_bytes`, or `from_byte_list`), you can call its search and export methods. Most search methods return a new `PDFDoc` containing the result, so the result can be exported directly with methods such as `to_csv` or `to_excel`.

from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic")

# Export results
clauses.to_csv("output.csv")
clauses.to_excel("output.xlsx")

Alternatively, pass `as_tuple=True` to get the results back as a list of tuples instead of a `PDFDoc`.

from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic", as_tuple=True)

# Print result as tuple
print(clauses)
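With `as_tuple=True`, the API table lists the return type of `search_by_meaning` as `list[tuple[str, float]]`. The following is a minimal sketch of post-processing such a list. It assumes each tuple is (matched text, similarity score) and that a higher score means a closer match; the sample values and the `top_matches` helper are invented for illustration and are not part of the library.

```python
# Hypothetical sample shaped like list[tuple[str, float]].
# The texts and scores are invented, not real library output.
results = [
    ("Either party may terminate this agreement with 30 days notice.", 0.82),
    ("Payment is due within 14 days of invoice.", 0.21),
    ("Termination for cause takes effect immediately.", 0.67),
]


def top_matches(results, min_score=0.5):
    """Keep tuples at or above min_score, highest score first."""
    kept = [(text, score) for text, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for text, score in top_matches(results):
    print(f"{score:.2f}  {text}")
```

A similar effect may be achievable inside the library through the `threshold` and `max_results` parameters; the sketch is only useful when further filtering is needed after the call.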

📚 API

`class` PDFDoc

| Attribute | Description |
| --- | --- |
| `data: bytes` | Raw byte content of the PDF document. |
| `path: Optional[Path]` | File system path to the PDF file, if loaded from disk. |
| `name: Optional[str]` | Name of the PDF document (usually the filename). |
| `spacy_doc: Optional[Doc]` | Processed spaCy Doc object, containing text, layout, and markdown. |
| `sections: List[Tuple[str, str]]` | List of `(title, body)` tuples extracted from the document. |
| `weighted_sections: List[Tuple[Tuple[str, str], float]]` | List of sections paired with a semantic match weight. |
| `signatures: List[Signature]` | Detected signatures stored as `Signature` objects. |
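The nested tuple types of `sections` and `weighted_sections` are easier to read with type aliases. The sketch below uses only the standard library and invented sample data; the alias names and the `best_section` helper are hypothetical, and it assumes that a larger weight means a better semantic match.

```python
from typing import List, Tuple

Section = Tuple[str, str]                # (title, body)
WeightedSection = Tuple[Section, float]  # ((title, body), weight)

# Invented sample data in the documented shapes.
sections: List[Section] = [
    ("1. Scope", "This agreement covers consulting services."),
    ("2. Termination", "Either party may terminate with notice."),
]
weighted_sections: List[WeightedSection] = [
    (sections[0], 0.12),
    (sections[1], 0.91),
]


def best_section(weighted: List[WeightedSection]) -> Section:
    """Return the section with the highest weight (assumed to be the best match)."""
    section, _weight = max(weighted, key=lambda item: item[1])
    return section


title, body = best_section(weighted_sections)
print(title)
```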

| Method | Description | Parameters |
| --- | --- | --- |
| `from_file` | Creates a `PDFDoc` from a PDF file.<br>Return: `PDFDoc` | `path: str \| Path` - Path to the PDF file |
| `from_folder` | Loads all PDFs from a folder (optionally recursively).<br>Return: `list[PDFDoc]` | `folder_path: str \| Path` - Folder containing PDFs<br>`recursive: bool = True` - Include subfolders |
| `from_bytes` | Creates a `PDFDoc` from raw PDF byte content.<br>Return: `PDFDoc` | `data: bytes` - PDF bytes<br>`name: Optional[str] = None` - Name assigned to the document |
| `from_byte_list` | Creates multiple `PDFDoc` objects from PDF byte streams.<br>Return: `list[PDFDoc]` | `byte_list: List[bytes]` - List of PDF byte streams<br>`names: Optional[List[str]] = None` - Corresponding PDF names |
| `search_signature` | Detects signatures near keyword regions and optionally saves cropped images. Updates `self.signatures`.<br>Return: `PDFDoc` | `keywords: list[str]` - Signature-related keywords<br>`exact=False` - Require exact keyword matches (no substring matching)<br>`max_distance=70` - Maximum distance in pixels<br>`save_folder=None` - Folder to save crops<br>`min_contrast=30` - Minimum image contrast<br>`filter_stroke_density=True` - Stroke-density filtering<br>`enforce_text_type: bool = True` - Filter signature text using NLP |
| `search_header` | Searches section headers for keyword matches.<br>Return: `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` - Header keywords<br>`exact: bool = True` - Exact match<br>`as_tuple: bool = False` - Return tuples |
| `search_body` | Searches section bodies for keyword matches.<br>Return: `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` - Body keywords<br>`exact: bool = True` - Exact match<br>`as_tuple: bool = False` - Return tuples |
| `search_by_meaning` | Performs semantic search across sections, sentences, paragraphs, or words.<br>Return: `PDFDoc` or `list[tuple[str, float]]` | `query: str` - Search query<br>`threshold: float = 0` - Minimum similarity<br>`chunk_by: str = "section"` - Chunking strategy<br>`as_tuple: bool = False` - Return raw tuples<br>`word_count: int = 5` - Words per chunk<br>`buffer_size: int = 1` - Chunk overlap<br>`chunk_text_semantically_threshold: float = 0.3` - Semantic chunk threshold<br>`max_results: int = 5` - Maximum number of results |
| `to_text` | Converts the PDF to plain text with optional signature annotations.<br>Return: `str` | `annotate_signatures: bool = True` - Annotate signature positions |
| `to_markdown` | Converts the PDF to Markdown with optional signature annotations.<br>Return: `str` | `annotate_signatures: bool = True` - Annotate signature positions |
| `to_dataframe` | Converts the document to a structured pandas DataFrame.<br>Return: `pandas.DataFrame` | None |
| `to_excel` | Exports the document DataFrame to an Excel (.xlsx) file.<br>Return: `None` | `path: str \| Path` - Output .xlsx path<br>`sheet_name: str = "Sheet1"` - Sheet name |
| `to_csv` | Exports the document DataFrame to a CSV file.<br>Return: `None` | `path: str \| Path` - Output CSV file path |
| `to_json` | Serializes the document into JSON, optionally writing to a file.<br>Return: `str` (JSON) or `None` (if saved to file) | `path: Optional[str \| Path] = None` - Optional save path<br>`indent: int = 2` - JSON indentation |
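The `exact` flag on `search_header` and `search_body` switches between exact and partial keyword matching. The table does not spell out the matching rules, so the sketch below is only one plausible reading, written with the standard library: it assumes case-insensitive comparison, treats `exact=True` as "the whole title equals a keyword", and treats `exact=False` as substring matching. The `match_headers` function is hypothetical and is not the library's implementation.

```python
from typing import List, Tuple

Section = Tuple[str, str]  # (title, body)


def match_headers(
    sections: List[Section], keywords: List[str], exact: bool = True
) -> List[Section]:
    """Return sections whose title matches any keyword (case-insensitive).

    exact=True  -> the whole title must equal a keyword (assumption)
    exact=False -> a keyword may appear anywhere in the title (assumption)
    """
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for title, body in sections:
        lowered = title.lower()
        if exact:
            matched = lowered in wanted
        else:
            matched = any(keyword in lowered for keyword in wanted)
        if matched:
            hits.append((title, body))
    return hits


# Invented sample sections.
sample = [
    ("Termination", "Either party may terminate with notice."),
    ("Early Termination Fees", "A fee applies in the first year."),
    ("Payment", "Invoices are due within 14 days."),
]

print(match_headers(sample, ["termination"]))
print(match_headers(sample, ["termination"], exact=False))
```

If the library's own behaviour differs (for example whole-word rather than whole-title matching), the results of `search_header` will differ from this sketch accordingly.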

`class` PDFDocBatch

| Attribute | Description |
| --- | --- |
| `pdfdocs: list[PDFDoc]` | List containing all `PDFDoc` instances in the batch. |

| Method | Description | Parameters |
| --- | --- | --- |
| `from_folder` | Creates a batch from all PDFs in a folder.<br>Return: `PDFDocBatch` | `path: str \| Path` - Path to the folder containing PDFs<br>`recursive: bool = True` - Whether to include subfolders |
| `from_byte_list` | Creates a batch from a list of PDF byte streams.<br>Return: `PDFDocBatch` | `byte_list: List[bytes]` - List of PDF bytes<br>`names: Optional[List[str]] = None` - Optional list of names corresponding to each PDF |
| `from_pdfdoc_list` | Creates a batch from an existing list of `PDFDoc` objects.<br>Return: `PDFDocBatch` | `pdfdoc_list: List[PDFDoc]` - List of `PDFDoc` instances |
| `extend` | Extends the batch with another batch or list of PDFs.<br>Return: `None` | `other: Union[List[PDFDoc], PDFDocBatch]` - Batch or list of PDFs to append |
| `append` | Appends a single `PDFDoc` to the batch.<br>Return: `None` | `pdfdoc: PDFDoc` - `PDFDoc` instance to append |
| `search_signature` | Detects handwritten or scanned signatures associated with keyword regions for multiple PDFs.<br>Return: `PDFDocBatch` | `keywords: List[str]` - Keywords used to locate signature-related regions<br>`exact: bool = False` - Whether keyword matching must be exact<br>`max_distance: int = 70` - Maximum pixel distance between a keyword block and signature<br>`save_folder: Optional[str \| Path] = None` - Folder to save cropped signature images, if provided<br>`min_contrast: int = 30` - Minimum pixel contrast threshold<br>`filter_stroke_density: bool = True` - Filter image regions based on stroke density<br>`enforce_text_type: bool = True` - Filter signature text using NLP |
| `search_header` | Searches headers in every PDF for keywords.<br>Return: `PDFDocBatch` | `keywords: list[str]` - Header keywords<br>`exact: bool = True` - Exact or partial matching |
| `search_body` | Searches body text in every PDF for keywords.<br>Return: `PDFDocBatch` | `keywords: list[str]` - Body keywords<br>`exact: bool = True` - Exact or partial matching |
| `search_by_meaning` | Performs semantic search across all PDFs.<br>Return: `PDFDocBatch` | `query: str` - Natural language query<br>`threshold: float = 0` - Minimum similarity score<br>`chunk_by: str = "section"` - Chunking method<br>`word_count: int = 5` - Words per chunk<br>`buffer_size: int = 1` - Overlap size<br>`chunk_text_semantically_threshold: float = 0.3` - Semantic similarity threshold<br>`max_results: int = 5` - Maximum number of results per PDF |
| `to_dataframe` | Combines all PDFs in the batch into a DataFrame. PDFs with the same name overwrite each other.<br>Return: `pandas.DataFrame` | None |
| `to_excel` | Exports batch data to an Excel file.<br>Return: `None` | `path: str \| Path` - Output Excel path<br>`sheet_name: str = "Sheet1"` - Sheet name |
| `to_csv` | Exports batch data to a CSV file.<br>Return: `None` | `path: str \| Path` - Output CSV file path |
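The note that PDFs with the same name overwrite each other in `to_dataframe` matters when a batch is built from byte streams with repeated or missing names. The standard-library sketch below illustrates the effect with a plain dictionary keyed by name, and shows one way to make names unique before passing them as `names` to `from_byte_list`. Both helpers are hypothetical, and the dictionary is only an analogy for the documented behaviour, not the library's internals.

```python
def combine_by_name(docs):
    """Merge (name, rows) pairs into a dict keyed by name.

    A later entry replaces an earlier one with the same name, which mirrors
    the overwrite behaviour described for the batch DataFrame.
    """
    combined = {}
    for name, rows in docs:
        combined[name] = rows
    return combined


def unique_names(names):
    """Suffix repeated names with _2, _3, ... so that no name repeats."""
    seen = {}
    result = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


# Invented sample: two different files that happen to share a name.
docs = [("contract.pdf", ["row A"]), ("contract.pdf", ["row B"])]
print(combine_by_name(docs))
print(unique_names(["contract.pdf", "invoice.pdf", "contract.pdf"]))
```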

`dataclass` Signature

| Attribute | Description |
| --- | --- |
| `keyword` | Keyword used for search |
| `text` | OCR text of signature region |
| `bbox` | Bounding box coordinates (x1, y1, x2, y2) |
| `page` | Page number |
| `img` | Cropped image as `numpy.ndarray` |
| `distance` | Distance to keyword |

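For orientation, a dataclass with the same field names might look like the sketch below. Only the field names and the `(x1, y1, x2, y2)` layout of `bbox` come from the table; the field types, the class name `SignatureSketch`, and the `bbox_size` helper are assumptions, and `img` is typed as `Any` so that the example runs without NumPy.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class SignatureSketch:
    """Hypothetical stand-in for the library's Signature dataclass."""

    keyword: str                             # keyword used for the search
    text: str                                # OCR text of the signature region
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    page: int                                # page number
    img: Optional[Any]                       # numpy.ndarray in the real class
    distance: float                          # distance to the keyword


def bbox_size(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return (width, height) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


# Invented sample values.
sig = SignatureSketch(
    keyword="Signed by",
    text="",
    bbox=(100.0, 700.0, 260.0, 740.0),
    page=3,
    img=None,
    distance=42.5,
)
print(bbox_size(sig.bbox))
```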
Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

Release history

1.0.0 (this version), released Nov 16, 2025

Download files

Source Distribution

smart_pdf_for_business-1.0.0.tar.gz (11.0 MB)

Uploaded Nov 16, 2025 Source

Built Distribution

smart_pdf_for_business-1.0.0-py3-none-any.whl (20.9 kB)

Uploaded Nov 16, 2025 Python 3

File details

Details for the file smart_pdf_for_business-1.0.0.tar.gz.

File metadata

Download URL: smart_pdf_for_business-1.0.0.tar.gz
Upload date: Nov 16, 2025
Size: 11.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for smart_pdf_for_business-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `52966d45d7777a4fccdc66b0ed85a14ba35dbe27b3d4b93182e50fb0066dbf1d` |
| MD5 | `68c8a4b6e0ff66e567151002b9705a12` |
| BLAKE2b-256 | `350178c088c62cbfd935bd55f31f6be1b3e32f7282beb744e6b140c76ca57f82` |

File details

Details for the file smart_pdf_for_business-1.0.0-py3-none-any.whl.

File metadata

Download URL: smart_pdf_for_business-1.0.0-py3-none-any.whl
Upload date: Nov 16, 2025
Size: 20.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for smart_pdf_for_business-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c61c7079ab5d490a013a98a39a834b509b4a5bef82b3b7c523de87756c6ea49a` |
| MD5 | `ec09cc029c32274802f1ed361b12a3ae` |
| BLAKE2b-256 | `5a6dab116909d0f70efa24c0099c2453cb982aa9b186d9849bb7f4412ca3b1de` |
