A Python library for analyzing the content of business PDFs
Smart PDF for Business: Extract Text, Headings, and Signatures from PDFs
Smart PDF for Business is a Python library for structured PDF content analysis, built on top of spaCy-layout. Using spaCy-layout’s page, block, and layout-aware text extraction capabilities, the library adds higher-level features for business workflows and document automation. Smart PDF for Business provides a unified interface to extract, search, and analyze PDF documents with structure-aware intelligence:
- Load PDFs from files, folders, or raw bytes.
- Extract headings, body text, sections, and layout-aware content.
- Search text using keywords or semantic similarity powered by SentenceTransformers.
- Detect handwritten or scanned signatures with bounding boxes, pages, and optional cropped images.
- Export results as plain text, Markdown, CSV, Excel, or JSON.
- Process multiple PDFs at once with batch utilities.
📝 Usage
⚠️ This package requires Python 3.10 or above.
```bash
pip install smart_pdf_for_business
```
After initialising a PDFDoc object with one of the factory methods, you can call its built-in methods. Most of them return a new PDFDoc containing the result, so search results can be exported with the same output methods (for example to_csv or to_excel) as a full document.
```python
from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic")

# Export results
clauses.to_csv("output.csv")
clauses.to_excel("output.xlsx")
```
Alternatively, pass as_tuple=True to get the results as a list of tuples instead of a PDFDoc.
```python
from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic", as_tuple=True)

# Print the result (a list of tuples)
print(clauses)
```
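
According to the API table, search_by_meaning with as_tuple=True returns a list[tuple[str, float]]. The following is a minimal sketch of post-processing such a list. The helper name `top_matches`, the sample data, and the assumption that a higher score means a closer match are illustrative and not part of the library.

```python
from typing import List, Tuple

# (chunk text, similarity score): the tuple shape documented for as_tuple=True
Match = Tuple[str, float]


def top_matches(results: List[Match], min_score: float = 0.5, limit: int = 3) -> List[Match]:
    """Keep matches scoring at least min_score, best first, at most `limit` items."""
    kept = [m for m in results if m[1] >= min_score]
    kept.sort(key=lambda m: m[1], reverse=True)
    return kept[:limit]


# Hypothetical sample standing in for the output of:
#   pdf.search_by_meaning("termination clause", as_tuple=True)
sample = [
    ("Either party may terminate with 30 days notice.", 0.82),
    ("Payment is due within 14 days.", 0.21),
    ("This agreement ends upon breach.", 0.64),
]

for text, score in top_matches(sample):
    print(f"{score:.2f}  {text}")
```

Filtering on the caller's side like this is only one option; the documented threshold and max_results parameters of search_by_meaning may achieve the same effect inside the library.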
📚 API
class PDFDoc
| Attribute | Description |
|---|---|
| `data: bytes` | Raw byte content of the PDF document. |
| `path: Optional[Path]` | File system path to the PDF file, if loaded from disk. |
| `name: Optional[str]` | Name of the PDF document (usually the filename). |
| `spacy_doc: Optional[Doc]` | Processed spaCy Doc object, containing text, layout, and markdown. |
| `sections: List[Tuple[str, str]]` | List of (title, body) tuples extracted from the document. |
| `weighted_sections: List[Tuple[Tuple[str, str], float]]` | List of sections paired with a semantic match weight. |
| `signatures: List[Signature]` | Detected signatures stored as Signature objects. |
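
Because sections is documented as a plain list of (title, body) tuples, it can be processed with ordinary Python. The sketch below computes a word count per section; the helper name `section_word_counts` and the sample sections are hypothetical and only mirror the documented shape.

```python
from typing import List, Tuple

# (title, body): the documented shape of the entries in PDFDoc.sections
Section = Tuple[str, str]


def section_word_counts(sections: List[Section]) -> List[Tuple[str, int]]:
    """Return (title, number of whitespace-separated words in the body) per section."""
    return [(title, len(body.split())) for title, body in sections]


# Hypothetical data standing in for pdf.sections
sample_sections = [
    ("1. Scope", "This agreement covers all services."),
    ("2. Term", "Twelve months."),
]

for title, count in section_word_counts(sample_sections):
    print(f"{title}: {count} words")
```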
| Methods | Description | Parameter |
|---|---|---|
| `from_file` | Creates a PDFDoc from a PDF file. Return: `PDFDoc` | `path: str \| Path` - Path to the PDF file |
| `from_folder` | Loads all PDFs from a folder (optionally recursively). Return: `list[PDFDoc]` | `folder_path: str \| Path` - Folder containing PDFs<br>`recursive: bool = True` - Include subfolders |
| `from_bytes` | Creates a PDFDoc from raw PDF byte content. Return: `PDFDoc` | `data: bytes` - PDF bytes<br>`name: Optional[str] = None` - Name assigned to document |
| `from_byte_list` | Creates multiple PDFDoc objects from PDF byte streams. Return: `list[PDFDoc]` | `byte_list: List[bytes]` - List of PDF byte streams<br>`names: Optional[List[str]] = None` - Corresponding PDF names |
| `search_signature` | Detects signatures near keyword regions and optionally saves cropped images. Updates `self.signatures`. Return: `PDFDoc` | `keywords: list[str]` - Signature-related keywords<br>`exact=False` - Require exact matches (no substring matching)<br>`max_distance=70` - Max pixel distance<br>`save_folder=None` - Folder to save crops<br>`min_contrast=30` - Minimum image contrast<br>`filter_stroke_density=True` - Stroke-density filtering<br>`enforce_text_type: bool = True` - Filter signature text using NLP |
| `search_header` | Searches section headers for keyword matches. Return: `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` - Header keywords<br>`exact: bool = True` - Exact match<br>`as_tuple: bool = False` - Return tuples |
| `search_body` | Searches section bodies for keyword matches. Return: `PDFDoc` or `list[tuple[str, str]]` | `keywords: list[str]` - Body keywords<br>`exact: bool = True` - Exact match<br>`as_tuple: bool = False` - Return tuples |
| `search_by_meaning` | Performs semantic search across sections, sentences, paragraphs, or words. Return: `PDFDoc` or `list[tuple[str, float]]` | `query: str` - Search query<br>`threshold: float = 0` - Minimum similarity<br>`chunk_by: str = "section"` - Chunking strategy<br>`as_tuple: bool = False` - Return raw tuples<br>`word_count: int = 5` - Words per chunk<br>`buffer_size: int = 1` - Chunk overlap<br>`chunk_text_semantically_threshold: float = 0.3` - Semantic chunk threshold<br>`max_results: int = 5` - Max results |
| `to_text` | Converts the PDF to plain text with optional signature annotations. Return: `str` | `annotate_signatures: bool = True` - Annotate signature positions |
| `to_markdown` | Converts the PDF to Markdown with optional signature annotations. Return: `str` | `annotate_signatures: bool = True` - Annotate signature positions |
| `to_dataframe` | Converts the document to a structured pandas DataFrame. Return: `pandas.DataFrame` | None |
| `to_excel` | Exports the document DataFrame to an Excel (.xlsx) file. Return: `None` | `path: str \| Path` - Output .xlsx path<br>`sheet_name: str = "Sheet1"` - Sheet name |
| `to_csv` | Exports the document DataFrame to a CSV file. Return: `None` | `path: str \| Path` - Output CSV file path |
| `to_json` | Serializes the document into JSON, optionally writing to a file. Return: `str` (JSON) or `None` (if saved to file) | `path: Optional[str \| Path] = None` - Optional save path<br>`indent: int = 2` - JSON formatting |
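
To make the difference between exact and substring keyword matching concrete, here is a small standalone sketch over (title, body) tuples. It does not call the library, and the matching rules are assumptions: the source does not state whether search_header is case-sensitive or whether "exact" means a whole-word or a whole-header match. The function name `match_headers` and the sample data are hypothetical.

```python
import re
from typing import List, Tuple

Section = Tuple[str, str]  # (title, body)


def match_headers(sections: List[Section], keywords: List[str], exact: bool = True) -> List[Section]:
    """Return sections whose title matches any keyword.

    Assumed rules (not confirmed by the library docs):
    exact=True  -> case-insensitive whole-word match
    exact=False -> case-insensitive substring match
    """
    hits = []
    for title, body in sections:
        for kw in keywords:
            if exact:
                found = re.search(rf"\b{re.escape(kw)}\b", title, flags=re.IGNORECASE) is not None
            else:
                found = kw.lower() in title.lower()
            if found:
                hits.append((title, body))
                break  # one hit per section is enough
    return hits


# Hypothetical sections
sections = [
    ("Termination", "Either party may terminate."),
    ("Determination of Fees", "Fees are set annually."),
    ("Payment", "Due within 14 days."),
]

print([t for t, _ in match_headers(sections, ["termination"], exact=True)])
print([t for t, _ in match_headers(sections, ["termination"], exact=False)])
```

Under these assumed rules, the substring mode also picks up "Determination of Fees", which is the kind of false positive an exact mode is meant to avoid.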
class PDFDocBatch
| Attribute | Description |
|---|---|
| `pdfdocs: list[PDFDoc]` | List containing all PDFDoc instances in the batch. |
| Methods | Description | Parameter |
|---|---|---|
| `from_folder` | Creates a batch from all PDFs in a folder. Return: `PDFDocBatch` | `path: str \| Path` - Path to the folder containing PDFs<br>`recursive: bool = True` - Whether to include subfolders |
| `from_byte_list` | Creates a batch from a list of PDF byte streams. Return: `PDFDocBatch` | `byte_list: List[bytes]` - List of PDF bytes<br>`names: Optional[List[str]] = None` - Optional list of names corresponding to each PDF |
| `from_pdfdoc_list` | Creates a batch from an existing list of PDFDoc objects. Return: `PDFDocBatch` | `pdfdoc_list: List[PDFDoc]` - List of PDFDoc instances |
| `extend` | Extends the batch with another batch or list of PDFs. Return: `None` | `other: Union[List[PDFDoc], PDFDocBatch]` - Batch or list of PDFs to append |
| `append` | Appends a single PDFDoc to the batch. Return: `None` | `pdfdoc: PDFDoc` - PDFDoc instance to append |
| `search_signature` | Detects handwritten or scanned signatures associated with keyword regions for multiple PDFs. Return: `PDFDocBatch` | `keywords: List[str]` - Keywords used to locate signature-related regions<br>`exact: bool = False` - Whether keyword matching must be exact<br>`max_distance: int = 70` - Maximum pixel distance between a keyword block and signature<br>`save_folder: Optional[str \| Path] = None` - Folder to save cropped signature images, if provided<br>`min_contrast: int = 30` - Minimum pixel contrast threshold<br>`filter_stroke_density: bool = True` - Filter image regions based on stroke density<br>`enforce_text_type: bool = True` - Filter signature text using NLP |
| `search_header` | Searches headers in every PDF for keywords. Return: `PDFDocBatch` | `keywords: list[str]` - Header keywords<br>`exact: bool = True` - Exact or partial matching |
| `search_body` | Searches body text in every PDF for keywords. Return: `PDFDocBatch` | `keywords: list[str]` - Body keywords<br>`exact: bool = True` - Exact or partial matching |
| `search_by_meaning` | Performs semantic search across all PDFs. Return: `PDFDocBatch` | `query: str` - Natural language query<br>`threshold: float = 0` - Min similarity score<br>`chunk_by: str = "section"` - Chunking method<br>`word_count: int = 5` - Words per chunk<br>`buffer_size: int = 1` - Overlap size<br>`chunk_text_semantically_threshold: float = 0.3` - Semantic similarity threshold<br>`max_results: int = 5` - Max results per PDF |
| `to_dataframe` | Combines all PDFs in the batch into a DataFrame. PDFs with the same name overwrite each other. Return: `pandas.DataFrame` | None |
| `to_excel` | Exports batch data to an Excel file. Return: `None` | `path: str \| Path` - Output Excel path<br>`sheet_name: str = "Sheet1"` - Sheet name |
| `to_csv` | Exports batch data to a CSV file. Return: `None` | `path: str \| Path` - Output CSV file path |
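
Since to_dataframe notes that PDFs with the same name overwrite each other, it may help to deduplicate names before building a batch. The sketch below is one possible approach using only the standard library; the helper `make_unique` and the numeric suffix scheme are assumptions, not something the library provides.

```python
from pathlib import Path
from typing import List


def make_unique(names: List[str]) -> List[str]:
    """Return names in the same order, adding _2, _3, ... before the extension on repeats."""
    used = set()
    result = []
    for name in names:
        candidate, n = name, 1
        while candidate in used:
            n += 1
            p = Path(name)
            candidate = f"{p.stem}_{n}{p.suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


# Hypothetical file names, two of which collide
names = ["invoice.pdf", "contract.pdf", "invoice.pdf", "invoice.pdf"]
print(make_unique(names))
```

The resulting list could then be passed as the documented names argument of PDFDocBatch.from_byte_list, so that each document keeps its own rows in the combined DataFrame.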
dataclass Signature
| Attribute | Description |
|---|---|
| `keyword` | Keyword used for search |
| `text` | OCR text of signature region |
| `bbox` | Bounding box coordinates (x1, y1, x2, y2) |
| `page` | Page number |
| `img` | Cropped image as numpy.ndarray |
| `distance` | Distance to keyword |
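
As an illustration of how the Signature fields might be used, the sketch below defines a stand-in dataclass with the documented attribute names (img omitted) and picks the signature closest to its keyword on each page. The class `SignatureRecord`, the field types, the helper names, and the sample values are all hypothetical; only the attribute names and the (x1, y1, x2, y2) bbox order come from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SignatureRecord:
    """Hypothetical stand-in for the library's Signature (field types assumed)."""
    keyword: str
    text: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    page: int
    distance: float


def bbox_size(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return (width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


def nearest_per_page(signatures: List[SignatureRecord]) -> Dict[int, SignatureRecord]:
    """For each page, keep the signature with the smallest distance to its keyword."""
    best: Dict[int, SignatureRecord] = {}
    for sig in signatures:
        if sig.page not in best or sig.distance < best[sig.page].distance:
            best[sig.page] = sig
    return best


# Hypothetical detections
sigs = [
    SignatureRecord("Signature", "J. Doe", (100, 700, 260, 750), page=3, distance=42.0),
    SignatureRecord("Signed by", "", (300, 690, 420, 740), page=3, distance=18.5),
    SignatureRecord("Signature", "A. Smith", (90, 650, 250, 705), page=5, distance=30.0),
]

for page, sig in sorted(nearest_per_page(sigs).items()):
    print(page, sig.keyword, bbox_size(sig.bbox))
```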
Contributing
Contributions are welcome! Please fork the repository and submit a pull request.