A Python library for content analysis of business PDFs

Smart PDF for Business: Extract Text, Headings, and Signatures from PDFs

Smart PDF for Business is a Python library for structured PDF content analysis, built on top of spaCy-layout. Using spaCy-layout’s page, block, and layout-aware text extraction capabilities, the library adds higher-level features for business workflows and document automation. Smart PDF for Business provides a unified interface to extract, search, and analyze PDF documents with structure-aware intelligence:

  • Load PDFs from files, folders, or raw bytes.
  • Extract headings, body text, sections, and layout-aware content.
  • Search text using keywords or semantic similarity powered by SentenceTransformers.
  • Detect handwritten or scanned signatures with bounding boxes, pages, and optional cropped images.
  • Export results as plain text, Markdown, CSV, Excel, or JSON.
  • Process multiple PDFs at once with batch utilities.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install smart_pdf_for_business

After creating a PDFDoc with one of the factory methods (from_file, from_folder, from_bytes, from_byte_list), you can call its search and export methods. Most search methods return a new PDFDoc containing the result, so the export methods (to_csv, to_excel, to_json, and so on) can be called directly on the search output.

from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic")

# Export results
clauses.to_csv("output.csv")
clauses.to_excel("output.xlsx")

Alternatively, pass as_tuple=True to get the raw results instead of a new PDFDoc. For search_by_meaning, this is a list[tuple[str, float]].

from smart_pdf_for_business import PDFDoc

# Load a PDF
pdf = PDFDoc.from_file("example.pdf")

# Perform semantic search
clauses = pdf.search_by_meaning("termination clause", chunk_by="semantic", as_tuple=True)

# Print the raw list of tuples
print(clauses)
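
Once the results are plain tuples, they can be post-processed with ordinary Python. The sketch below assumes the float in each tuple is a similarity score where higher is better. The sample data and the helper name top_matches are invented for illustration and are not part of the library.

```python
from typing import List, Tuple

# Hypothetical stand-in for the list returned with as_tuple=True;
# the texts and scores are invented for illustration.
results: List[Tuple[str, float]] = [
    ("Either party may terminate this agreement with 30 days notice.", 0.82),
    ("Payment is due within 14 days of invoice.", 0.21),
    ("Termination for cause takes effect immediately.", 0.74),
]


def top_matches(
    pairs: List[Tuple[str, float]], min_score: float = 0.5, limit: int = 5
) -> List[Tuple[str, float]]:
    """Keep pairs scoring at least min_score, best first, at most `limit` items."""
    kept = [pair for pair in pairs if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


for text, score in top_matches(results):
    print(f"{score:.2f}  {text}")
```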

📚 API

class PDFDoc

| Attribute | Description |
| --- | --- |
| data: bytes | Raw byte content of the PDF document. |
| path: Optional[Path] | File system path to the PDF file, if loaded from disk. |
| name: Optional[str] | Name of the PDF document (usually the filename). |
| spacy_doc: Optional[Doc] | Processed spaCy Doc object, containing text, layout, and markdown. |
| sections: List[Tuple[str, str]] | List of (title, body) tuples extracted from the document. |
| weighted_sections: List[Tuple[Tuple[str, str], float]] | List of sections paired with a semantic match weight. |
| signatures: List[Signature] | Detected signatures stored as Signature objects. |
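
The nested shape of weighted_sections can be awkward to read at first. As a rough illustration of the documented type, the sketch below flattens ((title, body), weight) entries into dictionaries sorted by weight. The helper rank_sections and the sample data are hypothetical, and it assumes a larger weight means a closer match.

```python
from typing import Dict, List, Tuple

Section = Tuple[str, str]                # (title, body)
WeightedSection = Tuple[Section, float]  # ((title, body), weight)


def rank_sections(weighted: List[WeightedSection]) -> List[Dict[str, object]]:
    """Flatten ((title, body), weight) entries into dicts, highest weight first."""
    ordered = sorted(weighted, key=lambda item: item[1], reverse=True)
    return [
        {"title": title, "body": body, "weight": weight}
        for (title, body), weight in ordered
    ]


# Invented sample data in the documented shape.
sample: List[WeightedSection] = [
    (("Payment", "Invoices are due within 14 days."), 0.31),
    (("Termination", "Either party may terminate with notice."), 0.88),
]

rows = rank_sections(sample)
print(rows[0]["title"])
```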

Methods Description Parameter
from_file Creates a PDFDoc from a PDF file.
Return: PDFDoc
path: str | Path
- Path to the PDF file
from_folder Loads all PDFs from a folder (optionally recursively).
Return: list[PDFDoc]
folder_path: str | Path
- Folder containing PDFs
recursive: bool = True
- Include subfolders
from_bytes Creates a PDFDoc from raw PDF byte content.
Return: PDFDoc
data: bytes
- PDF bytes
name: Optional[str] = None
- Name assigned to document
from_byte_list Creates multiple PDFDoc objects from PDF byte streams.
Return: list[PDFDoc]
byte_list: List[bytes]
- List of PDF byte streams
names: Optional[List[str]] = None
- Corresponding PDF names
search_signature Detects signatures near keyword regions and optionally saves cropped images. Updates self.signatures.
Return: PDFDoc
keywords: list[str]
- Signature-related keywords
exact: bool = False
- Whether keyword matching must be exact (no substring matches)
max_distance: int = 70
- Max pixel distance between a keyword block and a signature
save_folder: Optional[str | Path] = None
- Folder to save cropped images
min_contrast: int = 30
- Minimum image contrast
filter_stroke_density: bool = True
- Stroke-density filtering
enforce_text_type: bool = True
- Filter signature text using NLP
search_header Searches section headers for keyword matches.
Return: PDFDoc or list[tuple[str, str]]
keywords: list[str]
- Header keywords
exact: bool = True
- Exact match
as_tuple: bool = False
- Return tuples
search_body Searches section bodies for keyword matches.
Return: PDFDoc or list[tuple[str, str]]
keywords: list[str]
- Body keywords
exact: bool = True
- Exact match
as_tuple: bool = False
- Return tuples
search_by_meaning Performs semantic search across sections, sentences, paragraphs, or words.
Return: PDFDoc or list[tuple[str, float]]
query: str
- Search query
threshold: float = 0
- Minimum similarity
chunk_by: str = "section"
- Chunking strategy
as_tuple: bool = False
- Return raw tuples
word_count: int = 5
- Words per chunk
buffer_size: int = 1
- Chunk overlap
chunk_text_semantically_threshold: float = 0.3
- Semantic chunk threshold
max_results: int = 5
- Max results
to_text Converts the PDF to plain text with optional signature annotations.
Return: str
annotate_signatures: bool = True
- Annotate signature positions
to_markdown Converts the PDF to Markdown with optional signature annotations.
Return: str
annotate_signatures: bool = True
- Annotate signature positions
to_dataframe Converts the document to a structured pandas DataFrame.
Return: pandas.DataFrame
None
to_excel Exports the document DataFrame to an Excel (.xlsx) file.
Return: None
path: str | Path
- Output .xlsx path
sheet_name: str = "Sheet1"
- Sheet name
to_csv Exports the document DataFrame to a CSV file.
Return: None
path: str | Path
- Output CSV file path
to_json Serializes the document into JSON, optionally writing to a file.
Return: str (JSON) or None (if saved to file)
path: Optional[str | Path] = None
- Optional save path
indent: int = 2
- JSON formatting
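
The word_count and buffer_size parameters of search_by_meaning are described only briefly above. One plausible reading, offered as an assumption rather than the library's actual behaviour, is a sliding window over words where neighbouring chunks share a few words. The helper chunk_words below is hypothetical and only illustrates that idea.

```python
from typing import List


def chunk_words(text: str, word_count: int = 5, overlap: int = 1) -> List[str]:
    """Split text into chunks of `word_count` words.

    Neighbouring chunks share `overlap` words. This is an illustrative
    sketch, not the library's implementation.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    if not 0 <= overlap < word_count:
        raise ValueError("overlap must be in the range [0, word_count)")

    words = text.split()
    step = word_count - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + word_count]))
        # Stop once a chunk has reached the end of the text.
        if start + word_count >= len(words):
            break
    return chunks


chunks = chunk_words(
    "one two three four five six seven eight nine", word_count=5, overlap=1
)
print(chunks)
```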

class PDFDocBatch

| Attribute | Description |
| --- | --- |
| pdfdocs: list[PDFDoc] | List containing all PDFDoc instances in the batch. |
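
A batch workflow might look like the sketch below, which uses only the PDFDocBatch methods listed in the table that follows. It assumes PDFDocBatch can be imported from the package root in the same way as PDFDoc, which the examples above do not confirm. The wrapper function export_batch_matches is hypothetical.

```python
from pathlib import Path


def export_batch_matches(folder, query, out_csv, threshold=0.0):
    """Search every PDF in `folder` for `query` and write the matches to a CSV file."""
    # Imported lazily so this module loads even without the package installed.
    # Assumption: PDFDocBatch is exported from the package root, like PDFDoc.
    from smart_pdf_for_business import PDFDocBatch

    batch = PDFDocBatch.from_folder(Path(folder), recursive=True)
    matches = batch.search_by_meaning(query, threshold=threshold)
    matches.to_csv(Path(out_csv))
    return matches
```

A call such as export_batch_matches("contracts", "termination clause", "matches.csv") would then process a folder in one step. The folder and file names here are placeholders.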

Methods Description Parameter
from_folder Creates a batch from all PDFs in a folder.
Return: PDFDocBatch
path: str | Path
- Path to the folder containing PDFs
recursive: bool = True
- Whether to include subfolders
from_byte_list Creates a batch from a list of PDF byte streams.
Return: PDFDocBatch
byte_list: List[bytes]
- List of PDF bytes
names: Optional[List[str]] = None
- Optional list of names corresponding to each PDF
from_pdfdoc_list Creates a batch from an existing list of PDFDoc objects.
Return: PDFDocBatch
pdfdoc_list: List[PDFDoc]
- List of PDFDoc instances
extend Extends the batch with another batch or list of PDFs.
Return: None
other: Union[List[PDFDoc], PDFDocBatch]
- Batch or list of PDFs to append
append Appends a single PDFDoc to the batch.
Return: None
pdfdoc: PDFDoc
- PDFDoc instance to append
search_signature Detects handwritten or scanned signatures associated with keyword regions for multiple PDFs.
Return: PDFDocBatch
keywords: List[str]
- Keywords used to locate signature-related regions
exact: bool = False
- Whether keyword matching must be exact
max_distance: int = 70
- Maximum pixel distance between a keyword block and signature
save_folder: Optional[str | Path] = None
- Folder to save cropped signature images, if provided
min_contrast: int = 30
- Minimum pixel contrast threshold
filter_stroke_density: bool = True
- Filter image regions based on stroke density
enforce_text_type: bool = True
- Filter signature text using NLP
search_header Searches headers in every PDF for keywords.
Return: PDFDocBatch
keywords: list[str]
- Header keywords
exact: bool = True
- Exact or partial matching
search_body Searches body text in every PDF for keywords.
Return: PDFDocBatch
keywords: list[str]
- Body keywords
exact: bool = True
- Exact or partial matching
search_by_meaning Performs semantic search across all PDFs.
Return: PDFDocBatch
query: str
- Natural language query
threshold: float = 0
- Min similarity score
chunk_by: str = "section"
- Chunking method
word_count: int = 5
- Words per chunk
buffer_size: int = 1
- Overlap size
chunk_text_semantically_threshold: float = 0.3
- Semantic similarity threshold
max_results: int = 5
- Max results per PDF
to_dataframe Combines all PDFs in the batch into a DataFrame. PDFs with the same name overwrite each other.
Return: pandas.DataFrame
None
to_excel Exports batch data to an Excel file.
Return: None
path: str | Path
- Output Excel path
sheet_name: str = "Sheet1"
- Sheet name
to_csv Exports batch data to a CSV file.
Return: None
path: str | Path
- Output CSV file path
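
Because to_dataframe lets PDFs with the same name overwrite each other, it may help to make names unique before building a batch, for example before passing names to from_byte_list. The helper unique_names below is a hypothetical sketch of one way to do this and is not part of the library.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import List


def unique_names(names: List[str]) -> List[str]:
    """Return the names with a numeric suffix added to repeats.

    Example: "a.pdf", "a.pdf" becomes "a.pdf", "a (2).pdf".
    Note: a generated name could still clash with a name that
    already follows this pattern in the input.
    """
    seen: Counter = Counter()
    result: List[str] = []
    for name in names:
        seen[name] += 1
        if seen[name] == 1:
            result.append(name)
        else:
            path = PurePosixPath(name)
            result.append(f"{path.stem} ({seen[name]}){path.suffix}")
    return result


print(unique_names(["a.pdf", "b.pdf", "a.pdf"]))
```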

dataclass Signature

| Attribute | Description |
| --- | --- |
| keyword | Keyword used for search |
| text | OCR text of signature region |
| bbox | Bounding box coordinates (x1, y1, x2, y2) |
| page | Page number |
| img | Cropped image as numpy.ndarray |
| distance | Distance to keyword |
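
As an illustration of working with these fields, the sketch below uses a hypothetical stand-in class, SignatureLike, that mirrors a subset of the documented attributes. It computes the size of a bounding box and keeps the detection closest to its keyword on each page. It assumes bbox is ordered (x1, y1, x2, y2) as documented and that a smaller distance is better.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SignatureLike:
    """Hypothetical stand-in mirroring some documented Signature fields."""
    keyword: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    page: int
    distance: float


def bbox_size(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return (width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


def closest_per_page(signatures: List[SignatureLike]) -> List[SignatureLike]:
    """Keep the signature with the smallest keyword distance on each page."""
    best: Dict[int, SignatureLike] = {}
    for sig in signatures:
        if sig.page not in best or sig.distance < best[sig.page].distance:
            best[sig.page] = sig
    return [best[page] for page in sorted(best)]


# Invented sample detections.
found = [
    SignatureLike("Signed by", (100, 700, 260, 750), page=3, distance=42.0),
    SignatureLike("Signed by", (90, 400, 200, 430), page=3, distance=12.5),
    SignatureLike("Signature", (50, 600, 210, 660), page=1, distance=30.0),
]

for sig in closest_per_page(found):
    print(sig.page, bbox_size(sig.bbox), sig.distance)
```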

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

