Python tools for PDF automation.
What is PDFlex?
PDFlex is a PDF processing toolkit for Python. It provides tools for PDF validation, text extraction, directory processing, merging (with custom separator pages), and searching, all built to streamline PDF automation workflows.
Features
- PDF Validation: Quickly verify if a file is a valid PDF.
- Text Extraction: Extract text from PDFs using either PyMuPDF or PyPDF.
- Directory Processing: Process entire directories of PDFs for text extraction.
- PDF Merging: Merge multiple PDF files into one, automatically inserting a custom separator page between documents.
  - The separator page displays the title (derived from the filename) with underscores and hyphens removed.
  - Supports both portrait and landscape separator pages (ideal for lecture slides).
- PDF Searching: Recursively search for PDFs in a directory based on filename patterns (e.g., numeric float prefixes).
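To illustrate what a quick validity check can look like, here is a minimal sketch that tests for the `%PDF-` header signature using only the standard library. It is not PDFlex's own implementation: the function name `looks_like_pdf` and the 1024-byte scan window are assumptions made for this example, and a header check alone does not prove the rest of the file is well-formed.

```python
from __future__ import annotations

from pathlib import Path

# Every PDF file begins with a header of the form "%PDF-<version>".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path: str | Path) -> bool:
    """Return True if the file appears to start with a PDF header.

    Hypothetical helper for illustration; not part of the PDFlex API.
    """
    try:
        with open(path, "rb") as fh:
            # Many readers tolerate a little leading junk before the header,
            # so scan a small window (an assumption: the first 1024 bytes).
            head = fh.read(1024)
    except OSError:
        # Missing or unreadable files are treated as "not a PDF".
        return False
    return PDF_MAGIC in head
```

A check like this is cheap enough to run on every file in a directory before attempting text extraction or merging.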
Quick Start
Installation
PDFlex is available on PyPI. To install using pip:
pip install -U pdflex
Alternatively, install in an isolated environment with pipx:
pipx install pdflex
For the fastest installation using uv:
uv tool install pdflex
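After a pip install, one way to confirm the package is visible to the current interpreter is to query its installed version through the standard library. This is a small sketch; the helper name `installed_version` is made up for this example. Note that pipx and uv tool installs live in their own isolated environments, so this check only reflects the interpreter that runs it.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pdflex is installed here, otherwise None.
print(installed_version("pdflex"))
```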
Usage
Command-Line Interface (CLI)
PDFlex provides a convenient CLI for merging and searching PDFs. The CLI supports two primary commands: merge and search.
Merge Command
Merge multiple PDF files into a single document while automatically inserting a separator page before each document.
Usage:
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf
Add the --landscape flag to create separator pages in landscape orientation:
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf --landscape
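The separator page title is described as being derived from the filename with underscores and hyphens removed. A rough sketch of that kind of transformation is below. The function name `separator_title` is hypothetical, and replacing the characters with spaces (rather than deleting them outright) is an assumption; PDFlex's actual rule may differ.

```python
from pathlib import Path


def separator_title(pdf_path: str) -> str:
    """Derive a human-readable title from a PDF filename.

    Illustrative only; not the PDFlex implementation.
    """
    # Drop the directory and the ".pdf" extension.
    stem = Path(pdf_path).stem
    # Assumption: underscores and hyphens become spaces, and any runs of
    # whitespace are collapsed into a single space.
    return " ".join(stem.replace("_", " ").replace("-", " ").split())


print(separator_title("/path/to/linear_algebra-notes.pdf"))
```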
Search and Merge Command
Search for PDF files in a directory based on filename filters (or search for lecture slides with numeric float prefixes) and merge them into one PDF.
Usage:
- General Search:
pdflex search /path/to/search -o merged_output.pdf --prefix "Chapter" --suffix ".pdf"
- Lecture Slides Merge: Merges all PDFs whose filenames start with a numeric float prefix like 1.2_, 3.2_, etc., in sorted order. Separator pages will be in landscape orientation.
pdflex search /path/to/algorithms-and-computation -o merged_lectures.pdf --lecture
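To make the numeric float prefix matching concrete, the following sketch finds files named like `1.2_Intro.pdf` under a directory and orders them by the numeric prefix. It is an illustration, not the code behind `--lecture`: the regular expression, the function name `find_lecture_pdfs`, and the choice to sort by float value are all assumptions.

```python
import re
from pathlib import Path

# Assumed pattern: digits, a dot, digits, then an underscore (e.g. "1.2_").
NUMERIC_PREFIX = re.compile(r"^(\d+\.\d+)_")


def find_lecture_pdfs(root):
    """Recursively collect PDFs with a numeric float prefix, in sorted order."""
    matches = []
    for path in Path(root).rglob("*.pdf"):
        m = NUMERIC_PREFIX.match(path.name)
        if m:
            # Keep the filename as a tie-breaker for equal prefixes.
            matches.append((float(m.group(1)), path.name, path))
    matches.sort(key=lambda item: item[:2])
    return [path for _, _, path in matches]
```

Sorting by float value means a prefix such as `1.10_` sorts as 1.1, ahead of `1.2_`. If lecture numbers are meant as (chapter, section) pairs, comparing the two integer parts separately would be the better choice.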
Python API Usage
You can also use PDFlex directly from your Python code. Below are examples for some common tasks.
Merging PDFs with Separator Pages
from pdflex.merge import merge_pdfs
# List of PDF file paths to merge
pdf_files = [
"/path/to/document1.pdf",
"/path/to/document2.pdf"
]
# Merge files, using landscape separator pages (ideal for lecture slides)
merge_pdfs(pdf_files, output_path="merged_output.pdf", landscape=True)
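Before handing a list of paths to a merge call, it can help to fail early if any input is missing. The sketch below is a hypothetical pre-check written with the standard library; `check_inputs` is not part of PDFlex, and how `merge_pdfs` itself reacts to missing files is not documented here.

```python
from pathlib import Path


def check_inputs(paths):
    """Return the paths as strings, raising if any input file is missing."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("Missing input PDFs: " + ", ".join(missing))
    return [str(p) for p in paths]


# Possible usage with the merge call shown above (requires pdflex):
# from pdflex.merge import merge_pdfs
# merge_pdfs(check_inputs(pdf_files), output_path="merged_output.pdf", landscape=True)
```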
Searching for PDFs by Filename
from pdflex.search import search_pdfs, search_numeric_prefixed_pdfs
# General search: Find PDFs that start with a prefix and/or end with a suffix
pdf_list = search_pdfs("/path/to/search", prefix="Chapter", suffix=".pdf")
print("Found PDFs:", pdf_list)
# Lecture slides: Find PDFs with numeric float prefixes (e.g., "1.2_Intro.pdf")
lecture_slides = search_numeric_prefixed_pdfs("/path/to/algorithms-and-computation")
print("Found lecture slides:", lecture_slides)
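For readers curious how a prefix and suffix filename filter can work, here is a minimal standard-library sketch of a recursive search. It is not the implementation of `search_pdfs`; the name `find_by_name`, its defaults, and the sorted return order are assumptions for illustration.

```python
from pathlib import Path


def find_by_name(root, prefix="", suffix=".pdf"):
    """Recursively find files whose names start with prefix and end with suffix."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.name.startswith(prefix) and p.name.endswith(suffix)
    )


# Example: all files under the current directory named "Chapter*.pdf".
for path in find_by_name(".", prefix="Chapter", suffix=".pdf"):
    print(path)
```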
Contributing
Contributions are welcome! Whether it's bug reports, feature requests, or code contributions, please feel free to:
- Open an issue
- Submit a pull request
- Improve the documentation
- Share your ideas!
Acknowledgments
This project is built upon several open-source PDF projects, including PyMuPDF and PyPDF, which it uses for text extraction.
License
PDFlex is released under the MIT license.
Copyright (c) 2020 to present PDFlex and contributors.