A utility package for PDF processing, including splitting, merging, and page counting

HJIMI PDF Processor

HJIMI PDF Processor is a toolkit for splitting, merging, and inspecting PDF files through a single PDFProcessor class.

Main Features

  1. PDF File Splitting

    • Split by file size
    • Split by page count
    • Split by bookmarks (supports first-level bookmarks)
  2. PDF File Merging

    • Support merging multiple PDF files
    • Maintain original page content and format
    • Error handling and logging
  3. PDF File Information

    • Get total page count
    • Filename normalization

Features

  • Easy to use: Provides intuitive static method interfaces
  • Flexible configuration: Supports custom split sizes and page counts
  • Error handling: Comprehensive exception handling and error messages
  • File safety: Automatic temporary file cleanup

Requirements

  • Python 3.8 or higher
  • PyPDF2 3.0.0 or higher

Installation

pip install hjimi-pdf-processor

Import

from hjimi_pdf_processor import PDFProcessor

Usage Examples

1. Get PDF Page Count

# Get single file page count (None is returned if the file cannot be read)
page_count = PDFProcessor.get_pdf_page_count("document.pdf")
if page_count is not None:
    print(f"PDF pages: {page_count}")

# Get multiple file page counts
pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for file in pdf_files:
    count = PDFProcessor.get_pdf_page_count(file)
    print(f"{file} pages: {count}")

2. Split PDF Files

# Split by page count
PDFProcessor.split_pdf_by_pages("large_doc.pdf", pages_per_split=10)

# Split by file size (in KB)
PDFProcessor.split_pdf_by_size("large_doc.pdf", max_size_kb=1024)

# Split by bookmarks
PDFProcessor.split_pdf_by_bookmarks("book.pdf")

3. Merge PDF Files

# Merge multiple PDF files
pdf_files = ["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"]
PDFProcessor.merge_pdfs(pdf_files, "merged_document.pdf")

Use Cases

  1. File Splitting

    • Split large PDF files for easier transmission
    • Split textbooks or documents by chapters (bookmarks)
    • Split documents by fixed page count for printing
  2. File Merging

    • Merge multiple scanned documents
    • Combine report or article sections
    • Integrate multiple PDF files into a single document
  3. File Processing

    • Batch retrieve PDF file information
    • Normalize PDF filenames
    • Control PDF file sizes

API Documentation

PDFProcessor Class Methods

1. sanitize_filename(filename: str) -> str

Cleans illegal characters from filenames.

  • Parameters:
    • filename: Original filename
  • Returns: Cleaned legal filename
  • Usage: Handles filenames containing special characters, replacing illegal characters with underscores
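
The replacement rule described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the name sanitize_filename_sketch and the set of characters treated as illegal (those rejected by Windows, plus control characters) are assumptions.

```python
import re

# Assumption: the characters Windows rejects in filenames, plus control
# characters. The package's real character set may differ.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_filename_sketch(filename: str) -> str:
    """Replace each illegal character with an underscore."""
    return _ILLEGAL_CHARS.sub("_", filename)


print(sanitize_filename_sketch("Chapter 1: Intro/Overview.pdf"))
```

Replacing rather than deleting characters keeps the cleaned name the same length as the original, which makes it easier to match output files back to their bookmark titles.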

2. get_pdf_page_count(file_path: str) -> int

Gets the total page count of a PDF file.

  • Parameters:
    • file_path: PDF file path
  • Returns: Total page count, or None if the file cannot be read (so callers should treat the result as Optional[int])
  • Exception Handling: Catches and prints file reading errors
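
Because the method reports failure by returning None rather than raising, batch code needs to check each result. The helper below is a hypothetical sketch of one way to do that; total_pages is not part of the package, and the counting function is passed in as a parameter so the helper works with any callable that has the same contract.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def total_pages(
    paths: Iterable[str],
    count_pages: Callable[[str], Optional[int]],
) -> Tuple[int, List[str]]:
    """Sum page counts and collect the paths whose count came back as None."""
    total = 0
    failed: List[str] = []
    for path in paths:
        count = count_pages(path)
        if count is None:
            failed.append(path)  # unreadable file: record it, keep going
        else:
            total += count
    return total, failed
```

With the package installed, it would presumably be called as total_pages(pdf_files, PDFProcessor.get_pdf_page_count).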

3. split_pdf_by_size(input_file: str, max_size_kb: int) -> None

Splits a PDF file into parts, each no larger than max_size_kb.

  • Parameters:
    • input_file: Input PDF file path
    • max_size_kb: Maximum size for each split file (KB)
  • Output Format: original_filename_part_number.pdf
  • Features: Auto-cleans temporary files, displays real-time progress
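
The source does not say how the size-based split decides where to cut. One plausible approach is greedy packing of consecutive pages, sketched below under the simplifying assumption that each page has a known, additive size estimate. Real PDF sizes are not strictly additive because pages can share fonts and images, and group_pages_by_size is a hypothetical name.

```python
from typing import List


def group_pages_by_size(page_sizes_kb: List[float], max_size_kb: float) -> List[List[int]]:
    """Greedily pack consecutive page indices into groups within max_size_kb."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0.0
    for index, size in enumerate(page_sizes_kb):
        # Close the current group when adding this page would exceed the limit.
        if current and current_size + size > max_size_kb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

A page that is larger than the limit on its own still gets a group to itself in this sketch, so an output part can exceed max_size_kb when a single page does.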

4. split_pdf_by_pages(input_file: str, pages_per_split: int) -> None

Splits a PDF file into parts of up to pages_per_split pages each.

  • Parameters:
    • input_file: Input PDF file path
    • pages_per_split: Pages per split file
  • Output Format: original_filename_part_number.pdf
  • Features: Shows split progress and page ranges
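
The page ranges produced by a fixed-size split follow from simple arithmetic. The sketch below computes them as 1-based inclusive pairs; page_ranges is a hypothetical helper, and whether the package reports ranges in exactly this form is an assumption.

```python
from typing import List, Tuple


def page_ranges(total_pages: int, pages_per_split: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) pairs for each part."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_split)
    ]


print(page_ranges(25, 10))  # [(1, 10), (11, 20), (21, 25)]
```

The last part is shorter whenever the total is not a multiple of pages_per_split.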

5. split_pdf_by_bookmarks(input_file: str) -> None

Splits a PDF file at its first-level bookmarks.

  • Parameters:
    • input_file: Input PDF file path
  • Output Format: original_filename_part_number_bookmark_name.pdf
  • Limitations: Only supports first-level bookmark splitting
  • Features: Automatically handles illegal characters in bookmark names
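
Splitting at bookmarks amounts to turning each bookmark's start page into a page range that ends just before the next bookmark. The sketch below shows that step on plain data. It assumes the bookmarks are already extracted as (title, zero-based start page) pairs sorted by page, and bookmark_ranges is a hypothetical name.

```python
from typing import List, Tuple


def bookmark_ranges(
    bookmarks: List[Tuple[str, int]], total_pages: int
) -> List[Tuple[str, int, int]]:
    """Map (title, start) pairs to (title, start, end) with inclusive ends."""
    ranges = []
    for i, (title, start) in enumerate(bookmarks):
        # A section ends right before the next bookmark, or on the last page.
        if i + 1 < len(bookmarks):
            end = bookmarks[i + 1][1] - 1
        else:
            end = total_pages - 1
        ranges.append((title, start, end))
    return ranges
```

In this sketch, pages that come before the first bookmark belong to no section; how the package treats them is not stated in the source.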

6. merge_pdfs(pdf_files: List[str], output_path: str) -> None

Merges multiple PDF files.

  • Parameters:
    • pdf_files: List of PDF file paths
    • output_path: Output file path
  • Features:
    • Maintains original page content and format
    • A failure on one input file does not abort the overall merge
    • Detailed error logging
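
Since a failing input is skipped rather than aborting the merge, a missing file can go unnoticed in the output. A caller may want to check the inputs first. The helper below is a hypothetical pre-flight check using only the standard library; partition_inputs is not part of the package.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def partition_inputs(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (present, missing), preserving the input order."""
    present: List[str] = []
    missing: List[str] = []
    for path in paths:
        if Path(path).is_file():
            present.append(path)
        else:
            missing.append(path)
    return present, missing
```

A caller could then report the missing list and pass only the present list to merge_pdfs.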

Notes

  1. File Operations

    • Ensure sufficient disk space
    • Keep original file backups
    • Be aware of filename conflicts
  2. Performance Considerations

    • Large file processing may take time
    • Test with small files first
    • Monitor memory usage
  3. Limitations

    • Does not support encrypted PDF files
    • Only supports first-level bookmark splitting
    • Some special PDF formats may not be compatible
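
For the filename-conflict note above, one way to avoid overwriting an existing output is to pick an unused name before calling a split or merge method. The sketch below is a hypothetical helper, not package functionality, and the _1, _2 suffix scheme is an assumption.

```python
from pathlib import Path


def next_free_path(path: str) -> str:
    """Return path if unused, else add _1, _2, ... before the extension."""
    original = Path(path)
    candidate = original
    counter = 1
    while candidate.exists():
        candidate = original.with_name(f"{original.stem}_{counter}{original.suffix}")
        counter += 1
    return str(candidate)
```

For example, if merged_document.pdf already exists, the helper returns a path ending in merged_document_1.pdf, which can then be used as the output_path argument.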

License

MIT License

Contributing

We welcome issue reports and feature suggestions. To contribute code:

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Ensure all tests pass
  5. Submit a pull request

Changelog

v0.1.2

  • Fixed package structure and import path
  • Renamed source directory to match package name
  • Updated project configuration for correct package naming

v0.1.1

  • Fixed package import statement in documentation
  • Updated package name to hjimi_pdf_processor
  • Improved documentation structure

v0.1.0

  • Initial release
  • Implemented basic PDF splitting and merging functions
  • Added file information retrieval features
