Data Object Layer for PDF data

pdfdol

To install: pip install pdfdol

Documentation

Examples

Pdf "Stores"

Get a dict-like object to list and read the pdfs of a folder, as text:

>>> from pdfdol import PdfFilesReader
>>> from pdfdol.tests import get_test_pdf_folder
>>> folder_path = get_test_pdf_folder()
>>> pdfs = PdfFilesReader(folder_path)
>>> sorted(pdfs)
['sample_pdf_1', 'sample_pdf_2']
>>> assert pdfs['sample_pdf_2'] == [
...     'Page 1\nThis is a sample text for testing Python PDF tools.'
... ]

Note that the values of a PdfFilesReader are lists of pages. If you need a single string per PDF (that is, all of its pages joined together), you can add a decoder like so:

from dol import add_decoder
page_separator = '---------------------'
pdfs = add_decoder(pdfs, decoder=page_separator.join)
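To make the effect of the decoder concrete, here is a minimal standard-library sketch. It does not use dol or pdfdol: `DecodedView` is a hypothetical stand-in for a store wrapped with a decoder, the plain dict stands in for a PdfFilesReader, and the page texts are made up for illustration.

```python
from collections.abc import Mapping


class DecodedView(Mapping):
    """Read-only view that applies `decoder` to each value when it is read.

    Hypothetical stand-in for a decoded store; not how dol implements add_decoder.
    """

    def __init__(self, store, decoder):
        self._store = store
        self._decoder = decoder

    def __getitem__(self, key):
        return self._decoder(self._store[key])

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


page_separator = '---------------------'

# A plain dict standing in for a PdfFilesReader: each value is a list of pages
pages_by_name = {
    'sample_pdf_2': ['Page 1\nFirst page text.', 'Page 2\nSecond page text.'],
}

texts = DecodedView(pages_by_name, decoder=page_separator.join)
print(texts['sample_pdf_2'])
```

The keys are unchanged; only the values differ, going from a list of page strings to one string with the separator between pages.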

If you need the decoder at the level of the class, so that every instance returns strings, wrap the class instead:

from dol import add_decoder
page_separator = '---------------------'
FilesReader = add_decoder(PdfFilesReader, decoder=page_separator.join)
# and then
pdfs = FilesReader(folder_path)
# ...

If you need to concatenate a bunch of pdfs together, you can do so in many ways. Here's one:

from dol import Files
from pdfdol import concat_pdfs

s = Files('~/Downloads/cosmograph_documentation_pdfs/')
concat_pdfs(s, key_order=sorted)
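One thing to watch with `key_order=sorted` is that plain sorting is lexicographic, so a file named `chapter_10.pdf` sorts before `chapter_2.pdf`. The sketch below defines a hypothetical `natural_sorted` helper that orders embedded numbers numerically. Passing it as `key_order` assumes that `concat_pdfs` accepts any callable that takes the keys and returns them in the desired order, as the use of `sorted` suggests; that is an assumption, not something confirmed here.

```python
import re


def natural_sorted(keys):
    """Sort keys so that embedded numbers compare numerically (hypothetical helper)."""

    def sort_key(key):
        # re.split with a capturing group alternates text and digit runs:
        # even positions are text, odd positions are the digits that were matched
        parts = re.split(r'(\d+)', key)
        return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]

    return sorted(keys, key=sort_key)


keys = ['chapter_10.pdf', 'chapter_2.pdf', 'chapter_1.pdf']
print(sorted(keys))          # lexicographic order
print(natural_sorted(keys))  # numeric-aware order
```

Under that assumption, the call would read `concat_pdfs(s, key_order=natural_sorted)`.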

Converting ebooks and documents to PDF (optional Calibre integration)

pdfdol natively converts images, HTML, and Markdown to PDF. For additional formats -- EPUB, MOBI, DOCX, ODT, DJVU, RTF, and many more -- install Calibre, which provides the ebook-convert command-line tool.

pdfdol does not depend on Calibre; it auto-detects the tool at runtime and uses it only for formats that have no built-in converter.
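Runtime detection of an optional command-line tool can be sketched with the standard library alone. This is only an illustration of the idea, not pdfdol's actual detection logic; `locate_tool` is a hypothetical helper.

```python
import shutil


def locate_tool(name="ebook-convert"):
    """Return the full path of `name` if it is on PATH, else None (hypothetical helper)."""
    return shutil.which(name)


path = locate_tool()
if path is None:
    print("ebook-convert not found; only built-in converters would be available")
else:
    print("ebook-convert found at", path)
```

Because the lookup happens when the code runs rather than at install time, Calibre can be added or removed later without reinstalling anything.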

Installing ebook-convert:

Platform                   Command
macOS (Homebrew)           brew install --cask calibre
Debian / Ubuntu            sudo apt install calibre
Fedora / RHEL              sudo dnf install calibre
Linux servers (headless)   sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.sh | sudo sh /dev/stdin

The Linux CLI installer is self-contained (no GUI/X11 needed) and recommended for servers. See https://calibre-ebook.com/download for all options.

from pdfdol import ebook_convert_to_pdf, find_ebook_convert

# Check whether Calibre is available
if find_ebook_convert():
    pdf_bytes = ebook_convert_to_pdf("book.epub")
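For readers curious about what such a conversion involves, here is a rough sketch that calls the Calibre tool directly, using its basic `ebook-convert input_file output_file` form. It is not pdfdol's implementation; `build_command` and `convert_file_to_pdf_bytes` are hypothetical names, and the conversion function needs Calibre installed to run.

```python
import subprocess
import tempfile
from pathlib import Path


def build_command(src, dst, executable="ebook-convert"):
    """Return the argument list for: ebook-convert <input> <output>."""
    return [executable, str(src), str(dst)]


def convert_file_to_pdf_bytes(src, executable="ebook-convert"):
    """Convert `src` to PDF in a temporary folder and return the PDF bytes.

    Raises FileNotFoundError if the executable is missing and
    subprocess.CalledProcessError if the conversion itself fails.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The output format is chosen by the output file's extension
        dst = Path(tmp_dir) / (Path(src).stem + ".pdf")
        subprocess.run(
            build_command(src, dst, executable), check=True, capture_output=True
        )
        return dst.read_bytes()


print(build_command("book.epub", "book.pdf"))
```

Writing to a temporary folder and returning bytes keeps the caller free to decide where, or whether, the result is saved.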

You can also go through the usual get_pdf entry point -- it will automatically route to ebook-convert when it recognises the file extension:

from pdfdol import get_pdf
pdf_bytes = get_pdf("book.epub")                     # returns PDF bytes
get_pdf("book.epub", egress="book.pdf")              # saves to file

Custom converters

pdfdol maintains a format converter registry that maps file extensions to converter functions. You can register your own:

from pdfdol import register_format_converter, supported_extensions

def my_custom_converter(source):
    """source is a filepath (str) or raw bytes; must return PDF bytes."""
    ...

register_format_converter('.xyz', my_custom_converter)

# See everything that's currently supported
print(supported_extensions())
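The registry idea itself is simple enough to sketch without pdfdol. The names below (`converters`, `register`, `convert`, `placeholder_converter`) are hypothetical and only illustrate the pattern of mapping extensions to functions; pdfdol's internals may differ.

```python
from pathlib import Path

# Hypothetical registry: lower-cased file extension -> converter function
converters = {}


def register(extension, converter):
    """Associate a file extension (such as '.xyz') with a converter function."""
    converters[extension.lower()] = converter


def convert(path):
    """Look up the converter for the extension of `path` and call it."""
    extension = Path(path).suffix.lower()
    if extension not in converters:
        raise ValueError(f"No converter registered for {extension!r}")
    return converters[extension](path)


def placeholder_converter(source):
    # Placeholder output only: a real converter must return a valid PDF
    return b"%PDF-1.4\n% placeholder, not a real document\n"


register(".XYZ", placeholder_converter)
print(sorted(converters))
```

Normalising extensions to lower case on both registration and lookup means `notes.XYZ` and `notes.xyz` reach the same converter.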

Get pdf from various sources

Example with a URL

from pdfdol import get_pdf

pdf_data = get_pdf("https://pypi.org", src_kind="url")
print("Got PDF data of length:", len(pdf_data))

Example with HTML content

html_content = "<html><body><h1>Hello, PDF!</h1></body></html>"
pdf_data = get_pdf(html_content, src_kind="html")
print("Got PDF data of length:", len(pdf_data))

Example saving to file

filepath = get_pdf("https://pypi.org", egress="output.pdf", src_kind="url")
print("PDF saved to:", filepath)
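Whatever the source, a quick sanity check on the result can catch cases where something other than a PDF came back (an HTML error page, for example). PDF files begin with the marker `%PDF-`, so a minimal check looks like the sketch below; `looks_like_pdf` is a hypothetical helper, not part of pdfdol, and it does not validate the rest of the file.

```python
def looks_like_pdf(data):
    """Return True if `data` is bytes that start with the PDF header marker."""
    return isinstance(data, (bytes, bytearray)) and bytes(data[:5]) == b"%PDF-"


print(looks_like_pdf(b"%PDF-1.7\n..."))          # header present
print(looks_like_pdf(b"<html><body></body></html>"))  # not a PDF
```

It could be applied to the bytes returned by `get_pdf` before they are written or passed on.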
