Data Object Layer for PDF data

pdfdol

To install: pip install pdfdol

Documentation

Examples

Pdf "Stores"

Get a dict-like object to list and read the pdfs of a folder, as text:

>>> from pdfdol import PdfFilesReader
>>> from pdfdol.tests import get_test_pdf_folder
>>> folder_path = get_test_pdf_folder()
>>> pdfs = PdfFilesReader(folder_path)
>>> sorted(pdfs)
['sample_pdf_1', 'sample_pdf_2']
>>> assert pdfs['sample_pdf_2'] == [
...     'Page 1\nThis is a sample text for testing Python PDF tools.'
... ]

Note that the values of a PdfFilesReader are lists of pages, one string per page. If you need a single string per PDF (i.e. all the pages joined together), you can add a decoder like so:

from dol import add_decoder
page_separator = '---------------------'
pdfs = add_decoder(pdfs, decoder=page_separator.join)
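To make the effect of that decoder concrete, here is a minimal sketch in plain Python with no pdfdol or dol imports. The `pages` list is made up for illustration and stands in for one value of a PdfFilesReader.

```python
# Hypothetical page texts standing in for one PdfFilesReader value
pages = ["Page 1\nFirst page text.", "Page 2\nSecond page text."]

page_separator = '---------------------'

# The decoder is simply the separator's bound join method:
# it takes a list of page strings and returns one string
decode = page_separator.join
text = decode(pages)

print(text)
```

Note that `str.join` adds no newlines around the separator. If you want the separator on its own line, something like `'\n---------------------\n'` is probably a better choice.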

If you need this at the level of the class, just do this:

from dol import add_decoder
page_separator = '---------------------'
FilesReader = add_decoder(PdfFilesReader, decoder=page_separator.join)
# and then
pdfs = FilesReader(folder_path)
# ...

If you need to concatenate a bunch of pdfs together, you can do so in many ways. Here's one:

from dol import Files
from pdfdol import concat_pdfs

s = Files('~/Downloads/cosmograph_documentation_pdfs/')
concat_pdfs(s, key_order=sorted)
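The call above passes `sorted` as `key_order`, i.e. a callable that takes the keys and returns them in the desired order. Assuming any callable with that shape is accepted (an inference from the call above, not something the text confirms), a natural-order sort can be useful when file names contain numbers. The helper `natural_sorted` and the file names below are hypothetical.

```python
import re

def natural_sorted(keys):
    """Sort keys so that embedded numbers compare numerically (2 before 10)."""
    def sort_key(k):
        # Split into alternating text and digit chunks; digit chunks become ints
        return [
            int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', k)
        ]
    return sorted(keys, key=sort_key)

# Hypothetical file names
keys = ['chapter_10.pdf', 'chapter_2.pdf', 'chapter_1.pdf']

print(sorted(keys))          # plain lexicographic order
print(natural_sorted(keys))  # numeric-aware order
```

Under that assumption, it would be passed as `key_order=natural_sorted` in place of `key_order=sorted`.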

Converting ebooks and documents to PDF (optional Calibre integration)

pdfdol natively converts images, HTML, and Markdown to PDF. For additional formats -- EPUB, MOBI, DOCX, ODT, DJVU, RTF, and many more -- install Calibre, which provides the ebook-convert command-line tool.

pdfdol does not depend on Calibre; it auto-detects the tool at runtime and uses it only for formats that have no built-in converter.
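As a rough illustration of what runtime detection of an optional command-line tool can look like, here is a sketch using only the standard library. It is not pdfdol's actual implementation; the function name `locate_ebook_convert` and the macOS fallback path are assumptions made for this example.

```python
import os
import shutil

# Assumed location of the tool inside the Calibre app bundle on macOS
MACOS_FALLBACK = "/Applications/calibre.app/Contents/MacOS/ebook-convert"

def locate_ebook_convert(name="ebook-convert", extra_paths=(MACOS_FALLBACK,)):
    """Return the path to the executable, or None if it cannot be found."""
    # First look on PATH, as a shell would
    found = shutil.which(name)
    if found:
        return found
    # Then try a few well-known install locations
    for path in extra_paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None

path = locate_ebook_convert()
print("ebook-convert found at:" if path else "ebook-convert not found", path or "")
```

Because the lookup happens at call time, installing Calibre later requires no reinstall of the Python package.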

Installing ebook-convert:

Platform                  Command
macOS (Homebrew)          brew install --cask calibre
Debian / Ubuntu           sudo apt install calibre
Fedora / RHEL             sudo dnf install calibre
Linux servers (headless)  sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.sh | sudo sh /dev/stdin

The Linux CLI installer is self-contained (no GUI/X11 needed) and recommended for servers. See https://calibre-ebook.com/download for all options.

from pdfdol import ebook_convert_to_pdf, find_ebook_convert

# Check whether Calibre is available
if find_ebook_convert():
    pdf_bytes = ebook_convert_to_pdf("book.epub")
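For readers curious about what such a conversion involves, the sketch below calls the `ebook-convert` command line directly through `subprocess`. It is a hypothetical stand-in, not pdfdol's code: the function name `convert_with_calibre` and its `executable` parameter are invented for this example. It relies only on the tool's basic `ebook-convert INPUT OUTPUT` usage, where the output format is inferred from the output file's extension.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def convert_with_calibre(src_path, executable="ebook-convert"):
    """Convert a document to PDF with Calibre's CLI and return the PDF bytes."""
    exe = shutil.which(executable)
    if exe is None:
        raise RuntimeError(f"{executable!r} not found on PATH; is Calibre installed?")
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "output.pdf"
        # The .pdf extension of the output file selects the target format
        subprocess.run(
            [exe, str(src_path), str(out_path)],
            check=True,           # raise CalledProcessError on failure
            capture_output=True,  # keep the tool's console output out of the way
        )
        # Read the result before the temporary directory is removed
        return out_path.read_bytes()
```

Writing to a temporary directory and returning bytes keeps the caller free to decide where, or whether, to save the result.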

You can also go through the usual get_pdf entry point -- it will automatically route to ebook-convert when it recognises the file extension:

from pdfdol import get_pdf
pdf_bytes = get_pdf("book.epub")                     # returns PDF bytes
get_pdf("book.epub", egress="book.pdf")              # saves to file

Custom converters

pdfdol maintains a format converter registry that maps file extensions to converter functions. You can register your own:

from pdfdol import register_format_converter, supported_extensions

def my_custom_converter(source):
    """source is a filepath (str) or raw bytes; must return PDF bytes."""
    ...

register_format_converter('.xyz', my_custom_converter)

# See everything that's currently supported
print(supported_extensions())
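The idea of a registry that maps file extensions to converter functions can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not pdfdol's internals; the names `_converters`, `register`, `extensions`, `convert`, and `fake_xyz_converter` are all hypothetical.

```python
import os

# Maps a lowercase extension (with leading dot) to a converter function
_converters = {}

def register(ext, func):
    """Register func as the converter for files with the given extension."""
    ext = ext.lower()
    if not ext.startswith('.'):
        ext = '.' + ext
    _converters[ext] = func

def extensions():
    """List the registered extensions in sorted order."""
    return sorted(_converters)

def convert(path):
    """Look up the converter for path's extension and apply it."""
    ext = os.path.splitext(path)[1].lower()
    try:
        converter = _converters[ext]
    except KeyError:
        raise ValueError(f"No converter registered for {ext!r}") from None
    return converter(path)

def fake_xyz_converter(source):
    # Placeholder: a real converter would return actual PDF bytes
    return b"%PDF-1.4 placeholder for " + os.path.basename(source).encode()

register('XYZ', fake_xyz_converter)
print(extensions())
print(convert('notes/report.XYZ'))
```

Normalizing extensions on both registration and lookup means `.XYZ`, `xyz`, and `.xyz` all resolve to the same converter.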

Get pdf from various sources

Example with a URL

from pdfdol import get_pdf

pdf_data = get_pdf("https://pypi.org", src_kind="url")
print("Got PDF data of length:", len(pdf_data))

Example with HTML content

html_content = "<html><body><h1>Hello, PDF!</h1></body></html>"
pdf_data = get_pdf(html_content, src_kind="html")
print("Got PDF data of length:", len(pdf_data))

Example saving to file

filepath = get_pdf("https://pypi.org", egress="output.pdf", src_kind="url")
print("PDF saved to:", filepath)
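Whichever source you start from, a cheap sanity check on the result is to confirm that the bytes begin with the PDF header `%PDF-`. The helper `looks_like_pdf` below is a hypothetical addition for illustration, not part of pdfdol, and the sample byte strings are made up.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if data starts with the PDF file header."""
    return data.startswith(b"%PDF-")

# Made-up samples: a PDF-like header and an HTML error page
good = b"%PDF-1.7\n%...rest of the file..."
bad = b"<html><body>404 Not Found</body></html>"

print(looks_like_pdf(good))
print(looks_like_pdf(bad))
```

This only inspects the header, so it catches cases like an HTML error page being returned instead of a PDF, but it does not validate the rest of the file.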
