pdfdol

Data Object Layer for PDF data

To install: pip install pdfdol

Documentation

Examples

Pdf "Stores"

Get a dict-like object to list and read the pdfs of a folder, as text:

>>> from pdfdol import PdfFilesReader
>>> from pdfdol.tests import get_test_pdf_folder
>>> folder_path = get_test_pdf_folder()
>>> pdfs = PdfFilesReader(folder_path)
>>> sorted(pdfs)
['sample_pdf_1', 'sample_pdf_2']
>>> assert pdfs['sample_pdf_2'] == [
...     'Page 1\nThis is a sample text for testing Python PDF tools.'
... ]
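To make the "dict-like" idea concrete, here is a minimal sketch of a read-only folder store built only on the standard library. It is not how PdfFilesReader is implemented: the class name FolderTextReader is hypothetical, and plain .txt files stand in for PDFs so that the example stays self-contained.

```python
import tempfile
from collections.abc import Mapping
from pathlib import Path


class FolderTextReader(Mapping):
    """Hypothetical read-only store: keys are file stems, values are file text."""

    def __init__(self, folder, suffix=".txt"):
        self.folder = Path(folder)
        self.suffix = suffix

    def __iter__(self):
        # Only file names are listed here; contents are read on key access.
        return (p.stem for p in self.folder.glob(f"*{self.suffix}"))

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, key):
        path = self.folder / f"{key}{self.suffix}"
        if not path.is_file():
            raise KeyError(key)
        return path.read_text()


with tempfile.TemporaryDirectory() as folder:
    Path(folder, "b.txt").write_text("second")
    Path(folder, "a.txt").write_text("first")
    store = FolderTextReader(folder)
    listed_keys = sorted(store)
    first_value = store["a"]
```

Because the class derives from Mapping, methods such as keys(), items(), and the `in` operator come for free once the three methods above are defined.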

Note that the values of a PdfFilesReader are lists of pages. If you need strings (i.e. all the pages joined together), you can add a decoder like so:

from dol import add_decoder
page_separator = '---------------------'
pdfs = add_decoder(pdfs, decoder=page_separator.join)

If you need this at the level of the class, just do this:

from dol import add_decoder
page_separator = '---------------------'
FilesReader = add_decoder(PdfFilesReader, decoder=page_separator.join)
# and then
pdfs = FilesReader(folder_path)
# ...
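For readers unfamiliar with the decoder idea, the following is a rough sketch of what wrapping a store with a value decoder amounts to. It does not reproduce dol's add_decoder: the class DecodedValues is a hypothetical stand-in, and a plain dict plays the role of the PdfFilesReader.

```python
from collections.abc import Mapping


class DecodedValues(Mapping):
    """Hypothetical wrapper: applies `decoder` to every value read from `store`."""

    def __init__(self, store, decoder):
        self.store = store
        self.decoder = decoder

    def __getitem__(self, key):
        # The underlying value is fetched first, then transformed on the way out.
        return self.decoder(self.store[key])

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)


# A plain dict stands in for a PdfFilesReader: each value is a list of pages.
pages_by_name = {"report": ["Page 1\nIntro", "Page 2\nDetails"]}
separator = "\n---\n"
texts = DecodedValues(pages_by_name, decoder=separator.join)
joined = texts["report"]
```

The keys are untouched; only the values change shape, from a list of page strings to a single string.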

If you need to concatenate a bunch of pdfs together, you can do so in many ways. Here's one:

from dol import Files
from pdfdol import concat_pdfs

s = Files('~/Downloads/cosmograph_documentation_pdfs/')
concat_pdfs(s, key_order=sorted)
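The key_order=sorted argument suggests that the keys are passed through a callable that decides the concatenation order. The sketch below illustrates only that ordering logic, on plain byte strings. concat_values is a hypothetical helper, not part of pdfdol, and real PDFs cannot be merged by simply joining their bytes.

```python
def concat_values(store, key_order=None):
    """Hypothetical helper: join a mapping's byte values in a chosen key order."""
    keys = list(store)
    if key_order is not None:
        # key_order is any callable that takes the keys and returns them ordered.
        keys = key_order(keys)
    return b"".join(store[k] for k in keys)


chunks = {"part_2": b"world", "part_1": b"hello "}
in_insertion_order = concat_values(chunks)
in_sorted_order = concat_values(chunks, key_order=sorted)
```

Any callable with the same shape works here, for example one that reverses the keys or filters out some of them.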

Converting ebooks and documents to PDF (optional Calibre integration)

pdfdol natively converts images, HTML, and Markdown to PDF. For additional formats -- EPUB, MOBI, DOCX, ODT, DJVU, RTF, and many more -- install Calibre, which provides the ebook-convert command-line tool.

pdfdol does not depend on Calibre; it auto-detects the tool at runtime and uses it only for formats that have no built-in converter.

from pdfdol import ebook_convert_to_pdf, find_ebook_convert

# Check whether Calibre is available
if find_ebook_convert():
    pdf_bytes = ebook_convert_to_pdf("book.epub")
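The runtime detection described above can be approximated with the standard library alone. This is a sketch under the assumption that the tool is on PATH; find_executable is a hypothetical name, and pdfdol's own find_ebook_convert may look in additional locations.

```python
import shutil


def find_executable(name="ebook-convert"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


tool_path = find_executable()
if tool_path is None:
    status = "ebook-convert not found; Calibre-only formats are unavailable"
else:
    status = f"ebook-convert found at {tool_path}"
```

Checking at call time rather than at import time keeps the dependency optional: the package can be imported on machines without Calibre.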

You can also go through the usual get_pdf entry point -- it will automatically route to ebook-convert when it recognises the file extension:

from pdfdol import get_pdf
pdf_bytes = get_pdf("book.epub")                     # returns PDF bytes
get_pdf("book.epub", egress="book.pdf")              # saves to file

Custom converters

pdfdol maintains a format converter registry that maps file extensions to converter functions. You can register your own:

from pdfdol import register_format_converter, supported_extensions

def my_custom_converter(source):
    """source is a filepath (str) or raw bytes; must return PDF bytes."""
    ...

register_format_converter('.xyz', my_custom_converter)

# See everything that's currently supported
print(supported_extensions())
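A registry of this kind is essentially a dictionary keyed by file extension. The following sketch shows the general pattern, not pdfdol's internals: every name in it (register_converter, registered_extensions, convert, fake_xyz_converter) is hypothetical, and the lower-casing of extensions is an assumption.

```python
from pathlib import Path

# Hypothetical registry: lower-cased file extension -> converter function.
_converters = {}


def register_converter(extension, converter):
    """Associate an extension such as '.xyz' with a converter callable."""
    _converters[extension.lower()] = converter


def registered_extensions():
    """Return the registered extensions in sorted order."""
    return sorted(_converters)


def convert(filepath):
    """Dispatch to the converter registered for the file's extension."""
    extension = Path(filepath).suffix.lower()
    try:
        converter = _converters[extension]
    except KeyError:
        raise ValueError(f"No converter registered for {extension!r}") from None
    return converter(filepath)


def fake_xyz_converter(source):
    # Placeholder only: a real converter must return actual PDF bytes.
    return b"%PDF-1.4 placeholder for " + str(source).encode()


register_converter(".xyz", fake_xyz_converter)
result = convert("notes.XYZ")
```

Registering a second function for the same extension simply replaces the first in this sketch.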

Get pdf from various sources

Example with a URL

from pdfdol import get_pdf

pdf_data = get_pdf("https://pypi.org", src_kind="url")
print("Got PDF data of length:", len(pdf_data))

Example with HTML content

html_content = "<html><body><h1>Hello, PDF!</h1></body></html>"
pdf_data = get_pdf(html_content, src_kind="html")
print("Got PDF data of length:", len(pdf_data))
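Beyond printing the length, a quick way to sanity-check the returned bytes is to look at the file header, since PDF files start with the marker %PDF-. The helper below is a hypothetical convenience, not something pdfdol provides, and it does not validate the rest of the file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files start with the marker b'%PDF-'."""
    return data.startswith(b"%PDF-")


pdf_check = looks_like_pdf(b"%PDF-1.7\n% rest of the file")
html_check = looks_like_pdf(b"<html><body>Not a PDF</body></html>")
```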

Example saving to file

filepath = get_pdf("https://pypi.org", egress="output.pdf", src_kind="url")
print("PDF saved to:", filepath)
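The examples above suggest that leaving egress unset returns the PDF bytes, while passing a path saves the file and returns that path. A minimal sketch of that behavior follows. deliver is a hypothetical helper, and whether get_pdf accepts other kinds of egress values is not covered here.

```python
import tempfile
from pathlib import Path


def deliver(pdf_bytes, egress=None):
    """Hypothetical helper: return bytes, or save them and return the path."""
    if egress is None:
        return pdf_bytes
    Path(egress).write_bytes(pdf_bytes)
    return str(egress)


fake_pdf = b"%PDF-1.4 placeholder"
with tempfile.TemporaryDirectory() as folder:
    target = str(Path(folder) / "output.pdf")
    returned_bytes = deliver(fake_pdf)
    returned_path = deliver(fake_pdf, egress=target)
    saved_bytes = Path(returned_path).read_bytes()
```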
