
Parsing PDF files with PDFium


redstork


PDF parsing library based on PDFium.

Requirements

  • Python 3

Platform support:

  • Fairly recent Linux (Ubuntu 18.04 or later). Older systems are not supported.
  • macOS 10.6 or later
  • Windows support is in the works

Installation

pip install redstork

Features

  • Render a page, or an arbitrary rectangle of it, to an image at a configurable scale
  • Update document metadata
  • Update font encoding (for some PDF documents)
  • Save the document to a file
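
To make the "configurable scale" concrete, here is a minimal sketch of the arithmetic involved. It assumes that a scale of 1 maps one PDF point (1/72 inch) to one pixel; that mapping is an assumption, not something this description states, and redstork may round differently. `rendered_size` and `size_in_inches` are hypothetical helpers, not part of redstork. The box uses the (left, bottom, right, top) layout of a PDF MediaBox.

```python
import math

POINTS_PER_INCH = 72  # PDF user-space units are 1/72 inch by default


def rendered_size(box, scale=1.0):
    """Return (width_px, height_px) for a (left, bottom, right, top) box in points.

    Assumes scale=1 means one pixel per point (an assumption, see above).
    """
    left, bottom, right, top = box
    return math.ceil((right - left) * scale), math.ceil((top - bottom) * scale)


def size_in_inches(box):
    """Return (width_in, height_in) for a (left, bottom, right, top) box in points."""
    left, bottom, right, top = box
    return (right - left) / POINTS_PER_INCH, (top - bottom) / POINTS_PER_INCH


letter = (0.0, 0.0, 612.0, 792.0)  # a US Letter page
print(rendered_size(letter, scale=2))  # (1224, 1584)
print(size_in_inches(letter))          # (8.5, 11.0)
```

Under this assumption, doubling the scale quadruples the number of pixels, which is worth keeping in mind when rendering many pages.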

Quick start

Download a sample PDF file from here

from redstork import Document, PageObject, Glyph

doc = Document('sample.pdf')

print('Number of pages:', len(doc))
>> Number of pages: 15

print('MediaBox of the first page is:', doc[0].media_box)
>> MediaBox of the first page is: (0.0, 0.0, 612.0, 792.0)

print('Rotation of the first page is:', doc[0].rotation)
>> Rotation of the first page is: 0

print('Document title:', doc.meta['Title'])
>> Document title: Red Stork

print('First page has', len(doc[0]), 'objects')
>> First page has 4 objects

doc[0].render('page-0.ppm', scale=2)   # render page #1 as image

page = doc[0]
for o in page:
    if o.type == PageObject.OBJ_TYPE_TEXT:
        for code, _, _ in o:
            print(o.font[code], end='')
        print()
>> RedStork
>> Release0.0.1
>> Apr02,2020

for fid, font in doc.fonts.items():
    print(font.short_name, fid)
>> NimbusSanL-Bold (36, 0)
>> NimbusSanL-BoldItal (37, 0)

# let's generate an SVG file of the first letter on page 1
text_object = [o for o in page if o.type == PageObject.OBJ_TYPE_TEXT][0]  # first text object
charcode, _, _ = text_object[0]  # first character of the first text object

# use the text object's own font rather than the last font left over from the loop above
glyph = text_object.font.load_glyph(charcode)
path, delayed_c = [], []
for x, y, op, close in glyph:
    x, y = round(x, 3), round(y, 3)
    if op == Glyph.MOVETO:
        path.append(f'M {x} {y}')
    elif op == Glyph.LINETO:
        path.append(f'L {x} {y}')
    elif op == Glyph.CURVETO:
        delayed_c.append(f'{x} {y}')
        if len(delayed_c) == 3:
            path.append('C ' + ', '.join(delayed_c))
            delayed_c.clear()
    if close:
        path.append('Z')
path = ' '.join(path)
print('<svg><g fill="gray" transform="scale(100,-100)"><path d="' + path + '" /></g></svg>')
>> <svg><g fill="gray" transform="scale(100,-100)"><path d="M 0.291 0.289 L 0.463 0.289 C 0.52 0.289, ... L 0.318 0.414 Z" /></g></svg>
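
The glyph-to-SVG loop above can also be packaged as a reusable function. The sketch below is a hypothetical helper, not part of redstork. It takes any iterable of `(x, y, op, close)` tuples, uses stand-in opcode constants so it runs without the library, and emits a standalone SVG document with the `xmlns` attribute and a `viewBox`, which browsers need in order to display an SVG file on its own. It assumes the glyph fits in the unit square above the baseline; descenders would be clipped.

```python
# Stand-in opcodes for illustration only; with redstork installed, compare
# against Glyph.MOVETO, Glyph.LINETO and Glyph.CURVETO instead.
MOVETO, LINETO, CURVETO = 'moveto', 'lineto', 'curveto'


def segments_to_svg(segments, scale=100):
    """Build a standalone SVG document from (x, y, op, close) tuples."""
    path, curve_points = [], []
    for x, y, op, close in segments:
        x, y = round(x, 3), round(y, 3)
        if op == MOVETO:
            path.append(f'M {x} {y}')
        elif op == LINETO:
            path.append(f'L {x} {y}')
        elif op == CURVETO:
            # A cubic Bezier needs three points: two controls and the end point.
            curve_points.append(f'{x} {y}')
            if len(curve_points) == 3:
                path.append('C ' + ', '.join(curve_points))
                curve_points.clear()
        if close:
            path.append('Z')
    d = ' '.join(path)
    # Flip the y axis: glyph coordinates grow upwards, SVG coordinates downwards.
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 {-scale} {scale} {scale}">'
        f'<g fill="gray" transform="scale({scale},{-scale})">'
        f'<path d="{d}" /></g></svg>'
    )


# A unit square as synthetic input
square = [
    (0, 0, MOVETO, False),
    (1, 0, LINETO, False),
    (1, 1, LINETO, False),
    (0, 1, LINETO, True),
]
print(segments_to_svg(square))
```

With redstork available, passing the loaded glyph in place of `square` should work once the three constants are swapped for the `Glyph` ones, since the glyph yields tuples of the same shape.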

API docs

https://red-stork.readthedocs.io

Download files

Download the file for your platform.

Source Distributions

No source distribution files available for this release.

Built Distributions


redstork-0.0.41-py3-none-manylinux1_x86_64.whl (6.9 MB view details)

Uploaded Python 3

redstork-0.0.41-py3-none-macosx_10_9_intel.whl (24.1 kB view details)

Uploaded Python 3 macOS 10.9+ Intel (x86-64, i386)

File details

Details for the file redstork-0.0.41-py3-none-manylinux1_x86_64.whl.

File metadata

  • Download URL: redstork-0.0.41-py3-none-manylinux1_x86_64.whl
  • Upload date:
  • Size: 6.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.18.4 setuptools/44.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/2.7.17

File hashes

Hashes for redstork-0.0.41-py3-none-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 913609b8ce86e30167fa19aa39f7b89d1e4688c59189ccf2e91b8753484d8fe9
MD5 cb83845d9cad43bd8592f13e1c381e4c
BLAKE2b-256 360ed4aa12f9b37b359b7866c826a7be3e38ec9dadc5057a59a5cba32bd2d743


File details

Details for the file redstork-0.0.41-py3-none-macosx_10_9_intel.whl.

File metadata

  • Download URL: redstork-0.0.41-py3-none-macosx_10_9_intel.whl
  • Upload date:
  • Size: 24.1 kB
  • Tags: Python 3, macOS 10.9+ Intel (x86-64, i386)
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/2.7.17

File hashes

Hashes for redstork-0.0.41-py3-none-macosx_10_9_intel.whl
Algorithm Hash digest
SHA256 4687f549f56cdcd07d999a8b3c6ac496f44184a55113fb12ca1fb303db3be8f0
MD5 8c354496ed9a2d7d6593b4d3d9b9ea7d
BLAKE2b-256 bdd932f8d01d9234894d8a7cad6455d824e6b5393884a80296f24b86eba8b1ef

