
Parsing PDF files with PDFium

Project description

redstork


PDF Parsing library, based on PDFium.

Requirements

  • Python 3

Platform support:

  • Fairly recent Linux (Ubuntu 18.04 or later). Older systems are not supported.
  • macOS 10.6 or later
  • Windows support is in the works

Installation

pip install redstork

Features

  • Render a whole page, or an arbitrary rectangle of it, to an image at a configurable scale
  • Update document metadata
  • Update font encoding (for some PDF documents)
  • Save document to a file
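
To make the "configurable scale" option more concrete: PDF page boxes, such as the MediaBox printed in the Quick start, are expressed in points (1/72 inch by default). The sketch below does only that arithmetic and does not call redstork. The helper names are hypothetical, and the idea that scale=1 renders one pixel per point is an assumption, not something confirmed by the redstork documentation.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def page_size_inches(media_box):
    """Return (width, height) in inches for a (x0, y0, x1, y1) box in points."""
    x0, y0, x1, y1 = media_box
    return ((x1 - x0) / POINTS_PER_INCH, (y1 - y0) / POINTS_PER_INCH)


def scaled_pixel_size(media_box, scale=1):
    """Estimate rendered (width, height) in pixels.

    ASSUMPTION: scale=1 maps one PDF point to one pixel.
    """
    x0, y0, x1, y1 = media_box
    return (round((x1 - x0) * scale), round((y1 - y0) * scale))


letter = (0.0, 0.0, 612.0, 792.0)    # the MediaBox shown in the Quick start
print(page_size_inches(letter))      # US Letter: 8.5 x 11 inches
print(scaled_pixel_size(letter, 2))  # estimated size for scale=2
```

If the assumption holds, doubling the scale doubles each dimension and quadruples the pixel count, which is worth keeping in mind when rendering many pages.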

Quick start

Download a sample PDF file and save it as sample.pdf, the name used in the code below.

from redstork import Document, PageObject, Glyph

doc = Document('sample.pdf')

print('Number of pages:', len(doc))
>> Number of pages: 15

print('MediaBox of the first page is:', doc[0].media_box)
>> MediaBox of the first page is: (0.0, 0.0, 612.0, 792.0)

print('Rotation of the first page is:', doc[0].rotation)
>> Rotation of the first page is: 0

print('Document title:', doc.meta['Title'])
>> Document title: Red Stork

print('First page has', len(doc[0]), 'objects')
>> First page has 4 objects

doc[0].render('page-0.ppm', scale=2)   # render page #1 as image

page = doc[0]
for o in page:
    if o.type == PageObject.OBJ_TYPE_TEXT:
        for code, _, _ in o:
            print(o.font[code], end='')
        print()
>> RedStork
>> Release0.0.1
>> Apr02,2020

for fid, font in doc.fonts.items():
    print(font.short_name, fid)
>> NimbusSanL-Bold (36, 0)
>> NimbusSanL-BoldItal (37, 0)

# Let's generate SVG markup for the first letter on page 1
text_object = [o for o in page if o.type == PageObject.OBJ_TYPE_TEXT][0]  # first text object
charcode, _, _ = text_object[0]  # first character of the first text object

# use the text object's own font, not the variable left over from the loop above
glyph = text_object.font.load_glyph(charcode)
path, delayed_c = [], []
for x, y, op, close in glyph:
    x, y = round(x, 3), round(y, 3)
    if op == Glyph.MOVETO:
        path.append(f'M {x} {y}')
    elif op == Glyph.LINETO:
        path.append(f'L {x} {y}')
    elif op == Glyph.CURVETO:
        delayed_c.append(f'{x} {y}')
        if len(delayed_c) == 3:
            path.append('C ' + ', '.join(delayed_c))
            delayed_c.clear()
    if close:
        path.append('Z')
path = ' '.join(path)
print('<svg><g fill="gray" transform="scale(100,-100)"><path d="' + path + '" /></g></svg>')
>> <svg><g fill="gray" transform="scale(100,-100)"><path d="M 0.291 0.289 L 0.463 0.289 C 0.52 0.289, ... L 0.318 0.414 Z" /></g></svg>
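
The glyph-to-SVG loop above can be packaged as a reusable function. The following is a minimal sketch, not part of redstork: the names svg_path and svg_document are hypothetical, and the module-level MOVETO, LINETO and CURVETO values are stand-ins so that the example runs without the library. With redstork installed, one would presumably pass Glyph.MOVETO, Glyph.LINETO and Glyph.CURVETO instead.

```python
# Stand-in operation codes (ASSUMPTION: replace with Glyph.MOVETO etc. in real use)
MOVETO, LINETO, CURVETO = 'moveto', 'lineto', 'curveto'


def svg_path(segments, moveto=MOVETO, lineto=LINETO, curveto=CURVETO, digits=3):
    """Build an SVG path string from (x, y, op, close) tuples."""
    parts, pending = [], []
    for x, y, op, close in segments:
        point = f'{round(x, digits)} {round(y, digits)}'
        if op == moveto:
            parts.append('M ' + point)
        elif op == lineto:
            parts.append('L ' + point)
        elif op == curveto:
            # a cubic Bezier segment arrives as three consecutive points
            pending.append(point)
            if len(pending) == 3:
                parts.append('C ' + ', '.join(pending))
                pending.clear()
        if close:
            parts.append('Z')
    return ' '.join(parts)


def svg_document(path, scale=100):
    """Wrap a path in SVG markup; the negative y scale flips the glyph upright."""
    return (f'<svg><g fill="gray" transform="scale({scale},-{scale})">'
            f'<path d="{path}" /></g></svg>')


demo = [
    (0, 0, MOVETO, False),
    (1, 0, LINETO, False),
    (1, 1, CURVETO, False),
    (0.5, 1.5, CURVETO, False),
    (0, 1, CURVETO, True),
]
print(svg_document(svg_path(demo)))
```

Passing the operation codes as parameters keeps the function independent of the library, so it can be tested with plain tuples before being fed real glyph data.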

API docs

https://red-stork.readthedocs.io

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for redstork, version 0.0.41

Filename                                          Size      File type   Python version
redstork-0.0.41-py3-none-macosx_10_9_intel.whl    24.1 kB   Wheel       py3
redstork-0.0.41-py3-none-manylinux1_x86_64.whl    6.9 MB    Wheel       py3
