Parsing PDF files with PDFium

redstork

PDF parsing library based on PDFium.

Requirements

  • Python 3

Platform support:

  • Fairly recent Linux (Ubuntu 18.04 or later). Older systems are not supported.
  • macOS 10.6 or later
  • Windows support is in the works

Installation

pip install redstork

Features

  • Render a page, or an arbitrary rectangle of it, to an image at a configurable scale
  • Update document metadata
  • Update font encoding (for some PDF documents)
  • Save document to a file
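
As an illustration of the image-conversion feature, the following is a minimal sketch that renders every page of a document. It relies only on calls that appear in the Quick start (len(doc), doc[i] and render(file_name, scale=...)). The helper name render_all_pages and the out_dir parameter are hypothetical and not part of redstork.

```python
from pathlib import Path


def render_all_pages(doc, out_dir, scale=2):
    """Render each page of `doc` to <out_dir>/page-<n>.ppm and return the paths.

    `doc` is assumed to behave like redstork's Document: it supports len()
    and integer indexing, and each page has render(file_name, scale=...).
    This helper is a hypothetical convenience wrapper, not a redstork API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index in range(len(doc)):
        target = out_dir / f'page-{index}.ppm'
        # Same call shape as in the Quick start: page.render(name, scale=...)
        doc[index].render(str(target), scale=scale)
        written.append(target)
    return written
```

With redstork installed, render_all_pages(Document('sample.pdf'), 'out') should write one .ppm file per page into the out directory.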

Quick start

Download a sample PDF file and save it as sample.pdf in the working directory.

from redstork import Document, PageObject, Glyph

doc = Document('sample.pdf')

print('Number of pages:', len(doc))
# >> Number of pages: 15

print('MediaBox of the first page is:', doc[0].media_box)
# >> MediaBox of the first page is: (0.0, 0.0, 612.0, 792.0)

print('Rotation of the first page is:', doc[0].rotation)
# >> Rotation of the first page is: 0

print('Document title:', doc.meta['Title'])
# >> Document title: Red Stork

print('First page has', len(doc[0]), 'objects')
# >> First page has 4 objects

doc[0].render('page-0.ppm', scale=2)   # render the first page as an image at 2x scale

# print the text of every text object on the first page
page = doc[0]
for o in page:
    if o.type == PageObject.OBJ_TYPE_TEXT:
        for code, _, _ in o:
            print(o.font[code], end='')   # map the character code to text via the object's font
        print()
# >> RedStork
# >> Release0.0.1
# >> Apr02,2020

# list the fonts used in the document
for fid, font in doc.fonts.items():
    print(font.short_name, fid)
# >> NimbusSanL-Bold (36, 0)
# >> NimbusSanL-BoldItal (37, 0)

# let's generate an SVG image of the first letter on page 1
text_object = [o for o in page if o.type == PageObject.OBJ_TYPE_TEXT][0]  # first text object
charcode, _, _ = text_object[0]  # first character of the first text object

# use the text object's own font, not the variable left over from the loop above
glyph = text_object.font.load_glyph(charcode)
path, delayed_c = [], []
for x, y, op, close in glyph:
    x, y = round(x, 3), round(y, 3)
    if op == Glyph.MOVETO:
        path.append(f'M {x} {y}')
    elif op == Glyph.LINETO:
        path.append(f'L {x} {y}')
    elif op == Glyph.CURVETO:
        # a cubic curve arrives as three consecutive points; emit it once all three are collected
        delayed_c.append(f'{x} {y}')
        if len(delayed_c) == 3:
            path.append('C ' + ', '.join(delayed_c))
            delayed_c.clear()
    if close:
        path.append('Z')
path = ' '.join(path)
print('<svg><g fill="gray" transform="scale(100,-100)"><path d="' + path + '" /></g></svg>')
# >> <svg><g fill="gray" transform="scale(100,-100)"><path d="M 0.291 0.289 L 0.463 0.289 C 0.52 0.289, ... L 0.318 0.414 Z" /></g></svg>
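
The SVG printed above carries no XML namespace declaration and no viewBox, so some viewers may refuse to display it as a standalone file. The sketch below wraps the same path-building logic in a function that returns a self-contained SVG document. The function name glyph_to_svg and its parameters are hypothetical. The operation constants are passed in so the function does not depend on redstork directly. The viewBox assumes glyph coordinates fall roughly within the unit square, as the sample output suggests, so descenders or wide glyphs may be clipped.

```python
def glyph_to_svg(segments, moveto, lineto, curveto, scale=100):
    """Build a standalone SVG document from (x, y, op, close) glyph segments.

    `moveto`, `lineto` and `curveto` are the operation constants to compare
    against (with redstork: Glyph.MOVETO, Glyph.LINETO, Glyph.CURVETO).
    Assumption: coordinates lie roughly in the unit square with y pointing up.
    """
    parts, pending = [], []
    for x, y, op, close in segments:
        x, y = round(x, 3), round(y, 3)
        if op == moveto:
            parts.append(f'M {x} {y}')
        elif op == lineto:
            parts.append(f'L {x} {y}')
        elif op == curveto:
            # Cubic curves arrive as three consecutive points.
            pending.append(f'{x} {y}')
            if len(pending) == 3:
                parts.append('C ' + ', '.join(pending))
                pending.clear()
        if close:
            parts.append('Z')
    d = ' '.join(parts)
    # scale(s, -s) flips the y axis, so the drawing occupies y in [-s, 0].
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 -{scale} {scale} {scale}">'
        f'<g fill="gray" transform="scale({scale},-{scale})">'
        f'<path d="{d}" /></g></svg>'
    )
```

With redstork installed, the result of glyph_to_svg(glyph, Glyph.MOVETO, Glyph.LINETO, Glyph.CURVETO) can be written to a file such as glyph.svg and opened in a browser.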

API docs

https://red-stork.readthedocs.io

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

redstork-0.0.41-py3-none-manylinux1_x86_64.whl (6.9 MB), Python 3

redstork-0.0.41-py3-none-macosx_10_9_intel.whl (24.1 kB), Python 3, macOS 10.9+ intel
