gymnast

Gymnast: PDF document parser in Python 3

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Text Processing
- Utilities

Project description

Gymnast: It’s not Acrobat

A PDF parser written in Python 3 (a backport to 2.7 is in the works). It is designed to provide a Pythonic interface for reading (and, eventually, writing) Adobe PDF files. Some attributes have non-Pythonic capitalization so that they match the underlying structure of the PDF document; doing otherwise would get very confusing.

Usage

import io
from gymnast          import PdfDocument
from gymnast.renderer import PdfBaseRenderer

class PdfSimpleRenderer(PdfBaseRenderer):
    """Simple renderer example that just extracts text with no processing"""
    def __init__(self, page):
        super().__init__(page)
        self._text = io.StringIO()
    def _render_text(self, text, new_state):
        self._text.write(self.active_font.decode_string(text))
    def _return(self):
        return self._text.getvalue()

fname = '/path/to/file.pdf'
pdf   = PdfDocument(fname).parse()
# Render the third-from-last page with the renderer class defined above
text  = PdfSimpleRenderer(pdf.Pages[-3]).render()

TODO (in no particular order)

Features and functionality
[x] Rewrite the parser and document class to lazy-load the document based on the xrefs table
[x] Complete the base page renderer
[ ] Page Rendering
- [x] Getting the BaseRenderer class working
- [x] Implement a proof of concept extractor that just dumps strings
- [ ] Get a bit fancier, assigning textblocks to lines and such
[ ] Handle page numbering more fully
- [ ] Add a method to PdfDocument to get a page by number
- [ ] Add properties to PdfPage for the page number (both as an int and as a str formatted according to PdfDocument.Root.PageLabels['Nums'])
[ ] Backport to Python 2.7 (about 80% done or so)
[ ] Font stuff
- [x] Carve the PdfFont class into an abstract PdfBaseFont and a PdfType1Font implementation
- [x] PdfFont.__new__ will pick the correct subclass based on the font’s Subtype element
- [x] PdfBaseFont class will also have an abstract method for the glyph space to text space transformation
- [ ] Add subclass for Type3 fonts
- [x] Add subclass for TrueType fonts
- [ ] Add subclass for composite fonts
- [x] Add legacy support for the 14 standard fonts
- [ ] Font-to-unicode CMAPs
[ ] Implement the remaining StreamFilters (will probably have the image ones return a PIL.Image)
- [ ] RunLengthDecode
- [ ] CCITTFaxDecode
- [ ] JBIG2Decode
- [ ] DCTDecode
- [ ] JPXDecode
- [ ] Crypt
[ ] Implement remaining object types
- [ ] ObjStm
- [x] XRef
- [ ] Filespec
- [ ] EmbeddedFile
- [ ] CollectionItem / CollectionSubitem
- [ ] XObject
[ ] Handle document encryption
[ ] Start on graphics stuff (maybe)
[ ] Interactive forms (AcroForms)
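
As an illustration of one of the unimplemented StreamFilters listed above, here is a minimal sketch of RunLengthDecode as the PDF specification describes it: a length byte of 0 to 127 means "copy the next length + 1 bytes", 129 to 255 means "repeat the next byte 257 - length times", and 128 marks end of data. The function name run_length_decode is hypothetical and this is not gymnast's code; it does not use gymnast's StreamFilter interface, whose shape is not shown here.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode-encoded bytes (standalone sketch, not gymnast's API)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # 128 is the end-of-data marker; ignore anything after it
            break
        if length < 128:
            # Literal run: copy the next length + 1 bytes unchanged
            out += data[i:i + length + 1]
            i += length + 1
        else:
            # Repeated run: the next single byte occurs 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

For example, the bytes 02 61 62 63 FE 7A 80 decode to a literal run "abc" followed by "z" repeated three times.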
Administrative
[ ] Write tests for existing code
[x] Come up with a better name
[ ] Document everything much, much better internally
[ ] Package it up neatly and pypi it
[ ] Write some proper documentation

Release history

This version

0.1a5 pre-release

Nov 18, 2015

Download files


Source Distribution

gymnast-0.1a5.zip (212.0 kB)

Uploaded Nov 18, 2015 Source

File details

Details for the file gymnast-0.1a5.zip.

File metadata

Download URL: gymnast-0.1a5.zip
Upload date: Nov 18, 2015
Size: 212.0 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for gymnast-0.1a5.zip
Algorithm	Hash digest
SHA256	`66eeb12762d7af83acacf2c3af69ab0af282cb7150167106ac6aef2e05de0b51`
MD5	`09ad7634c63246c7bc128d55ec9abae4`
BLAKE2b-256	`0d472bf455d7817f0fe22de578d3467892fbb9eca551684b3cc6ae963153f96c`
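
A downloaded archive can be checked against the SHA256 digest in the table above. The following is a minimal sketch using only the standard library's hashlib; the helper names sha256_of and matches_expected are illustrative, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest listed for gymnast-0.1a5.zip in the table above
EXPECTED_SHA256 = "66eeb12762d7af83acacf2c3af69ab0af282cb7150167106ac6aef2e05de0b51"

def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Calling matches_expected('gymnast-0.1a5.zip') on the downloaded file should return True if the archive is intact.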

