Simple, Pythonic extraction of images, text, and shapes from PDFs

minecart is a Python package that simplifies the extraction of text, images, and shapes from a PDF document. It provides a Pythonic interface for reading the position, color, and font metadata of every object in the PDF. It is a pure-Python package (it depends on pdfminer for the low-level parsing). minecart takes inspiration from Tim McNamara’s slate, but aims to provide more detailed information:

>>> import minecart
>>> pdffile = open('example.pdf', 'rb')
>>> doc = minecart.Document(pdffile)
>>> page = doc.get_page(3)
>>> for shape in page.shapes.iter_in_bbox((0, 0, 100, 200)):
...     print shape.path, shape.fill.color.as_rgb()
...
>>> im = page.images[0].as_pil()  # requires pillow
>>> im.show()

Installation

Currently only Python 2.7 is supported. Support for 3.4+ (using pdfminer.six) is planned.

  1. The easy way: pip install minecart

  2. The hard way: download the source code, change into the working directory, and run python setup.py install

For CJK languages: Supporting the CJK languages requires an additional step, as detailed in the pdfminer documentation.

Currently supported features

  • Shapes: You can extract path information, bounding box, stroke parameters, and stroke/fill colors. Color support is fairly robust, so a simple .as_rgb() call works in most cases. (To be concrete, minecart supports the DeviceRGB, DeviceCMYK, DeviceGray, and CIE-based color spaces. Indexed colors are supported if they index into one of the above.)

  • Images: minecart can easily extract images to PIL.Image objects.

  • Text: (Called Lettering in the source) In addition to extracting plain text from the PDF, you have access to position/bounding box information and the font used.
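As an illustration of what the position information makes possible, the sketch below orders the text on a page roughly in reading order. It uses page.letterings and .get_bbox(), both described under Documentation. The helper name is hypothetical, and the bounding-box layout (x0, y0, x1, y1) with y increasing up the page is an assumption, so check the docstrings before relying on it.

```python
def letterings_in_reading_order(page):
    """Return (lettering, bbox) pairs sorted top-to-bottom, then left-to-right.

    Hypothetical helper, not part of minecart.
    """
    items = [(lettering, lettering.get_bbox()) for lettering in page.letterings]
    # ASSUMPTION: bbox is (x0, y0, x1, y1) in PDF coordinates, where y grows
    # upward, so a larger y1 means the text sits higher on the page.
    items.sort(key=lambda item: (-item[1][3], item[1][0]))
    return items
```

Sorting on exact coordinates is a deliberate simplification: text on the same visual line can differ slightly in height, so a real document may need the y values rounded or grouped into lines first.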

If there’s a feature you’d like to extract from a PDF that’s not currently supported, open up an issue or submit a pull request! I’m especially interested in hearing whether there are many PDFs using color spaces outside of the ones currently supported.

Documentation

The main entry point will always be minecart.Document, which accepts a single parameter: an open file-like object that is read to create the document. The Document has two primary methods for accessing its contents: .get_page(num) and .iter_pages(). Both methods return minecart.Page objects, which provide access to the graphical elements found on the page. Page objects have three main attributes:

  • .images: A list of all the minecart.Image objects found on the page.

  • .letterings: A list of all the text objects found on the page, as Lettering objects. Lettering is a unicode subclass which adds bounding box and font information (using .get_bbox() or .font).

  • .shapes: A list of all the squares, circles, lines, etc. found on the page as Shape objects. Shape objects have three main attributes of interest:

    • .stroke: An object containing the stroke parameters used to draw the shape. .stroke has .color, .linewidth, .linecap, .linejoin, .miterlimit, and .dash attributes. If the shape was not stroked, .stroke will be None.

    • .fill: An object containing the fill parameters used to draw the shape. Right now, .fill only has a .color parameter.

    • .path: A list with the coordinates used to define the shape, as well as the type of line segment each set of coordinates defines. Refer to the minecart.Shape documentation for more details.
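The attributes listed above can be combined into a quick per-page summary. The following is a minimal sketch rather than part of minecart: describe_shape and summarize_page are hypothetical helpers, the check for .fill being None is an assumption (only .stroke is documented as possibly None), and calling .as_rgb() on a stroke color assumes stroke and fill colors share the same interface.

```python
def describe_shape(shape):
    """Summarize one Shape as a plain dict (hypothetical helper)."""
    summary = {"segments": len(shape.path), "stroke": None, "fill_rgb": None}
    if shape.stroke is not None:  # None means the shape was not stroked
        summary["stroke"] = {
            # ASSUMPTION: stroke colors offer .as_rgb() like fill colors do
            "rgb": shape.stroke.color.as_rgb(),
            "linewidth": shape.stroke.linewidth,
        }
    # ASSUMPTION: .fill may be None for shapes that were not filled
    if shape.fill is not None:
        summary["fill_rgb"] = shape.fill.color.as_rgb()
    return summary


def summarize_page(page):
    """Count the images and letterings on a page and describe each shape."""
    return {
        "images": len(page.images),
        "letterings": len(page.letterings),
        "shapes": [describe_shape(shape) for shape in page.shapes],
    }


# Usage (assumes minecart is installed and example.pdf exists):
#   import minecart
#   with open('example.pdf', 'rb') as pdffile:
#       for page in minecart.Document(pdffile).iter_pages():
#           print(summarize_page(page))
```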

I try to keep docstrings complete and up to date, so you can read through the source or use dir and help to see what methods are available. Most of the public interface is implemented in the content module, and miner has more of the PDF nitty-gritty stuff.
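One way to explore the interface with dir is to filter out the underscore-prefixed names. public_names below is a hypothetical convenience function, not something minecart provides; it uses only the built-in dir.

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Usage (assumes minecart is installed):
#   import minecart
#   print(public_names(minecart.Page))
#   help(minecart.Shape)
```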

Contributing

Bug reports are always welcome (using the GitHub tracker), as are feature requests. The PDF spec has so many corners that it is hard for me to prioritize implementing access to its various features. If there’s something you’d like to extract from a document that isn’t currently supported, please create a new issue.

If you’d like to contribute code, you can either create an issue and include a patch (if the changes are small) or fork the project and create a pull request.
