Provides resources to handle OpenXML documents.

openxmllib-py3

This is a fork of [openxmllib](https://github.com/glenfant/openxmllib) with Python3 support.

openxmllib is a set of tools that deals with the new ECMA 376 office file formats known as OpenXML.

http://www.ecma-international.org/publications/standards/Ecma-376.htm

OpenXML format is used by Microsoft Office 2007 and later. Apple iWork and OpenOffice have filters to use this format too, starting from iWork’08 and OO version 2.2.

Features

Tested features

  • Extract words from a document for indexing purposes.

  • Get metadata from a document

  • Add OpenXml mimetypes to standard mimetypes module.

  • Extract cover thumbnail image, if the document contains it

Planned features

  • Transform a document to HTML

Public API

These examples say it all:

>>> import openxmllib
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # Raises a ValueError on unsupported office files.
>>> doc.mimeType
'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
>>> doc.coreProperties # Keys may depend on application
{'title': 'blah...', 'creator': 'John Doe', ...}
>>> doc.extendedProperties # Keys may depend on application
{'Words': '312', 'Application': 'Your favorite word processor', ...}
>>> doc.customProperties # May return an empty mapping
{'My property': 'My value', ...}
>>> doc.allProperties # Merges core+extended+custom properties (see above)
{...}
>>> doc.indexableText(include_properties=False)
'all the words of that document body'
>>> doc.indexableText(include_properties=True)
'all the words of that document body and all properties values'
>>> doc.documentCover()
('jpg', <open file '/var/folders/.../docProps/thumbnail.jpeg', mode 'rb' at 0x1af300>)
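The cover comes back as an (extension, file object) pair, so saving it to disk takes only a few lines. The following is a minimal sketch, not part of openxmllib: the helper name save_cover and the output file name are hypothetical, and treating a missing cover as None is an assumption rather than documented behaviour.

```python
import io
import os
import tempfile


def save_cover(cover, target_dir):
    """Write an (extension, file object) cover pair into target_dir.

    Returns the path written, or None when no cover is given.
    Assumption: a document without a cover yields a falsy value.
    """
    if not cover:
        return None
    extension, stream = cover
    target_path = os.path.join(target_dir, "cover." + extension)
    with open(target_path, "wb") as target:
        target.write(stream.read())
    return target_path


if __name__ == "__main__":
    # Stand-in for doc.documentCover(): same pair shape, placeholder bytes.
    fake_cover = ("jpg", io.BytesIO(b"placeholder image bytes"))
    with tempfile.TemporaryDirectory() as workdir:
        print(save_cover(fake_cover, workdir))
```

In real use, the pair would come from doc.documentCover() instead of the placeholder.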

Standard mimetypes package extensions

>>> import mimetypes
>>> mimetypes.guess_type('somedoc.docx')
('application/vnd.openxmlformats-officedocument.wordprocessingml.document', None)
>>> mimetypes.guess_type('somecalc.xlsx')
('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', None)
>>> mimetypes.guess_type('someslides.pptx')
('application/vnd.openxmlformats-officedocument.presentationml.presentation', None)
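How openxmllib registers these types internally is not described here. For comparison, the same kind of registration can be sketched with the standard library alone, using mimetypes.add_type. The mapping below contains only the three types shown above, and the function name register_openxml_types is hypothetical.

```python
import mimetypes

# Extension -> MIME type pairs, taken from the examples above.
OPENXML_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def register_openxml_types():
    """Register the OpenXML types with the standard mimetypes module."""
    for extension, mime_type in OPENXML_TYPES.items():
        mimetypes.add_type(mime_type, extension)


register_openxml_types()
print(mimetypes.guess_type("somedoc.docx")[0])
```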

Document factory signatures:

>>> # We have the path for the office file
>>> doc = openxmllib.openXmlDocument(path='office.docx')
>>> # We have a file object for the office file
>>> fh = open('office.docx', 'rb')
>>> doc = openxmllib.openXmlDocument(file_=fh)
>>> # We have the URL for the office file
>>> doc = openxmllib.openXmlDocument(url='http://domain.tld/office.docx')
>>> # We have the raw data of the office file
>>> import mimetypes
>>> # guess_type() returns a (type, encoding) tuple; keep only the type
>>> docx_mimetype, _ = mimetypes.guess_type('office.docx')
>>> body = open('office.docx', 'rb').read()
>>> doc = openxmllib.openXmlDocument(data=body, mime_type=docx_mimetype)
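When the kind of source is only known at run time, the choice between the keyword arguments shown above can be wrapped in a small helper. This is a sketch under stated assumptions: factory_kwargs is a hypothetical name, the URL test is a simple prefix check, and requiring mime_type for raw data is inferred from the example above rather than from the library's documentation.

```python
def factory_kwargs(source, mime_type=None):
    """Map a source to keyword arguments for openXmlDocument().

    Hypothetical helper:
      bytes                      -> data (mime_type required)
      str starting with http(s)  -> url
      any other str              -> path
      object with a read method  -> file_
    """
    if isinstance(source, bytes):
        if mime_type is None:
            raise ValueError("raw data needs an explicit mime_type")
        return {"data": source, "mime_type": mime_type}
    if isinstance(source, str):
        if source.startswith(("http://", "https://")):
            return {"url": source}
        return {"path": source}
    if hasattr(source, "read"):
        return {"file_": source}
    raise TypeError("unsupported source: %r" % (source,))


if __name__ == "__main__":
    print(factory_kwargs("office.docx"))
    print(factory_kwargs("http://domain.tld/office.docx"))
```

The result would then be passed on as openxmllib.openXmlDocument(**factory_kwargs(source)).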

Note that if you’re not running a Python application, you may get the indexable text from a document with the openxmlinfo.py console utility. Just type:

$ openxmlinfo --help

Copying and License

Copyright (c) 2008 Gilles Lenfant

This software is subject to the provisions of the GNU General Public License, Version 2.0 (GPL). A copy of the GPL should accompany this distribution. THIS SOFTWARE IS PROVIDED “AS IS” AND ANY AND ALL EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE

More details in the COPYING file included in this package.

Status

Starting from version 2.0, this package is tested using Python 3.8.x on Linux. If dependencies can be met, it will most likely work on other environments as well.

Installation

Using pip:

$ pip install openxmllib-py3

Note that this also installs the excellent lxml package if it is not already installed.

From now on you can “import openxmllib” in your Python apps and use the “openxmlinfo” command line utility.

Gotchas

Be aware that most text data returned by the various openxmllib services may be either us-ascii or Unicode. This is a side effect of lxml (bug or feature?). It’s up to your application to convert these texts to the appropriate charset.

openxmllib does not handle exceptions caused by malformed XML or other unexpected structures. You should handle these potential problems in a try (…) except (…) block in your application.
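One way to follow this advice is to funnel every extraction through a single guarded function. The sketch below is an illustration, not library code: extract_text_safely is a hypothetical name, and the document factory is passed in as a parameter (in real use, openxmllib.openXmlDocument) so that the example does not depend on the package being installed. Only ValueError is documented above, so everything else is caught with a broad handler and logged.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_safely(factory, **factory_args):
    """Return the indexable text, or None if the document can't be processed.

    `factory` is the document factory, e.g. openxmllib.openXmlDocument.
    """
    try:
        doc = factory(**factory_args)
        return doc.indexableText(include_properties=True)
    except ValueError as exc:
        # Documented above: raised for unsupported office files.
        logger.warning("Unsupported document (%s): %s", sorted(factory_args), exc)
    except Exception:
        # Malformed XML or unexpected structure. The exact exception
        # types are not documented, hence the broad handler.
        logger.exception("Could not process document (%s)", sorted(factory_args))
    return None
```

A call would look like extract_text_safely(openxmllib.openXmlDocument, path='office.docx'). Returning None keeps an indexing loop running when a single document is broken.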

Developing and testing

You should grab openxmllib from its repository at https://github.com/wilbertom/openxmllib-py3.

Then:

$ cd /where/you/installed/openxmllib
$ python setup.py develop

Note that running the tests does not require installing the package:

$ cd tests
$ python runalltests.py

Support

Use the issue tracker provided on the project site.

Credits

  • Gilles Lenfant [gilles.lenfant] <gilles dot lenfant at gmail dot com>

  • Kevin Deldycke [kevin.deldycke] <kevin at deldycke dot com>

  • Hugo Lopes Tavares [hltbra] <hltbra at gmail dot com>

  • Petri Savolainen [petri] <petri dot savolainen at koodaamo dot fi>

  • Eric Wohnlich [ewohnlich] <https://github.com/ewohnlich>

  • Wilberto Morales [wilbertom] <https://github.com/wilbertom/>

Future features and bugfixes

Features

Remove downloaded temporary file

When data comes from an HTTP (…) URL, it’s stored in a temporary file that’s not deleted after processing.
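Until this is addressed, an application can sidestep the issue by downloading the file itself and handing the local path to the factory. The following is a sketch of that workaround, not library code: the context manager name downloaded is hypothetical, and it relies only on the standard library.

```python
import contextlib
import os
import shutil
import tempfile
import urllib.request


@contextlib.contextmanager
def downloaded(url, suffix=""):
    """Download url to a temporary file, yield its path, then delete it."""
    handle, temp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(handle, "wb") as target:
            with urllib.request.urlopen(url) as response:
                shutil.copyfileobj(response, target)
        yield temp_path
    finally:
        # Runs whether or not the download or the caller's code failed.
        os.remove(temp_path)


# Intended use (requires openxmllib and network access):
# with downloaded('http://domain.tld/office.docx', suffix='.docx') as path:
#     doc = openxmllib.openXmlDocument(path=path)
#     text = doc.indexableText(include_properties=False)
```

All work on the document should happen inside the with block, since the file is gone once the block exits.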

Support for standard mimetypes module

Add our mime types to standard Python module.

Human readable plain text conversion

>>> from openxmllib import openXmlDocument
>>> doc = openXmlDocument(...)
>>> doc.textDocument(target_directory)

(this may not be possible for spreadsheets)

HTML conversions

>>> from openxmllib import openXmlDocument
>>> doc = openXmlDocument(...)
>>> doc.htmlDocument(target_directory)

This requires finding open source XSLT stylesheets.

Document generation

FIXME: more to say here

Bugfixes

…Waiting for feedback ;o)

History

1.1.1

  • Fix text extraction from Word template (.dotx) documents [pdpotter]

1.1

  • New feature: document cover image extraction (when present) [petri]

  • Remove old pointers in README etc. pointing to old google code repo [petri]

  • Update lxml dependency (require >= 3.4.0 now) [petri]

1.0.7

  • Fixed setup.py that imports indirectly lxml. Raises failure in buildout. Issue # 11 [gilles.lenfant]

  • unit tests temporary http server did not work [gilles.lenfant]

1.0.6

  • The bug of mid-word style change is still not fixed in presentations and spreadsheets :/ Anyway, we needed an API sanitization. [gilles.lenfant]

  • Factory API changed for a safer and faster document object construction. [gilles.lenfant]

  • Added support for new mime types that are not in the standard mimetypes module. [gilles.lenfant]

1.0.5

  • Optimizations on large documents. [gilles.lenfant]

  • CamelCased functions and method names in consistency with applied rules. [gilles.lenfant]

  • Version reset to 1.0.5 [gilles.lenfant]

  • Support for urllib compatible URLs [gilles.lenfant]

  • New: Support for URLs [hltbra]

  • Fixed implementation so that old tests pass (the “midword”/“metadata” case, bold + normal style was not ok) [hltbra]

1.0.4

  • Compliance with python 2.5 and lxml 2.2. Still works with python 2.4 and lxml 1.3.6 [gilles.lenfant]

  • Automate package and version definition

  • Bump version to 1.0.4 2008-12-11 [kevin.deldycke]

1.0.3

  • Conforming XPath constructor signature. [gilles.lenfant]

  • New test files built with Mac Office 2008 [gilles.lenfant]

1.0.2

  • Fix bad “egging”. [kevin.deldycke]

1.0.1

  • Egg-ification. [kevin.deldycke]

1.0.0

  • First public version. [gilles.lenfant]
