This package provides tools like daemons and converters to ease access
to OpenOffice.org installations for Python programmers.
The main purpose of the whole package is to provide support for
converting office documents from Python using OpenOffice.org but
without the need to have PyUNO support with the Python binary that
actually runs your Python application (like Plone, for instance).
The complete documentation of the most recent release can be found at
oooctl
A commandline script to start/stop OpenOffice.org as a daemon
(without X). While OOo brings this functionality out-of-the-box, the
daemon also monitors the status of the OOo server and restarts it if
necessary.
pyunoctl
A commandline script to start/stop a converter daemon that listens
for requests to convert office documents from .odt, .doc, .docx,
etc. to HTML or PDF using OpenOffice.org. Includes a caching
mechanism that holds documents that were already converted.
An API to access any pyunoctl daemon programmatically using Python.
ulif.openoffice is a Python package to support document
transformations using OpenOffice.org (OOo).
It provides components to ask a running OOo-server for document
conversions from office-type documents like .doc or .odt to HTML or
PDF. Using ulif.openoffice you can trigger such conversions via the
commandline or via a Python API that also works with Python versions
without any PyUNO support.
Furthermore, it provides a caching server that caches every document
once it has been converted and delivers the cached result when the same
document is requested again. Depending on your needs this can speed
things up by a factor of 10 or more.
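The caching idea can be illustrated with a minimal sketch. It is not the package's actual cache manager: the class name ConversionCache, the in-memory dictionary and the convert callback are assumptions made for illustration only. The key is built from the MD5 sum of the document, in line with the MD5-based resource names mentioned for the RESTful mode.

```python
import hashlib


class ConversionCache:
    """Toy in-memory cache keyed by document content and target format.

    Hypothetical illustration only; not part of ulif.openoffice.
    """

    def __init__(self):
        self._results = {}

    @staticmethod
    def key_for(source_bytes, target_format):
        # Identical input bytes always yield the same key.
        digest = hashlib.md5(source_bytes).hexdigest()
        return (digest, target_format)

    def get_or_convert(self, source_bytes, target_format, convert):
        key = self.key_for(source_bytes, target_format)
        if key not in self._results:
            # Cache miss: run the (expensive) conversion exactly once.
            self._results[key] = convert(source_bytes, target_format)
        return self._results[key]
```

A second request for the same bytes and the same target format returns the stored result without calling convert again, which is where the speed-up comes from.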
ulif.openoffice requires some PyUNO-capable Python somewhere to do
the actual conversions. It also provides a client-API for Python code that
does not provide that support. Current Debian-based distributions
normally offer a package for PyUNO support.
ulif.openoffice is tested on Debian-based systems, most notably
Ubuntu, and won’t work on Windows.
The package is designed for server-based deployments. While the
OOo-server is running, you cannot use the office suite on your desktop
(at least at the time of writing). This is a limitation of OOo
itself.
ulif.openoffice mainly provides three different components:
An oooctl server that runs in background, starts a local
OOo-server and monitors its status. If the OOo server process dies,
it is restarted by oooctl.
A pyunoserver, which is a TCP-server that implements its own
protocol to listen for conversion requests. When a valid request
arrives, it tries to contact a local OOo server to do the
conversion.
The pyunoserver also runs a cache manager that caches already
converted documents and delivers them in case the converted
version already exists.
This component needs access to the PyUNO library.
A client library to talk to the PyUNO server. This component does
not require PyUNO.
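The monitoring behaviour described for oooctl (restart the office process when it dies) can be sketched with the standard library. This is a simplified illustration, not the code oooctl uses: ensure_running, monitor and start_dummy_server are hypothetical names, and a sleeping Python child process stands in for the real office binary.

```python
import subprocess
import sys
import time


def ensure_running(process, start):
    """Return a live process, restarting via start() if it has exited."""
    # poll() returns None while the child is alive, else its exit code.
    if process is None or process.poll() is not None:
        return start()
    return process


def monitor(start, checks, interval=1.0):
    """Check the child `checks` times, sleeping `interval` seconds between."""
    process = None
    for _ in range(checks):
        process = ensure_running(process, start)
        time.sleep(interval)
    return process


def start_dummy_server():
    # Placeholder for the real office binary: a child that just sleeps.
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
```

A real daemon would loop indefinitely instead of a fixed number of checks and would also handle signals and logging.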
The three components play together roughly as shown in the following
figure:
Fig. 1: Overview of ulif.openoffice components
The blue lines show the path of a source document (in .doc
format) to the OpenOffice.org server, while the red lines show the
path the converted document (PDF) takes on its way back.
Use of client-API, oooctl server and cache is optional.
All this currently happens on the same machine. There are plans for
support of multi-machine scenarios with distributed servers and
load-balancing features.
There are, unfortunately, zillions of reasons why OpenOffice.org
might fail to start in the background on a system.
The scripts in here were tested with Ubuntu and work.
It is mandatory that the system user running oooctl is a regular
user with at least a home directory. OpenOffice.org relies on that
directory to store information even in headless mode.
Recent OpenOffice.org versions require no X-server for running.
If you want to use an Ubuntu (or Debian) prepared install of OOo, you
must make sure that you have installed the following packages with
apt-get:
openoffice.org-headless (for Ubuntu < 9.04, not needed for newer)
openoffice.org-java-common
in addition to the usual OOo packages, i.e.:
openoffice.org (at least for Ubuntu >= 9.04)
msttcorefonts
The latter is optional but needed to have the most common fonts used
with OpenOffice.org documents available. Without the correct fonts
installed, results of document transforms might be poor.
Then you need at least one Python version that can run:
$ python -c "import uno"
without raising any exceptions.
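The same check can be done from inside Python. The following sketch uses only the standard library; has_module is a hypothetical helper name. Note that it only checks whether the module can be located, so a broken PyUNO install could still fail on the actual import.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be located."""
    # find_spec() returns None when the module is not on the import path.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("PyUNO available:", has_module("uno"))
```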
On newer Ubuntu versions you can install:
python-uno (if available)
The clients and other software apart from the oooctl-server and the
pyuno-server can be run with a different Python version.
If you successfully installed this package on a different system, we’d
be glad to hear from you, especially if you could tell us which
system packages you used.
Instead of using zc.buildout you can also use easy_install.
If using easy_install, you might have to install the package
twice: once with a Python binary that supports PyUNO and once
with the Python binary that will be used by your application.
Make sure you have at least one Python version that supports PyUNO.
See Prerequisites above.
For this Python version, install easy_install (only needed if
not already present, of course):
There are four main components that come with ulif.openoffice:
an oooctl-server that starts OpenOffice.org in background.
a pyuno-server that listens for requests to convert docs. This
server depends on a running oooctl-server.
a client component that can be accessed via API and can talk to the
pyuno-server. This way you can convert docs from Python, and the
Python version does not need to provide the uno lib.
a converter script (also in ./bin) that you can use on the
commandline. It depends on a running oooctl server and can convert
docs to .txt, .html and .pdf format. It is merely a little test
programme that was used during development, but you might have some
use for it.
We can start an OpenOffice.org daemon using the oooctl script. This
daemon starts an already installed OpenOffice.org instance as server
(without GUI, so it is usable on servers).
The oooctl script is defined in setup.py to be installed as a
console script, so if you install ulif.openoffice with
easy_install or setup.py, an executable script will be installed
in your local bin/ directory.
Here we ‘fake’ this install by using buildout, which will install the
script in our test environment.
The main actions are to call the script with one of the:
start|fg|stop|restart|status
commands as argument, where fg means: start in foreground. This
can be handy if you want the process to be monitored by third-party
tools like some supervisor daemon or similar. In that case the process
will not detach from the invoking shell on startup.
oooctl needs to know which OOo install should be used and where it
lives. We can set the path to the binary using the -b or
--binarypath switch of oooctl.
By default this path is set to:
>>> from ulif.openoffice.oooctl import OOO_BINARY
>>> OOO_BINARY
'/usr/lib/openoffice/program/soffice'
which might not be true for your local system.
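If the default path does not exist on your system, the binary has to be located some other way before it is passed to oooctl via -b. A minimal sketch, assuming the binary is called soffice and is reachable via PATH (find_soffice is a hypothetical helper, not part of ulif.openoffice):

```python
import os
import shutil

# Default taken from the text above; other locations are looked up on PATH.
DEFAULT_BINARY = '/usr/lib/openoffice/program/soffice'


def find_soffice(default=DEFAULT_BINARY, name='soffice'):
    """Return a path to an executable office binary, or None."""
    if os.path.isfile(default) and os.access(default, os.X_OK):
        return default
    # shutil.which() searches the directories listed in PATH.
    return shutil.which(name)
```

The returned path, if any, could then be handed to oooctl with the -b switch.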
For our local test we create an executable script which will fake a
real OpenOffice.org binary:
>>> import sys
>>> write('fake_soffice',
... '''#!%s
... import sys
... import pprint
... sys.stdout.write("Fake soffice started with these options/args:\\n")
... pprint.pprint(sys.argv)
... sys.stderr.flush()
... sys.stdout.flush()
... while 1:
...     pass
... ''' % sys.executable)
This script will simply loop forever (well, sort of). We determine the
exact absolute path of our ‘binary’:
>>> import os
>>> soffice_path = os.path.join(os.getcwd(), 'fake_soffice')
We must make this script executable:
>>> os.chmod('fake_soffice', 0700)
Now we can call the daemon and tell it to start our faked office
server:
>>> print system("%s -b %s start" % (join('bin', 'oooctl'), soffice_path))
starting OpenOffice.org server, going into background...
started with pid ...
<BLANKLINE>
By default the daemonized programme’s output will be redirected to
/dev/null. You can, however, use the --stdout, --stderr and
--stdin options to set appropriate log files.
In the logfile we can see what arguments and options the daemon used:
>>> cat(tmp_path)
Fake soffice started with these options/args:
['/sample-buildout/fake_soffice',
'-accept=socket,host=localhost,port=2002;urp;',
'-headless',
'-nologo',
'-nofirststartwizard',
'-norestore']
This script starts a server in background that allows conversion of
documents using the pyUNO API. It requires a running OO.org server in
background (see above).
Currently conversion from all OOo readable formats (.doc, .odt, .txt,
…) to HTML and PDF-A is supported. This means, if you can load a
document with OpenOffice.org, then this daemon can convert it to HTML
or PDF-A.
The conversion daemon starts a server in background (unless you
specify fg as startmode, which will keep the server attached to
the invoking shell) which listens for conversion requests on a TCP
port. It then calls OpenOffice.org via the pyUNO-API to perform the
conversion and responds with the path of the generated doc (or an
error message).
The conversion server is a multithreaded asynchronous TCP daemon. So,
several requests can be served at the same time.
Once the daemon has started, we can send requests. One of the commands
we can send tests the environment, the connection and so on. For this,
we need a client that sends commands and parses the responses for us. It
is not difficult to write your own client (a few lines of socket code
will do), but if you’re writing third-party software you might prefer
the ready-to-use client from ulif.openoffice.client, which should give
you a more consistent API over time and can hide changes in the
protocol etc.
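For illustration, a hand-written client could look roughly like the sketch below. It makes assumptions that the text does not confirm: that the server closes the connection after sending its reply, and that the reply can be read as plain text. The function name send_raw_command is hypothetical. The only protocol detail taken from this document is that commands are terminated by a newline.

```python
import socket


def send_raw_command(host, port, command, timeout=10.0):
    """Send one newline-terminated command and return the raw reply text."""
    if not command.endswith('\n'):
        command += '\n'
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode('ascii'))
        chunks = []
        while True:
            # Assumption: the server closes the connection when it is done,
            # which makes recv() return an empty bytes object.
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8', 'replace')
```

Parsing the reply into a flag, a status and a message is left out here, because the wire format is not described in this document. This is one reason to prefer the bundled client.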
Using the client in simple form can be done like this:
>>> from ulif.openoffice.client import PyUNOServerClient
>>> def send_request(ip, port, message):
...     client = PyUNOServerClient(ip, port)
...     result = client.sendRequest(message)
...     ok = result.ok and 'OK' or 'ERR'
...     return '%s %s %s' % (ok, result.status, result.message)
The client returns response objects, which always contain:
ok
a boolean flag indicating whether the request succeeded
status
a number indicating the response status. Status codes are generally
modelled on HTTP status codes, so 200 means ‘okay’ while any
other number indicates some problem in processing the request.
message
Any readable output returned by the server. This includes paths or
more verbose error messages in case of errors.
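The shape of such a response object can be mimicked with a named tuple. This is a stand-in for illustration only: the real class in ulif.openoffice.client may look different, and Response and describe are hypothetical names. Only the three attributes ok, status and message come from the description above.

```python
from collections import namedtuple

# Hypothetical stand-in for the client's response objects.
Response = namedtuple('Response', ['ok', 'status', 'message'])


def describe(response):
    """Format a response as '<OK|ERR> <status> <message>'."""
    flag = 'OK' if response.ok else 'ERR'
    return '%s %s %s' % (flag, response.status, response.message)
```

For example, describe(Response(False, 550, 'unknown command')) yields 'ERR 550 unknown command'.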
Commands sent always have to be closed by newlines:
>>> command = 'TEST\n'
As the default port is 2009, we can call the client like this:
>>> print send_request('127.0.0.1', 2009, command)
OK 0 0.2dev
The response tells us that
the request could be handled (‘OK’),
the status is zero (=no problems),
the version number of the server (‘0.2dev’).
If we send garbage, we get an error:
>>> command = 'Blah\n'
>>> print send_request('127.0.0.1', 2009, command)
ERR 550 unknown command. Use CONVERT_HTML, CONVERT_PDF or TEST.
Here the server tells us that
the request could not be handled (‘ERR’)
the status is 550
a hint about which commands we can use to talk to it.
As we can see, we are normally using HTTP status codes. This is also
meant to allow a simple switch to HTTP at some point in the future.
Before we go on, we have to give the server time to start up:
The response will contain a status (HTTP equivalent number), a boolean
flag indicating whether conversion was performed successfully and a
message, which in case of success contains the path of the generated
document:
>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.pdf'
Result directories returned by the client are always temporary
directories which can be used by the caller.
Instead of giving a path, we can also use the client with a
filename parameter and the contents of the file to be
converted. For this, we use the client’s convertToPDF method. This
takes slightly more time than the method above:
The response will contain a status (HTTP equivalent number), a boolean
flag indicating whether conversion was performed successfully and a
message, which in case of success contains the path of the generated
HTML document. All embedded files that belong to that document are
stored in the same directory as the HTML file:
>>> response.status
200
>>> response.ok
True
>>> response.message
'/tmp/.../simpledoc1.html'
Instead of giving a path, we can also use the client with a
filename parameter and the contents of the file to be
converted. For this, we use the client’s convertToHTML method. This
takes slightly more time than the method above:
Again, the message attribute of the response tells us where the
generated doc can be found:
>>> response.message
'/.../simpledoc1.html'
This time the document was created inside a temporary directory,
created only for this request. You should not make assumptions about
this location. All accompanying files like images, etc. are stored
in the same directory.
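Because the result location is temporary and should not be relied upon, a caller may want to copy the generated document and the files next to it to a place of its own. A minimal sketch using only the standard library (copy_result_dir is a hypothetical helper, not part of ulif.openoffice):

```python
import os
import shutil


def copy_result_dir(result_path, target_dir):
    """Copy the converted file and its sibling files into target_dir.

    result_path is the path returned in the response message.
    Returns the sorted list of copied file names.
    """
    source_dir = os.path.dirname(result_path)
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(source_dir)):
        source = os.path.join(source_dir, name)
        if os.path.isfile(source):
            # copy2() also preserves timestamps where possible.
            shutil.copy2(source, os.path.join(target_dir, name))
            copied.append(name)
    return copied
```

Subdirectories are skipped in this sketch. Whether the server ever creates any is not stated here.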
Note that the user that runs the OO.org server needs a valid home
directory where OOo stores data. In the test setup we create such a
home in the home directory:
>>> print "HOMEDIR>\n", ls('home')
HOMEDIR...
d .openoffice.or...
d .pyunocache
- newdoc1.doc
- simpledoc1.doc
...
The home also contains the cache dir for the PyUNOServer.
This script starts a server in background that allows conversion of
documents using the pyUNO API. It requires a running OO.org server in
background (see above).
Apart from usage in standard raw mode, pyunoctl can also be started
as a RESTful HTTP daemon. This enables remote usage, as all
communication is done using the HTTP protocol (including sending and
receiving files).
The RESTful HTTP mode can be enabled by setting the:
--mode=rest
option of pyunoctl.
We start pyunoctl in RESTful mode. The OOo daemon was already
started before.
>>> print system(join('bin', 'pyunoctl') + ' --stdout=/tmp/out '
... + '--mode=rest start')
startung RESTful HTTP server, going into background...
started with pid ...
<BLANKLINE>
We send a simple test request that should give us a status:
We GET documents from the server by asking for an existing MD5sum. The
MD5 sum of a document is also its resource name on the server. If a
document does not exist, we get a 404 error:
We ask for conversion (creating a resource) simply by POSTing a
document. As creating a POST request is a bit more complex, we use
utility functions from the util module:
Added license and copyright file to comply with policy of major
Linux distributors.
Added sphinx docs.
Fixed wrong result path when returning cached HTML results.
Added mode fg for oooctl. Using oooctl fg one can start
oooctl in foreground now.
Added mode fg for pyunoctl. Using pyunoctl fg one can start
pyunoctl in foreground now.
Added state check for oooctl: when OpenOffice.org server is down
during runtime it is restarted automatically. The check happens
every second.
Use standard lib doctest instead of zope.testing.doctest.
Changed PDF creation: by default, normal PDF (and not PDF/A) is now
created when converting to PDF. This is due to an endianness bug in
many recent OpenOffice.org binaries running on 64-bit platforms.