ftw.tika

This product integrates Apache Tika for full text indexing with Plone by providing portal transforms to text/plain for the various document formats supported by Tika.

Compatibility

ftw.tika is compatible with Plone 4.x and the Tika versions listed below (Plone and Tika versions can be mixed and matched).

  • Tika: 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11
  • Plone: 4.1, 4.2, 4.3

Supported Formats

Input Formats

  • Microsoft Office formats (Office Open XML)
    • *.docx Word Documents
    • *.dotx Word Templates
    • *.xlsx Excel Sheets
    • *.xltx Excel Templates
    • *.pptx PowerPoint Presentations
    • *.potx PowerPoint Templates
    • *.ppsx PowerPoint Slideshows
  • Legacy Microsoft Office (97) formats
  • Rich Text Format
  • OpenOffice ODF formats
  • OpenOffice 1.x formats
  • Common Adobe formats (InDesign, Illustrator, Photoshop)
  • PDF documents
  • WordPerfect documents
  • E-Mail messages

See the mimetypes module for details on the MIME types corresponding to these formats.

Formats supported by Tika, but not wired up (yet)

  • Electronic Publication Format
  • Compression and packaging formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

See the Supported Document Formats page on the Apache Tika Wiki for details.

Output Formats

  • text/plain

Installation

The preferred method is to run the Tika JAX-RS server as a daemon. Although it is possible to run Tika without a daemon (by starting it each time a file is converted), the daemon is much faster.

Both methods require tika-app.jar to be downloaded and some ZCML configuration for ftw.tika. The daemon method also requires the JAX-RS tika-server.jar to be downloaded.

Below are some configuration examples.

Daemon buildout example

See the included tika.cfg for a daemon example that you can adjust as necessary, copy into your buildout and extend from:

[buildout]
parts +=
    tika-app-download
    tika-server-download
    tika-server

[tika]
server-port = 9998
zcml =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika-app-download:destination}/${tika-app-download:filename}"
                     port="${tika:server-port}" />
    </configure>

[tika-app-download]
recipe = hexagonit.recipe.download
url = http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.11/tika-app-1.11.jar
md5sum = c292fbb0b28fbe44f915229afb839db8
download-only = true
filename = tika-app.jar

[tika-server-download]
recipe = hexagonit.recipe.download
url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.11/tika-server-1.11.jar
md5sum = 3c8fb21140213a2f3fbac770358034ab
download-only = true
filename = tika-server.jar

[tika-server]
recipe = collective.recipe.scriptgen
cmd = java
arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${tika:server-port} -includeStack

[instance]
zcml-additional = ${tika:zcml}
eggs += ftw.tika

Note

The -includeStack command line option for the Tika JAXRS server is only available for Tika >= 1.8. If you’re using an older version of Tika, omit it from the arguments. The option will make the Tika JAXRS server return Java stack traces in the response body in case of conversion failures, and therefore allow ftw.tika to provide more detailed error logging.
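
The version rule in this note can be captured in a small helper. The following is a minimal sketch and not part of ftw.tika: the function name tika_server_arguments and its parameters are hypothetical, and only the --port and -includeStack options are taken from the buildout example above.

```python
def tika_server_arguments(jar_path, port, tika_version):
    """Build the java arguments for launching the Tika JAXRS server.

    tika_version is a dotted version string such as "1.11".
    """
    # Compare versions numerically so that "1.10" sorts after "1.8".
    version = tuple(int(part) for part in tika_version.split("."))
    arguments = ["-jar", jar_path, "--port", str(port)]
    if version >= (1, 8):
        # -includeStack is only available for Tika >= 1.8.
        arguments.append("-includeStack")
    return arguments
```

Comparing the version as a tuple of integers avoids the string-comparison trap where "1.10" would sort before "1.8".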

If your deployment buildout is based on the deployment buildouts included in the ftw-buildouts repository on GitHub, you can simply extend tika-jaxrs-server.cfg and everything is configured for you:

[buildout]
extends =
    https://raw.github.com/4teamwork/ftw-buildouts/master/production.cfg
    https://raw.github.com/4teamwork/ftw-buildouts/master/zeoclients/4.cfg
    https://raw.github.com/4teamwork/ftw-buildouts/master/tika-jaxrs-server.cfg

deployment-number = 05

filestorage-parts =
    www.mywebsite.com

instance-eggs =
    mywebsite

Non-daemon buildout example

Note that running Tika in non-daemon mode is very, very slow!

If you don’t want to run Tika as a daemon, you can configure just the path to the tika-app.jar in the ftw.tika ZCML configuration, and ftw.tika will start tika-app.jar (in a new JVM) every time something needs to be converted.

Here is a short example of how to download the tika-app.jar and configure ftw.tika with buildout:

[buildout]
parts +=
    tika-app

[tika-app]
recipe = hexagonit.recipe.download
url = http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.11/tika-app-1.11.jar
md5sum = c292fbb0b28fbe44f915229afb839db8
download-only = true
filename = tika-app.jar

[instance]
eggs += ftw.tika
zcml-additional =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika-app:destination}/${tika-app:filename}" />
    </configure>
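
To illustrate why the non-daemon mode is slow, the following sketch builds the kind of command line that has to be run once per document. It is an assumption about how such a call could look, not ftw.tika's actual implementation: the helper name tika_app_command is hypothetical, while --text is the tika-app option for plain text output.

```python
def tika_app_command(jar_path, document_path):
    """Return the command that extracts plain text from a single file."""
    # --text asks tika-app to print the extracted plain text to stdout.
    return ["java", "-jar", jar_path, "--text", document_path]
```

Passing the returned list to subprocess.check_output() would start a fresh JVM for that one document, which is the per-conversion startup cost the daemon avoids.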

Different Host buildout example

If you already have a Tika server running elsewhere (for example, in a Docker container), you can connect to it without installing the server into the Plone buildout. Unfortunately, if a request runs into a timeout, ftw.tika will still try to fall back to the local Tika, and an error is written to the log file.

[buildout]

[tika]
server-port = 9998
server-host = myhost
server-timeout = 10
zcml =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config host="${tika:server-host}"
                     port="${tika:server-port}"
                     timeout="${tika:server-timeout}" />
    </configure>

[instance]
zcml-additional = ${tika:zcml}
eggs += ftw.tika

The following configuration options are available:

  • host: the host where tika is running
  • port: the port of the tika server
  • timeout: you can define the connection timeout of the server in seconds

timeout defaults to 10 seconds and is configurable for your needs. 0 means no timeout at all.
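
The timeout and fallback behaviour described in this section can be sketched as follows. This is an illustration under stated assumptions, not ftw.tika's code: the names effective_timeout and convert_with_fallback are hypothetical, the two converters are passed in as plain callables, and remote failures are assumed to surface as IOError (the base class of the requests library's RequestException).

```python
def effective_timeout(configured):
    """Translate the configured timeout into a value for an HTTP client.

    Assumption: the client uses None for "wait forever", so 0 maps to None.
    """
    seconds = float(configured)
    return None if seconds == 0 else seconds


def convert_with_fallback(remote_convert, local_convert, data):
    """Try the remote Tika server first and fall back to the local one."""
    try:
        return remote_convert(data)
    except IOError:
        # Assumption: timeouts and connection problems raise IOError
        # or one of its subclasses.
        return local_convert(data)
```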

Installing ftw.tika in Plone

  • Install ftw.tika by adding it to the list of eggs in your buildout. (The buildout examples above include adding ftw.tika to the eggs).
[instance]
eggs +=
    ftw.tika
  • Run buildout and start your instance
  • Go to Site Setup of your Plone site and activate the ftw.tika add-on, or depend on the ftw.tika:default profile from your package’s metadata.xml.

Uninstalling ftw.tika

ftw.tika has an uninstall profile. To uninstall ftw.tika, import the ftw.tika:uninstall profile using the portal_setup tool.

Configuration

ftw.tika expects to be provided with a path to an installed tika-app.jar. This can be done through ZCML, and therefore also through buildout.

Configuration in ZCML

The path to the tika-app.jar file must be configured in ZCML.

If you used the supplied tika.cfg as described above, you can reference the download location directly from buildout by using ${tika-app-download:destination}/${tika-app-download:filename}:

[instance]
zcml-additional =
    <configure xmlns:tika="http://namespaces.plone.org/tika">
        <tika:config path="${tika-app-download:destination}/${tika-app-download:filename}" />
    </configure>

If you installed Tika yourself, set path="/path/to/tika-app.jar" accordingly.

Usage

To use ftw.tika, simply ask the portal_transforms tool for a transformation to text/plain from one of the input formats supported by ftw.tika:

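from Products.CMFCore.utils import getToolByName

# The file field of the current context (a NamedFile or NamedBlobFile).
namedfile = self.context.file
transform_tool = getToolByName(self.context, 'portal_transforms')

# Ask portal_transforms for a text/plain version of the file's data.
stream = transform_tool.convertTo(
    'text/plain',
    namedfile.data,
    mimetype=namedfile.contentType)
# convertTo() may return None, so fall back to an empty string.
plain_text = stream and stream.getData() or ''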

Caching

If you want the result of the transform to be cached, you’ll need to pass a persistent ZODB object to transform_tool.convertTo() to store the cached result on.

For example, for a NamedBlobFile versioned with CMFEditions you’d use namedfile.data to access the data of the current working copy, and pass namedfile._blob as the object for the cache to be stored on (the namedfile is always the same instance for any version, only the _blob changes):

stream = transform_tool.convertTo(
    'text/plain',
    namedfile.data,
    mimetype=namedfile.contentType,
    object=namedfile._blob)
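
The two calls above differ only in the optional object argument, so they can be wrapped in one helper. This is a minimal sketch: the function name extract_plain_text and the cache_on parameter are hypothetical, and the helper only relies on the convertTo() and getData() calls already shown in this document.

```python
def extract_plain_text(transform_tool, data, mimetype, cache_on=None):
    """Convert data to plain text, optionally caching on a persistent object.

    transform_tool is the portal_transforms tool; cache_on is the
    persistent ZODB object the cached result should be stored on.
    """
    kwargs = {"mimetype": mimetype}
    if cache_on is not None:
        # Only pass object= when caching is wanted.
        kwargs["object"] = cache_on
    stream = transform_tool.convertTo("text/plain", data, **kwargs)
    # No stream means no conversion result, so return an empty string.
    return stream.getData() if stream else ""
```

For the CMFEditions case described above, the call would be extract_plain_text(transform_tool, namedfile.data, namedfile.contentType, cache_on=namedfile._blob).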

Stand-alone converter

The code calling Tika is encapsulated in its own class, so if for some reason you don’t want to use the portal_transforms tool, you can also use the converter directly by instantiating it:

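from StringIO import StringIO  # Python 2, as used by Plone 4

from ftw.tika.converter import TikaConverter

# Wrap the document data in a file-like stream object.
data = StringIO('foo')
converter = TikaConverter(path="/path/to/tika-app.jar")
plain_text = converter.convert(data)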

The convert() method accepts either a data string or a file-like stream object. If no path keyword argument is supplied, the converter tries to get the path to the tika-app.jar from the ZCML configuration.
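
Accepting both a data string and a file-like object usually comes down to a small normalization step. The sketch below shows one common way to do this; it is not taken from ftw.tika, and the helper name read_all is hypothetical.

```python
import io


def read_all(data):
    """Return the raw content of a data string or a file-like object."""
    if hasattr(data, "read"):
        # File-like stream: read its remaining content.
        return data.read()
    # Already a data string: use it as-is.
    return data


# Both inputs yield the same content.
from_string = read_all(b"foo")
from_stream = read_all(io.BytesIO(b"foo"))
```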

Error logging

In order to get more detailed error logging when using the Tika JAXRS server, you can launch it with the -includeStack command line option and set the environment variable FTW_TIKA_VERBOSE_LOGGING to something truthy.

ftw.tika will then additionally log the output from Tika (which should contain the Java stack trace) in case of a conversion failure, giving you more information as to why the conversion failed.
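
A check for a "truthy" environment variable could look like the following sketch. Which values ftw.tika itself treats as truthy is not stated here, so the set of falsy strings below is an assumption, and the function name verbose_logging_enabled is hypothetical.

```python
import os


def verbose_logging_enabled(environ=None):
    """Return True if FTW_TIKA_VERBOSE_LOGGING is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get("FTW_TIKA_VERBOSE_LOGGING", "")
    # Assumption: unset, empty and the usual "off" spellings count as
    # false; anything else counts as true.
    return value.strip().lower() not in ("", "0", "false", "no", "off")
```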

Changelog

2.7.0 (2016-03-15)

  • Use requests session. [jone]
  • Make connection timeout configurable. [dready]

2.6.0 (2015-10-27)

  • Add support for Tika 1.11 [lgraf]

2.5.0 (2015-10-27)

  • Add support for Tika 1.10 [lgraf]

2.4.0 (2015-10-27)

  • Add support for Tika 1.9 [lgraf]

2.3.0 (2015-10-27)

  • Fall back to local Tika on any RequestException, not just Timeout. [lgraf]
  • Make use of the fact that Tika JAXRS server now can return the Java stack traces in the response body, allowing ftw.tika to provide better error logging in the case of conversion failures (for example, detecting that conversion failed because a document is password protected). [lgraf]
  • Add support for Tika 1.8 [lgraf]

2.2.0 (2015-10-25)

  • Add support for Tika 1.7 [lgraf]

2.1.0 (2015-10-25)

  • Add support for Tika 1.6 [lgraf]

2.0.1 (2014-12-08)

  • Set a default connection timeout of 10s for requests to Tika JAXRS server. [lgraf]

2.0 (2014-11-24)

  • Switch to Tika JAXRS server component (tika-server). [lgraf]

1.1.2 (2014-09-01)

  • Changed tika source to archive.apache.org. [lknoepfel]
  • Extend integration tests to test conversion of all common formats we claim to support. [lgraf]
  • Updated tika to version 1.5. Updated detection of protected office files. [lknoepfel]

1.1.1 (2014-04-01)

  • Only log a warning on protected PDFs / MS Office documents. [jone]

1.1.0 (2014-03-14)

  • Add support for running Tika as a daemon. The daemon speeds up the conversion from approximately 1.1 seconds per document to 0.06 seconds. [jone]

1.0 (2013-11-29)

  • First implementation. [lgraf]

Download Files

ftw.tika-2.7.0.tar.gz (171.5 kB), source distribution, uploaded Mar 15, 2016
