Apache Tika Server python client

PyTika

An Apache Tika Server python client.

Installation

You can install the package from PyPI:

pip install pytika

Usage

This package is a Python client for Apache Tika Server, so you need a Tika Server instance running locally. I recommend the Docker image, as that is the simplest way to get one:

docker run --name tika-server -it -d -p 9998:9998 apache/tika:2.7.0-full

After you have that running, you can use PyTika to interface with it.
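Before calling the client, it can help to confirm that the server is actually listening. The following is a minimal sketch that uses only the standard library and is independent of PyTika. The helper name tika_is_reachable is hypothetical, and the base URL assumes the port mapping from the docker command above. It relies on Tika Server answering a plain GET on /tika with a short greeting.

```python
import urllib.request


def tika_is_reachable(base_url: str = "http://localhost:9998", timeout: float = 3.0) -> bool:
    """Return True if a GET to <base_url>/tika answers with HTTP 200.

    Hypothetical helper for illustration; it is not part of PyTika.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/tika", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP error statuses all end up
        # here (urllib.error.URLError and HTTPError derive from OSError).
        return False


if __name__ == "__main__":
    print("Tika Server reachable:", tika_is_reachable())
```

If this prints False right after starting the container, the server may still be booting, so retrying after a few seconds is reasonable.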

Metadata queries

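from pytika.api import TikaApi

tika = TikaApi()

# Open the document in binary mode and ask Tika Server for its metadata.
with open("yourfile.whatever", "rb") as file:
    metadata = tika.get_meta(file)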

"""
>>> metadata
{
    'Content-Type': 'application/pdf',
    ...
}
"""

Text detection queries

For text detection, Tika Server normally chooses the response type itself (XML/HTML is typically the default). To force it to return plain text (the Accept: text/plain header), pass the following option:

from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt

tika = TikaApi()

with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.AsPlainText()).decode()

Notice the unusual configuration style: the option is passed as a function call. This follows the functional options pattern that is popular in Go, which makes calling complex APIs a little friendlier. Since there are a lot of options, instead of making each one an argument, we define an "option class" with chainable functions. This lets the API validate each option separately, avoids a massive list of arguments for get_text, and tidies up the API code. (For more info: Uptrace, Dave Cheney's post)
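To make the idea of an "option class" with chainable functions concrete, here is a small self-contained sketch. It is not PyTika's implementation: ExampleOptionsBuilder, with_header, and describe_request are hypothetical names invented for illustration, and the real GetTextOptionsBuilder may be structured differently.

```python
from typing import Dict


class ExampleOptionsBuilder:
    """Hypothetical chainable options builder (not PyTika's real class)."""

    def __init__(self) -> None:
        self.headers: Dict[str, str] = {}

    def as_plain_text(self) -> "ExampleOptionsBuilder":
        # Ask the server for plain text instead of its default markup.
        self.headers["Accept"] = "text/plain"
        return self  # returning self is what makes the calls chainable

    def with_header(self, name: str, value: str) -> "ExampleOptionsBuilder":
        # Each option validates its own input, so mistakes surface early.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid header name: {name!r}")
        self.headers[name] = value
        return self


def describe_request(options: ExampleOptionsBuilder) -> Dict[str, str]:
    """Stand-in for an API call: returns the headers it would send."""
    return dict(options.headers)


if __name__ == "__main__":
    options = ExampleOptionsBuilder().as_plain_text().with_header("X-Example", "1")
    print(describe_request(options))
```

The API function takes a single options object however many settings exist, and adding a new option means adding one method rather than changing the function signature.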

For detection in HOCR format with bounding boxes:

from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt

tika = TikaApi()

with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.WithBoundingBoxes()).decode()
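Once the hOCR markup is decoded to a string, the bounding boxes still have to be pulled out of it. The sketch below shows one way to do that with the standard library, assuming the output follows the usual hOCR convention of word elements with class "ocrx_word" and a title attribute such as "bbox x0 y0 x1 y1". The sample string is made up for illustration, and the exact markup Tika returns may differ.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract the 'bbox x0 y0 x1 y1' property from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(p) for p in parts[1:])
            return (x0, y0, x1, y1)
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class "ocrx_word"."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None
        self._word_tag = ""
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            # Words without a parsable bbox are skipped (bbox stays None).
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._word_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is not None and tag == self._word_tag:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


if __name__ == "__main__":
    sample = (
        '<div class="ocr_page">'
        '<span class="ocrx_word" title="bbox 10 20 110 45; x_wconf 93">Hello</span> '
        '<span class="ocrx_word" title="bbox 120 20 200 45">Tika</span>'
        "</div>"
    )
    parser = HocrWordParser()
    parser.feed(sample)
    print(parser.words)
```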

There are many more configuration options that you can look into in the GetTextOptionsBuilder class, and more to come in the future.
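For readers curious what the Accept: text/plain header mentioned earlier looks like on the wire, here is a sketch of the equivalent raw request built with the standard library. It assumes Tika Server's PUT /tika endpoint on the default port from the docker command. build_text_request is a hypothetical helper, and whether PyTika sends exactly this request internally is an assumption.

```python
import urllib.request


def build_text_request(
    document: bytes, base_url: str = "http://localhost:9998"
) -> urllib.request.Request:
    """Build (but do not send) a PUT /tika request asking for plain text.

    Hypothetical helper for illustration; it is not part of PyTika.
    """
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


if __name__ == "__main__":
    request = build_text_request(b"example bytes")
    # Sending would be: urllib.request.urlopen(request).read().decode()
    print(request.get_method(), request.full_url, request.header_items())
```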

Contribution Guide

If you'd like to add features from Tika Server or Tika that are missing here, you can contribute to this repo yourself!

1- Clone the repository

git clone "url from repo, either ssh or https"

2- Create a branch

cd pytika
git switch -c your-new-branch-name

3- Make the necessary changes, commit them, and push to GitHub

git add README.md
git commit -m "Updated README.md with new API changes"
git push -u origin your-new-branch-name

4- Go to your repository on GitHub and click the Compare and pull request button.

5- Wait for us to review your PR, likely leave comments, and hopefully merge it in!
