
Apache Tika Server Python client


PyTika


An Apache Tika Server Python client.

Installation

You can install the package from PyPI:

pip install pytika

Usage

This package is a Python client for Apache Tika Server, so you'll need a Tika Server instance running locally. I recommend the Docker image, as that's the simplest option:

docker run --name tika-server -it -d -p 9998:9998 apache/tika:2.7.0-full

After you have that running, you can use PyTika to interface with it.
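Before making any PyTika calls, it can help to confirm that the server is actually reachable. The sketch below is not part of PyTika: `tika_server_is_up` is a hypothetical helper that uses only the standard library, and it assumes the server listens on `http://localhost:9998` (the port mapped in the docker command) and answers on Tika Server's `/version` endpoint.

```python
import http.client
import urllib.request


def tika_server_is_up(base_url: str = "http://localhost:9998", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/version answers with HTTP 200.

    Hypothetical helper, not part of PyTika. The default URL assumes the
    port mapping used in the docker command.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP error statuses all end up here
        # (urllib.error.URLError and HTTPError are subclasses of OSError).
        return False


if __name__ == "__main__":
    print("Tika Server reachable:", tika_server_is_up())
```

If this prints `False`, check that the container is running and that port 9998 is published.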

Metadata queries

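from pytika.api import TikaApi

tika = TikaApi()

# Open the document in binary mode and pass the file object to the client.
with open("yourfile.whatever", "rb") as file:
    metadata = tika.get_meta(file)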

"""
>>> metadata
{
    'Content-Type': 'application/pdf',
    ...
}
"""

Text detection queries

For text detection, Tika Server usually decides on the response type itself (the default is typically XML/HTML). To force it to return plain text (by sending the Accept: text/plain header), pass the following option:

from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt

tika = TikaApi()

with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.AsPlainText()).decode()
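To make the Accept header concrete, here is a rough sketch of the kind of HTTP request a plain-text query corresponds to. It builds a `PUT /tika` request with `Accept: text/plain` using the standard library but does not send it. This is an illustration only: `build_plain_text_request` is a hypothetical name, the base URL is assumed from the docker command, and PyTika's internals may construct the request differently.

```python
import urllib.request


def build_plain_text_request(
    data: bytes, base_url: str = "http://localhost:9998"
) -> urllib.request.Request:
    """Build (but do not send) a PUT /tika request that asks for plain text.

    Hypothetical helper for illustration; not how PyTika is necessarily implemented.
    """
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # the header that forces plain text
    )


request = build_plain_text_request(b"example document bytes")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```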

Notice how the option is passed: opt.AsPlainText() is a function call handed to get_text as an argument. This comes from a Golang convention (functional options) that makes calling complex APIs a little friendlier. Since there are many options, instead of making each one an argument, we define an "option class" with chainable functions. This lets the API validate each option separately, avoids a massive argument list for get_text, and keeps the API code tidy. (For more info: Uptrace, Dave Cheney's post)
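The idea can be sketched in a few lines of Python. Everything below is hypothetical and only illustrates the pattern: `TextOptionsSketch`, `get_text_sketch`, `with_timeout` and the `X-Example-Timeout` header are made-up names, not PyTika's actual GetTextOptionsBuilder or real Tika headers.

```python
from typing import Dict


class TextOptionsSketch:
    """Hypothetical chainable options object; not PyTika's actual class."""

    def __init__(self) -> None:
        self.headers: Dict[str, str] = {}

    def as_plain_text(self) -> "TextOptionsSketch":
        self.headers["Accept"] = "text/plain"
        return self  # returning self is what makes the calls chainable

    def with_timeout(self, seconds: int) -> "TextOptionsSketch":
        # Each option validates its own input, independently of the others.
        if seconds <= 0:
            raise ValueError("timeout must be positive")
        self.headers["X-Example-Timeout"] = str(seconds)  # made-up header name
        return self


def get_text_sketch(data: bytes, options: TextOptionsSketch) -> Dict[str, str]:
    """Stand-in for an API call: returns the headers it would send."""
    return dict(options.headers)


# One options argument replaces a long list of keyword arguments.
headers = get_text_sketch(b"...", TextOptionsSketch().as_plain_text().with_timeout(30))
print(headers)
```

Adding a new option then means adding one method to the options class, while the signature of the API function stays the same.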

For detection in hOCR format with bounding boxes:

from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt

tika = TikaApi()

with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.WithBoundingBoxes()).decode()

There are many more configuration options in the GetTextOptionsBuilder class, with more to come in the future.

Contribution Guide

If you'd like to add features from Tika Server or Tika that are missing here, you can contribute to this repo yourself!

1- Clone the repository

git clone "url from repo, either ssh or https"

2- Create a branch

cd pytika
git switch -c your-new-branch-name

3- Make the necessary changes, commit them, and push to GitHub

git add README.md
git commit -m "Updated README.md with new API changes"
git push -u origin your-new-branch-name

4- Go to your repository on GitHub, where you'll see a "Compare and pull request" button. Click on it.

5- Wait for us to review your PR. We will likely leave comments and hopefully merge it in!

