A modern REST client for Apache Tika server

Tika Rest Client

Features

  • Simplified: no need to handle XML or JSON responses, download a Tika jar file, or support Python 2
  • Support for Tika 2+ only (including Tika v3, which didn't change the API)
  • Based on the modern httpx library
  • Full support for type hinting
  • Nearly full test coverage, with tests run against an actual Tika server on multiple Python and PyPy versions
  • Uses HTTP multipart/form-data to stream files to the server (instead of reading into memory)
  • Optional compression when parsing file content that is already held in a buffer (as opposed to a file on disk)
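
The compression feature for in-memory content can be pictured with a small standalone sketch. It does not use tika-client, and how the library applies compression internally is not shown in this README. The helper name `prepare_compressed_body` is hypothetical, and the sketch assumes the server accepts gzip-compressed request bodies marked with a `Content-Encoding: gzip` header, which is how the Tika server documentation describes it to my understanding.

```python
from __future__ import annotations

import gzip


def prepare_compressed_body(content: bytes, mime_type: str) -> tuple[bytes, dict[str, str]]:
    """Hypothetical helper: gzip an in-memory document and build matching headers."""
    # Compress the whole buffer; this only makes sense for content already in memory
    body = gzip.compress(content)
    headers = {
        # The mime type describes the original, uncompressed document
        "Content-Type": mime_type,
        # Assumption: the server decompresses bodies marked this way
        "Content-Encoding": "gzip",
    }
    return body, headers


# Repetitive content compresses well, which is typical of markup
payload = b"<html><body>Hello Tika</body></html>" * 100
body, headers = prepare_compressed_body(payload, "text/html")
```

Compression trades CPU time for bandwidth, so it is most useful when the server is reached over a slow link and the content is text-like.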

Installation

pip3 install tika-client

Usage

from pathlib import Path
from tika_client import TikaClient

test_file = Path("sample.docx")


with TikaClient("http://localhost:9998") as client:

    # Extract a document's metadata
    metadata = client.metadata.from_file(test_file)

    # Get the content of a document as HTML
    data = client.tika.as_html.from_file(test_file)

    # Or as plain text
    text = client.tika.as_text.from_file(test_file)

    # Content and metadata combined
    data = client.rmeta.as_text.from_file(test_file)

    # The mime type can also be given
    # This allows Content-Type to be set most accurately
    text = client.tika.as_text.from_file(test_file,
                                         "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

The Tika REST API documentation can be found here. At the moment, only the metadata, tika, and recursive metadata (rmeta) endpoints are implemented.

Unfortunately, the set of possible return values of the Tika API is not very well documented. The library makes a best effort to extract relevant fields into typed properties where it understands more about the mime type of the document (as returned by Tika). This includes created and modified timestamps, exposed as time zone aware datetime objects. The full JSON response is always available to the user under the .data attribute.

When a particular key is not present in the response, the corresponding property returns None instead.
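
The behaviour described above (typed properties backed by the raw response, with None for absent keys) can be pictured with a small standalone sketch. It does not use tika-client. The class name `ParsedDocumentSketch` and the choice of the `Content-Type` and `dcterms:created` keys are assumptions made for illustration, and the library's real classes may look different.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any


class ParsedDocumentSketch:
    """Hypothetical wrapper around a Tika JSON response, not the library's class."""

    def __init__(self, data: dict[str, Any]) -> None:
        # The raw JSON response stays available in full
        self.data = data

    def _get_datetime(self, key: str) -> datetime | None:
        value = self.data.get(key)
        if value is None:
            # Missing key: report None rather than raising
            return None
        # Older Python versions do not accept a trailing "Z" in fromisoformat
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        # Assumption for this sketch: a value without an offset is treated as UTC
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed

    @property
    def content_type(self) -> str | None:
        return self.data.get("Content-Type")

    @property
    def created(self) -> datetime | None:
        return self._get_datetime("dcterms:created")


doc = ParsedDocumentSketch(
    {"Content-Type": "application/pdf", "dcterms:created": "2023-05-17T16:30:44Z"},
)
empty = ParsedDocumentSketch({})
```

With this shape, callers check a property against None instead of catching a KeyError, and can still fall back to `doc.data` for any field that has no typed property.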

Why

As far as I know, only one other library for interfacing with Tika exists. I find it too complicated, because it tries to handle many differing use cases.

The biggest issue I have with the library is its downloading and running of a jar file if needed. To me, an API client should only interface to the API and not try to provide functionality to start the API as well. The user is responsible for providing the server with the Tika version they desire.

The library also provides a lot of knobs to turn, but I argue most developers will not want to configure XML as the response type. They just want the data, already parsed to the maximum extent possible.

This library attempts to provide a simpler interface that needs fewer lines of code and returns a typed, parsed response.

License

tika-client is distributed under the terms of the Mozilla Public License 2.0.
