Apache Tika Server Python client
PyTika
An Apache Tika Server Python client.
Installation
Install the package from PyPI:
pip install pytika
Usage
PyTika is a Python client for Apache Tika Server, so you need a Tika Server instance running and reachable from your machine.
The simplest way to get one is the Docker image:
docker run --name tika-server -it -d -p 9998:9998 apache/tika:2.7.0-full
Once the server is running, you can use PyTika to talk to it.
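Before making any calls, it can help to confirm that the server is reachable. The sketch below is not part of PyTika. It uses only the standard library and assumes the server listens on http://localhost:9998 (the port mapped in the docker command above) and exposes Tika Server's /version endpoint. The function name tika_version and its opener parameter are hypothetical helpers for illustration.

```python
import urllib.error
import urllib.request


def tika_version(base_url="http://localhost:9998", opener=urllib.request.urlopen):
    """Return the server's version string, or None if it cannot be reached.

    `opener` is injectable so the function can be exercised without a server.
    """
    try:
        # GET <base_url>/version returns a short plain-text version banner.
        with opener(f"{base_url}/version", timeout=5) as response:
            return response.read().decode().strip()
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return None


# Example: print(tika_version() or "Tika Server is not reachable")
```

If this returns None, check that the container is up (for example with docker ps) before debugging the client code.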
Metadata queries
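from pytika.api import TikaApi
tika = TikaApi()
# Open the document in binary mode and pass the file object to the client.
with open("yourfile.whatever", "rb") as file:
    metadata = tika.get_meta(file)
"""
>>> metadata
{
    'Content-Type': 'application/pdf',
    ...
}
"""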
Text detection queries
For text detection, Tika Server chooses the response format itself, and the default is typically XML/HTML. To force plain text (by sending an Accept: text/plain header), pass the following option:
from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt
tika = TikaApi()
with open("yourfile.whatever", "rb") as file:
    # get_text returns bytes, so decode the result to get a str.
    text = tika.get_text(file, opt.AsPlainText()).decode()
Note the unusual configuration style: an option is passed as a function call. This is borrowed from a Go idiom, often called functional options, that makes calling complex APIs a little friendlier. Because there are many options, an "option class" with chainable functions replaces what would otherwise be a long list of get_text arguments. This lets the API validate each option separately and keeps the API code tidy. (For more information, see the Uptrace article and Dave Cheney's post.)
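To make the idea concrete, here is a minimal, self-contained sketch of a chainable options class. It is an illustration only and is not PyTika's actual implementation: TextOptions, as_plain_text, with_timeout, and the stand-in get_text function are all hypothetical names.

```python
class TextOptions:
    """Hypothetical chainable options holder (not PyTika's real class)."""

    def __init__(self):
        self.headers = {}
        self.timeout = None

    def as_plain_text(self):
        # Ask the server for plain text instead of its default markup.
        self.headers["Accept"] = "text/plain"
        return self  # returning self is what makes the calls chainable

    def with_timeout(self, seconds):
        # Each option validates its own input in one place.
        if seconds <= 0:
            raise ValueError("timeout must be positive")
        self.timeout = seconds
        return self


def get_text(data, options=None):
    """Stand-in for an API call: reports what would be sent."""
    options = options or TextOptions()
    return {
        "size": len(data),
        "headers": dict(options.headers),
        "timeout": options.timeout,
    }


# One argument carries any number of options.
result = get_text(b"abc", TextOptions().as_plain_text().with_timeout(30))
```

The signature of get_text stays the same no matter how many options are added later, which is the main appeal of the approach.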
To get the detected text in hOCR format with bounding boxes:
from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt
tika = TikaApi()
with open("yourfile.whatever", "rb") as file:
    # The decoded result is hOCR markup that includes bounding boxes.
    text = tika.get_text(file, opt.WithBoundingBoxes()).decode()
The GetTextOptionsBuilder class offers many more configuration options, with more to come in the future.
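For reference, the plain-text option described earlier corresponds to an ordinary HTTP header on the request sent to Tika Server. The sketch below builds such a request with the standard library, without PyTika and without sending it. It assumes the server's PUT /tika endpoint at http://localhost:9998; the function name build_text_request is hypothetical, and how PyTika builds its requests internally may differ.

```python
import urllib.request


def build_text_request(data, base_url="http://localhost:9998", plain_text=True):
    """Build (but do not send) a PUT request to Tika Server's /tika endpoint."""
    # With no Accept header the server picks its default response format.
    headers = {"Accept": "text/plain"} if plain_text else {}
    return urllib.request.Request(
        f"{base_url}/tika", data=data, headers=headers, method="PUT"
    )


request = build_text_request(b"example document bytes")
```

With a server running, urllib.request.urlopen(request).read().decode() would send the request and return the extracted text.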
Contribution Guide
If you would like to add features that exist in Tika Server or Tika but are missing here, you can contribute to this repo yourself!
1- Clone the repository
git clone "url from repo, either ssh or https"
2- Create a branch
cd pytika
git switch -c your-new-branch-name
3- Make the necessary changes, commit them, and push to GitHub
git add README.md
git commit -m "Updated README.md with new API changes"
git push -u origin your-new-branch-name
4- Go to your repository on GitHub and click the Compare and pull request button.
5- Wait for us to review your PR, likely leave comments, and hopefully merge it in!