Python client for Apache Tika App
tika-app-python
Overview
tika-app-python is a wrapper for Apache Tika App.
Apache 2 Open Source License
tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.
Installation
Clone the repository:
git clone https://github.com/fedelemantuano/tika-app-python.git
and install tika-app-python with setup.py:
cd tika-app-python
python setup.py install
or use pip:
pip install tika-app
Usage in a project
Import the TikaApp class and create a client that points to the Apache Tika App JAR:
from tikapp import TikaApp
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.14.jar")
To get the content type:
tika_client.detect_content_type("your_file")
To detect the language:
tika_client.detect_language("your_file")
To extract all metadata and content:
tika_client.extract_all_content("your_file")
To extract only the content:
tika_client.extract_only_content("your_file")
If you want to submit a base64 payload instead of a file path, call the same methods with the payload argument:
tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
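The payload variants expect the file content already encoded in base64. A minimal sketch of preparing such a payload with the standard library might look like this. `file_to_base64` and `describe_file` are hypothetical helper names, not part of tika-app-python, and passing the payload as a text string rather than bytes is an assumption.

```python
import base64


def file_to_base64(path):
    """Read a file and return its content as a base64 text string."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


def describe_file(client, path):
    """Call two of the payload variants on a TikaApp-like client.

    `client` is assumed to expose detect_content_type() and
    detect_language(), both accepting a `payload` keyword argument.
    """
    payload = file_to_base64(path)
    return {
        "content_type": client.detect_content_type(payload=payload),
        "language": client.detect_language(payload=payload),
    }


# Usage (requires tika-app-python and the Apache Tika App JAR):
# from tikapp import TikaApp
# tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.14.jar")
# print(describe_file(tika_client, "your_file"))
```

Keeping the client as a parameter means the helper can be exercised with a stub object when the JAR is not available.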
Usage from command-line
If you installed tika-app-python with pip or setup.py, you can use it from the command line. To use tika-app-python you must provide the Apache Tika App JAR. You can:
- leave the default value: /opt/tika/tika-app-1.14.jar
- set the environment variable TIKA_APP_JAR
- use the --jar switch
The last one overrides all the others.
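The precedence described above can be sketched as a small function. This is an illustration only, not the package's actual code: `resolve_jar` and `DEFAULT_JAR` are hypothetical names, and the assumption that TIKA_APP_JAR takes priority over the default value is inferred from the list rather than stated outright.

```python
import os

# Default location named in the text above.
DEFAULT_JAR = "/opt/tika/tika-app-1.14.jar"


def resolve_jar(cli_jar=None, environ=None):
    """Pick the JAR path: --jar value first, then TIKA_APP_JAR, then the default."""
    environ = os.environ if environ is None else environ
    if cli_jar:
        # An explicit --jar value overrides everything else.
        return cli_jar
    # Fall back to the environment variable, then to the default path.
    return environ.get("TIKA_APP_JAR") or DEFAULT_JAR
```

For example, `resolve_jar("/tmp/tika.jar")` returns the explicit path regardless of what TIKA_APP_JAR contains.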
These are all the switches:
usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a] [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -a, --all             Output metadata and content from all embedded files
                        (default: False)
  -v, --version         show program's version number and exit
Example:
```shell
$ tikapp -f example_file -a
```
Performance tests
These are the results of performance tests in tests folder:
(Python 2)
tika_content_type()          0.704840 sec
tika_detect_language()       1.592066 sec
magic_content_type()         0.000215 sec
tika_extract_all_content()   0.816366 sec
tika_extract_only_content()  0.788667 sec

(Python 3)
tika_content_type()          0.698357 sec
tika_detect_language()       1.593452 sec
magic_content_type()         0.000226 sec
tika_extract_all_content()   0.785915 sec
tika_extract_only_content()  0.766517 sec
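The scripts in the tests folder are not reproduced here. As a rough sketch of how per-call timings like these could be collected with the standard library, one might use `timeit`. `average_seconds` is a hypothetical helper, not the project's benchmark code, and it uses Python 3 syntax.

```python
import timeit


def average_seconds(func, *args, repeat=5, **kwargs):
    """Return the mean wall-clock time in seconds of func(*args, **kwargs)."""
    total = timeit.timeit(lambda: func(*args, **kwargs), number=repeat)
    return total / repeat


# Usage (requires tika-app-python and the Apache Tika App JAR):
# from tikapp import TikaApp
# tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.14.jar")
# print(average_seconds(tika_client.detect_content_type, "your_file"))
```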