Python client for Apache Tika App
tika-app-python
Overview
tika-app-python is a wrapper for Apache Tika App. With this library you can analyze:
- a file on disk
- a payload in base64
- a file object (such as standard input)
Analyzing a file object requires Apache Tika version 1.17 or later.
Apache 2 Open Source License
tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.
Installation
Clone the repository:
git clone https://github.com/fedelemantuano/tika-app-python.git
and install tika-app-python with setup.py:
cd tika-app-python
python setup.py install
or use pip:
pip install tika-app
Usage in a project
Import the TikaApp class and create a client that points to your Apache Tika App JAR:
from tikapp import TikaApp
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")
To get the content type:
tika_client.detect_content_type("your_file")
To detect the language:
tika_client.detect_language("your_file")
To extract all metadata and content:
tika_client.extract_all_content("your_file")
To extract only the content:
tika_client.extract_only_content("your_file")
To extract only the metadata:
tika_client.extract_only_metadata("your_file")
You can analyze a base64 payload with the same methods by passing the payload argument:
tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
tika_client.extract_only_metadata(payload="base64_payload")
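As a minimal sketch of how such a payload might be prepared, the standard-library base64 module can encode a file's raw bytes. The helper name file_to_base64_payload is hypothetical and not part of tika-app-python. Whether the payload argument expects a str or bytes is not stated above, so returning an ASCII str is an assumption.

```python
import base64


def file_to_base64_payload(path):
    """Read a file's raw bytes and return them base64-encoded as an ASCII str.

    Hypothetical helper for illustration; it is not part of tika-app-python.
    Returning str rather than bytes is an assumption about what the
    payload argument accepts.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    return base64.b64encode(raw).decode("ascii")
```

The result could then be passed along, for example as tika_client.extract_only_content(payload=file_to_base64_payload("your_file")).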
You can also analyze a file object (such as standard input) with the same methods by passing the objectInput argument:
tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")
tika_client.extract_only_metadata(objectInput="objectInput")
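The following sketch shows one way a file object might be supplied, assuming objectInput accepts a real open file such as one returned by open() or sys.stdin. The helper name extract_content_from_path is hypothetical, and opening the file in binary mode is an assumption rather than a documented requirement.

```python
def extract_content_from_path(client, path):
    """Open `path` and pass the open file object to the client.

    `client` is any object exposing extract_only_content(objectInput=...),
    such as the tika_client created earlier. Hypothetical helper, not part
    of tika-app-python; binary mode is an assumption.
    """
    with open(path, "rb") as handle:
        # The file stays open only for the duration of the call.
        return client.extract_only_content(objectInput=handle)
```

With the client from the earlier example, this might be called as extract_content_from_path(tika_client, "your_file").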
Usage from command-line
If you installed tika-app-python with pip or setup.py, you can use it from the command line. You must tell it where the Apache Tika App JAR is. You can:
- set the environment variable TIKA_APP_JAR
- use the --jar switch
The --jar switch overrides the environment variable.
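The precedence between the two options can be sketched as follows. This is an illustration of the rule, not the project's actual code: the helper name resolve_tika_jar and the choice to raise ValueError when neither source is set are assumptions.

```python
import os


def resolve_tika_jar(cli_jar=None, environ=None):
    """Return the JAR path to use, preferring the --jar value over TIKA_APP_JAR.

    Hypothetical helper for illustration; raising ValueError when no JAR
    is configured is an assumption, not documented behavior.
    """
    if environ is None:
        environ = os.environ
    if cli_jar:
        # A value given with --jar wins over the environment variable.
        return cli_jar
    jar = environ.get("TIKA_APP_JAR")
    if not jar:
        raise ValueError("Set TIKA_APP_JAR or use --jar to locate the Tika App JAR")
    return jar
```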
These are all the switches:
usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l] [-m] [-a] [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -k, --stdin           Enable parsing from stdin (default: False)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -m, --metadata        Output only metadata (default: False)
  -a, --all             Output metadata and content from all embedded files
                        (default: False)
  -v, --version         show program's version number and exit
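The (-f FILE | -p PAYLOAD | -k) notation in the usage line means exactly one input source must be given. A minimal argparse sketch of that pattern is shown below. It mirrors only part of the help text and is not the project's actual parser; the function name build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with a required, mutually exclusive input source.

    Illustrative sketch based on the help text above; not the real
    tikapp implementation.
    """
    parser = argparse.ArgumentParser(
        prog="tikapp",
        description="Wrapper for Apache Tika App.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Exactly one of -f, -p, -k must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", help="File to submit")
    source.add_argument("-p", "--payload", help="Base64 payload to submit")
    source.add_argument("-k", "--stdin", action="store_true",
                        help="Enable parsing from stdin")
    parser.add_argument("-j", "--jar", help="Apache Tika app JAR")
    return parser
```

Passing both -f and -k to such a parser makes argparse exit with a usage error, which matches the either-or grouping in the usage line.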
Example from file on disk:
$ tikapp -f example_file -a
Example from standard input:
$ tikapp -a -k < example_file
Performance tests
These are the results of the performance tests in the tests folder:
(Python 2)
tika_content_type() 0.704840 sec
tika_detect_language() 1.592066 sec
magic_content_type() 0.000215 sec
tika_extract_all_content() 0.816366 sec
tika_extract_only_content() 0.788667 sec
(Python 3)
tika_content_type() 0.698357 sec
tika_detect_language() 1.593452 sec
magic_content_type() 0.000226 sec
tika_extract_all_content() 0.785915 sec
tika_extract_only_content() 0.766517 sec
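Timings of this kind could be collected with the standard-library timeit module. The sketch below is not the harness from the tests folder, which is not shown here; the helper name average_seconds and the default repeat count are assumptions.

```python
import timeit


def average_seconds(func, repeats=5):
    """Return the mean wall-clock seconds per call of `func`.

    Hypothetical helper for illustration; it is not the project's
    performance test code.
    """
    # timeit.timeit accepts a zero-argument callable and runs it `repeats` times.
    return timeit.timeit(func, number=repeats) / repeats
```

For example, average_seconds(lambda: tika_client.detect_content_type("your_file")) would time the content type call using the client created earlier.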
Source Distribution
File details
Details for the file tika-app-1.5.0.tar.gz.
File metadata
- Download URL: tika-app-1.5.0.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: Python-urllib/2.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4a9735fc602c13a61a12b99a2b346fe16aa76aba2443c2fec677f97e9dfb6294
MD5 | 2fa7ba7ee9a32d94b78e1b9fddf977bf
BLAKE2b-256 | 9d35a8d799c0dea262181a936dd8c6d8e6095a943bfccf147a16c4e19a101a32