Fast text and metadata extraction from documents using Apache Tika compiled to native code

iscc-tika Python Bindings

This package provides Python bindings for the iscc-tika library, so you can use its text and metadata extraction in your Python applications.

Installation

pip install iscc-tika
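
To confirm that the install succeeded, one option is to ask the standard library for the installed distribution's version. This is a minimal sketch: the helper name `installed_version` is hypothetical and not part of iscc-tika, and it only uses `importlib.metadata` from the standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if iscc-tika is not installed.
print(installed_version("iscc-tika"))
```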

Usage

Extracting a file to string:

from iscc_tika import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# To get XML output instead of plain text, uncomment:
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)
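
When several files need processing, the same call can be wrapped in a small loop that keeps going if one file fails. The sketch below is illustrative only: `extract_many` is a hypothetical helper, not part of iscc-tika, and it catches the broad `Exception` because the exception types raised by the library are not documented here.

```python
def extract_many(extractor, paths):
    """Map each path to (text, metadata), or to the exception it raised."""
    results = {}
    for path in paths:
        try:
            # Same call as in the example above; returns (result, metadata).
            results[str(path)] = extractor.extract_file_to_string(str(path))
        except Exception as exc:  # assumption: exact error types are unknown
            results[str(path)] = exc
    return results


# Usage with the real extractor (requires iscc-tika to be installed):
# from iscc_tika import Extractor
# results = extract_many(Extractor(), ["README.md", "tests/quarkus.pdf"])
```

Passing the extractor in as an argument lets one configured instance be reused for every file.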

Extracting a file, URL, or bytearray to a buffered stream:

from iscc_tika import Extractor

extractor = Extractor()
# From a local file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# From a URL
# reader, metadata = extractor.extract_url("https://www.google.com")
# From a bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

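# Collect the raw chunks and decode once at the end, so a multi-byte
# UTF-8 character split across two reads cannot break decoding
chunks = []
buffer = reader.read(4096)
while len(buffer) > 0:
    chunks.append(bytes(buffer))
    buffer = reader.read(4096)
result = b"".join(chunks).decode("utf-8")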

print(result)
print(metadata)
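
Decoding chunks one at a time can fail when a multi-byte UTF-8 character straddles a chunk boundary. An incremental decoder avoids that while still producing text as the stream is read. The sketch below assumes only that the reader exposes `read(n)` returning bytes-like data, as in the example above. `read_text` is a hypothetical helper, and `io.BytesIO` stands in for the real reader.

```python
import codecs
import io


def read_text(reader, chunk_size: int = 4096) -> str:
    """Read a binary stream to the end and decode it as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        # The incremental decoder holds back an incomplete multi-byte
        # sequence until the next chunk completes it.
        parts.append(decoder.decode(bytes(chunk)))
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Stand-in for the reader returned by extract_file(): any object with read(n).
demo = io.BytesIO("naïve café".encode("utf-8"))
print(read_text(demo, chunk_size=3))
```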

Extracting a file with OCR:

from iscc_tika import Extractor, TesseractOcrConfig

# "deu" selects German; use the Tesseract language code that matches the document
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string(
    "../../test_files/documents/eng-ocr.pdf"
)

print(result)
print(metadata)


Download files

Download the file for your platform.

Source Distribution

iscc_tika-0.4.0.tar.gz (208.2 kB): Source

Built Distributions

iscc_tika-0.4.0-cp310-abi3-win_amd64.whl (41.9 MB): CPython 3.10+, Windows x86-64

iscc_tika-0.4.0-cp310-abi3-manylinux_2_28_x86_64.whl (43.0 MB): CPython 3.10+, manylinux: glibc 2.28+ x86-64

iscc_tika-0.4.0-cp310-abi3-macosx_11_0_arm64.whl (48.5 MB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file iscc_tika-0.4.0.tar.gz.

File metadata

  • Download URL: iscc_tika-0.4.0.tar.gz
  • Size: 208.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for iscc_tika-0.4.0.tar.gz
Algorithm Hash digest
SHA256 49b7b409343489a14eb2d3085fc39eb850aa5ffae4fbde652c7c0a6e58374d6e
MD5 5155e9203abc4e7d3456f04f19a580cd
BLAKE2b-256 cf2fdf546b83ebaf8e3c2db0d9e24a7d07623cb0b0cca2b06023e21c54528c8e
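
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the commented usage assumes the source distribution was saved to the current directory under its original name.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the listing above for iscc_tika-0.4.0.tar.gz.
EXPECTED = "49b7b409343489a14eb2d3085fc39eb850aa5ffae4fbde652c7c0a6e58374d6e"

# Usage, assuming the file is in the current directory:
# assert sha256_of("iscc_tika-0.4.0.tar.gz") == EXPECTED
```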

File details

Details for the file iscc_tika-0.4.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: iscc_tika-0.4.0-cp310-abi3-win_amd64.whl
  • Size: 41.9 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for iscc_tika-0.4.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 9a5dbf8342eb5b799f114014ee0edbfb89eebe45b9454c54be9233e0b1aba12e
MD5 6e8a7de5564b9ca25012c8518fd4cc49
BLAKE2b-256 d7af391ff7d2a9621160aa293e607eeb730b0b07821962e86642a42109a59cce

File details

Details for the file iscc_tika-0.4.0-cp310-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for iscc_tika-0.4.0-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 78b1c62621a46062d1e9a54fa7c803e08788d8d67d23e6dac5cc3776affbc302
MD5 b7a6740ede67ed4f14789f52feb2576b
BLAKE2b-256 3507b21caf76a2b2cf0b8baf9d992e6564762b58ae9589f73f1fc735ce8c6221

File details

Details for the file iscc_tika-0.4.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for iscc_tika-0.4.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 352e436f17cb8b3979ccc8635686d8ef1083f9b5197a1f815337e6fa8e421fbb
MD5 e7b8f1cbb26b8d97fad80165f3bd7892
BLAKE2b-256 feb9b50e443937287b23e4672cb5d53f59d5f77dfcc12adaaebd98c69cb86853
