Fast text and metadata extraction from documents using Apache Tika compiled to native code

iscc-tika Python Bindings

This project provides Python bindings for the iscc-tika library, so that Python applications can extract text and metadata from documents.

Installation

pip install iscc-tika

Usage

Extracting a file to string:

from iscc_tika import Extractor

# Create a new extractor
extractor = Extractor()
# Cap the length of the extracted string
extractor = extractor.set_extract_string_max_length(1000)
# To get XML output instead of plain text:
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)
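
The metadata printed above can be hard to scan in raw form. The following is a minimal sketch of a formatter, assuming the metadata behaves like a dict whose values are either strings or lists of strings; that shape, the helper name `format_metadata`, and the sample keys are assumptions for illustration, not something this README confirms.

```python
def format_metadata(metadata: dict) -> str:
    """Render a metadata mapping as sorted "key: value" lines.

    Assumes dict-like input; list or tuple values are joined with commas.
    """
    lines = []
    for key in sorted(metadata):
        value = metadata[key]
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Hypothetical sample; real keys depend on the document being parsed.
sample = {"resourceName": ["README.md"], "Content-Type": ["text/markdown"]}
formatted = format_metadata(sample)
print(formatted)
```

With a real extraction, the second element of the tuple returned by extract_file_to_string would be passed in place of `sample`.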

Extracting a file (or a URL or bytearray) to a buffered stream:

from iscc_tika import Extractor

extractor = Extractor()
# From a file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# From a URL
# reader, metadata = extractor.extract_url("https://www.google.com")
# From a bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

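# Read the stream in 4096-byte chunks and decode once at the end, so a
# multi-byte UTF-8 character split across two chunks is not corrupted
chunks = bytearray()
buffer = reader.read(4096)
while len(buffer) > 0:
    chunks.extend(buffer)
    buffer = reader.read(4096)
result = chunks.decode("utf-8")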

print(result)
print(metadata)
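
When the extracted text is too large to hold as a single byte buffer, it can be decoded piece by piece instead. Decoding each chunk separately with `bytes.decode` can fail when a multi-byte UTF-8 character straddles a chunk boundary, so this sketch uses the standard library's incremental decoder. The helper name `iter_text` is hypothetical, and the only assumption about the reader is that it has a `read(n)` method returning bytes, as in the loop above; `io.BytesIO` stands in for it here.

```python
import codecs
import io


def iter_text(reader, chunk_size=4096):
    """Yield decoded text from any object with read(n) -> bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers incomplete characters until the next chunk
        text = decoder.decode(bytes(chunk))
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Stand-in reader; 3-byte chunks force two-byte characters to be split.
demo_reader = io.BytesIO("Grüße aus Köln".encode("utf-8"))
demo_text = "".join(iter_text(demo_reader, chunk_size=3))
print(demo_text)
```

Each yielded piece can be written to a file or processed immediately, so memory use stays bounded by the chunk size.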

Extracting a file with OCR:

from iscc_tika import Extractor, TesseractOcrConfig

# The Tesseract language code should match the language of the document
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("eng"))
result, metadata = extractor.extract_file_to_string(
    "../../test_files/documents/eng-ocr.pdf"
)

print(result)
print(metadata)
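
OCR through a Tesseract configuration typically depends on a Tesseract installation on the machine, including the language data for the chosen code. Whether iscc-tika bundles Tesseract is not stated here, so the following pre-flight check is a sketch under the assumption that a `tesseract` executable must be discoverable on the PATH; the helper name is hypothetical.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not tesseract_available():
    # Assumption: without the binary, OCR may fail or return no text
    print("tesseract not found on PATH; OCR extraction may not work")
```

Running this check before creating the extractor gives a clearer message than an empty extraction result.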

Download files

Download the file for your platform.

Source Distribution

iscc_tika-0.5.0.tar.gz (214.1 kB), Source

Built Distributions

iscc_tika-0.5.0-cp310-abi3-win_amd64.whl (49.4 MB), CPython 3.10+, Windows x86-64

iscc_tika-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl (50.5 MB), CPython 3.10+, manylinux: glibc 2.28+ x86-64

iscc_tika-0.5.0-cp310-abi3-macosx_11_0_arm64.whl (44.6 MB), CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file iscc_tika-0.5.0.tar.gz.

File metadata

  • Download URL: iscc_tika-0.5.0.tar.gz
  • Size: 214.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for iscc_tika-0.5.0.tar.gz
Algorithm Hash digest
SHA256 54108d38d76d2e7406bee627106274a33a38518613502dc20fa9fcd7be08aee0
MD5 782540913e9cdb0652a8659af8b59231
BLAKE2b-256 4c657d36c6e84aa1b824d73934e9da1a0f4e9b47d7dacf6f967ad8202f6d3e53

File details

Details for the file iscc_tika-0.5.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: iscc_tika-0.5.0-cp310-abi3-win_amd64.whl
  • Size: 49.4 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for iscc_tika-0.5.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 eb3de4f0915940255a30bd8c14b360c43ac8c71c7e9ab1d370b6cf3743f2bba0
MD5 5b86b35b7bb19424c3d9808a2496a7d5
BLAKE2b-256 e34d34084aa952365669aa8103a950a3d7d3b1de6f0cb3be1c5e80ca36ca312c

File details

Details for the file iscc_tika-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for iscc_tika-0.5.0-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 2a9a1dd4807494ae75c8cfb68d848114527475dc880fdd94ceb2e32120376016
MD5 056adae50a1a31dd90a521a2945a1623
BLAKE2b-256 ce72bf788efdc0d6a4e3a202b1e1e7f935151cd0da7539fa40af34bb937ab15c

File details

Details for the file iscc_tika-0.5.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for iscc_tika-0.5.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e2ad34f1997446feade97ffd03ce4f7018a644ff14f31bf07b4755c1eb6eb5c5
MD5 9220d9f220c8dfa03b40afb4b6682457
BLAKE2b-256 2cdc8966769543a3127e86805e65b88f9ecd174a63e2e04a926a3b0b0e092d26
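
A downloaded distribution file can be checked against the SHA256 digests listed above. The sketch below uses only the standard library's `hashlib`; the helper names and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was saved to the current directory:
# matches_published(
#     "iscc_tika-0.5.0.tar.gz",
#     "54108d38d76d2e7406bee627106274a33a38518613502dc20fa9fcd7be08aee0",
# )
```

pip can also enforce digests at install time through its hash-checking mode when they are pinned in a requirements file.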