The metadata and text content extractor for almost every file type.


Tikara



🚀 Overview

Tikara is a modern, type-hinted Python wrapper for Apache Tika, supporting over 1600 file formats for content extraction, metadata analysis, and language detection. It provides direct JNI integration through JPype for optimal performance.

from tikara import Tika

tika = Tika()
content, metadata = tika.parse("document.pdf")

⚡️ Key Features

  • Modern Python 3.12+ with complete type hints
  • Direct JVM integration via JPype (no HTTP server required)
  • Streaming support for large files
  • Recursive document unpacking
  • Language detection
  • MIME type detection
  • Custom parser and detector support
  • Comprehensive metadata extraction
  • Ships with an embedded Tika JAR: works in air-gapped networks, with no external libraries to manage.
  • Opinionated Pydantic wrapper over Tika's metadata model, with access to the raw metadata.

📦 Supported Formats

🌈 1682 supported media types and counting!

🛠️ Installation

pip install tikara

System Dependencies

Required Dependencies

  • Python 3.12+
  • Java Development Kit 11+ (OpenJDK recommended)

Optional Dependencies

Image and PDF OCR Enhancements (recommended)
  • Tesseract OCR (strongly recommended if you process images)

    # Ubuntu
    apt-get install tesseract-ocr
    

    Additional language packs for Tesseract (optional):

    # Ubuntu
    apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita tesseract-ocr-spa
    
  • ImageMagick for advanced image processing

    # Ubuntu
    apt-get install imagemagick
    
Multimedia Enhancements (recommended)
  • FFmpeg for enhanced multimedia file support

    # Ubuntu
    apt-get install ffmpeg
    
Enhanced PDF Support (recommended)
  • PDFBox for enhanced PDF support

Metadata Enhancements (recommended)
  • EXIFTool for metadata extraction from images

    # Ubuntu
    apt-get install libimage-exiftool-perl
    
Geospatial Enhancements
  • GDAL for geospatial file support

    # Ubuntu
    apt-get install gdal-bin
    
Additional Font Support (recommended)
  • MSCore Fonts for enhanced Office file handling

    # Ubuntu
    apt-get install xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer
    

For more OS dependency information including MSCore fonts setup and additional configuration, see the official Apache Tika Dockerfile.

📖 Usage

Example Jupyter Notebooks 📔

Basic Content Extraction

from tikara import Tika
from pathlib import Path

tika = Tika()

# Basic string output
content, metadata = tika.parse("document.pdf")

# Stream large files
stream, metadata = tika.parse(
    "large.pdf",
    output_stream=True,
    output_format="txt"
)

# Save to file
output_path, metadata = tika.parse(
    "input.docx",
    output_file=Path("output.txt"),
    output_format="txt"
)

Language Detection

from tikara import Tika

tika = Tika()
result = tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso")
print(f"Language: {result.language}, Confidence: {result.confidence}")

MIME Type Detection

from tikara import Tika

tika = Tika()
mime_type = tika.detect_mime_type("unknown_file")
print(f"Detected type: {mime_type}")
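For contrast, the standard library's extension-based mimetypes module shows why byte-level detection is useful: it can only guess from the file name, so a file without an extension defeats it entirely, whereas Tika inspects the content:

```python
import mimetypes

# Extension-based guessing works only when the name carries a hint.
print(mimetypes.guess_type("report.pdf")[0])    # application/pdf
print(mimetypes.guess_type("unknown_file")[0])  # None - no extension, no guess
```

Tika's detector reads magic bytes from the file itself, which is why the earlier example can classify a file literally named "unknown_file".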

Recursive Document Unpacking

from tikara import Tika
from pathlib import Path

tika = Tika()
results = tika.unpack(
    "container.docx",
    output_dir=Path("extracted"),
    max_depth=3
)

for item in results:
    print(f"Extracted {item.metadata['Content-Type']} to {item.file_path}")
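Once unpack has written the embedded files under output_dir, ordinary pathlib code can post-process them. A small sketch; the group_by_suffix helper is hypothetical, and a temporary directory with hand-made files stands in for real unpack output:

```python
import tempfile
from pathlib import Path


def group_by_suffix(output_dir: Path) -> dict[str, list[Path]]:
    """Bucket every extracted file by its suffix for later routing."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(output_dir.rglob("*")):
        if path.is_file():
            groups.setdefault(path.suffix or "<none>", []).append(path)
    return groups


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fake "extracted" files standing in for unpack() output.
    (root / "embedded.png").write_bytes(b"\x89PNG")
    (root / "notes.txt").write_text("hello")
    groups = group_by_suffix(root)
    print(sorted(groups))  # ['.png', '.txt']
```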

🔧 Development

Environment Setup

  1. Ensure that you have the system dependencies installed

  2. Install uv:

    pip install uv
    
  3. Install Python dependencies and create the virtual environment: uv sync

Common Tasks

make ruff        # Format and lint code
make test        # Run test suite
make docs        # Generate documentation
make stubs       # Generate Java stubs
make prepush     # Run all checks (ruff, test, coverage, safety)

🤔 When to Use Tikara

Ideal Use Cases

  • Python applications needing document processing
  • Microservices and containerized environments
  • Data processing pipelines (Ray, Dask, Prefect)
  • Applications requiring direct Tika integration without HTTP overhead

Advanced Usage

For detailed documentation on:

  • Custom parser implementation
  • Custom detector creation
  • MIME type handling

See the Example Jupyter Notebooks 📔

🎯 Inspiration

Tikara builds on the shoulders of giants:

  • Apache Tika - The powerful content detection and extraction toolkit
  • tika-python - The original Python Tika wrapper using HTTP that inspired this project
  • JPype - The bridge between Python and Java

Considerations

  • Process isolation: Tika runs in-process, so a JVM crash can take down the host application
  • Memory management: large documents require careful handling
  • JVM startup: the first operation pays a one-time JVM startup cost
  • Custom implementations: writing parsers or detectors requires familiarity with Tika's Java interfaces

📊 Performance Considerations

Memory Management

  • Use streaming for large files
  • Monitor JVM heap usage
  • Consider process isolation for critical applications
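The streaming advice amounts to consuming output in bounded chunks rather than materializing one large string. A minimal sketch, with io.StringIO standing in for the stream returned by tika.parse(..., output_stream=True) and count_chars as an invented consumer:

```python
import io


def count_chars(stream, chunk_size: int = 64 * 1024) -> int:
    """Consume a text stream in fixed-size chunks so peak memory stays bounded."""
    total = 0
    while chunk := stream.read(chunk_size):
        total += len(chunk)
    return total


# io.StringIO stands in for the stream a large parse would return.
fake_stream = io.StringIO("x" * 150_000)
print(count_chars(fake_stream))  # 150000
```

The same pattern applies to any sink: write each chunk to a file or pipeline stage instead of accumulating, and memory use stays proportional to chunk_size rather than document size.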

Optimization Tips

  • Reuse Tika instances
  • Use appropriate output formats
  • Implement custom parsers for specific needs
  • Configure JVM parameters for your use case

🔐 Security Considerations

  • Input validation
  • Resource limits
  • Secure file handling
  • Access control for extracted content
  • Careful handling of custom parsers
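Input validation and resource limits can be applied before a file ever reaches the parser. A hypothetical pre-parse guard; the size cap, suffix allow-list, and validate_input name are illustrative choices, not Tikara defaults:

```python
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024            # illustrative per-file cap
ALLOWED_SUFFIXES = {".pdf", ".docx", ".txt"}  # illustrative allow-list


def validate_input(path: Path) -> None:
    """Reject a file before handing it to the parser."""
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"disallowed file type: {path.suffix!r}")
    if path.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} byte limit")
```

Calling validate_input(path) ahead of tika.parse(path) turns the first two checklist items into a single fail-fast step.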

🤝 Contributing

Contributions welcome! The project uses Make for development tasks:

make prepush     # Run all checks (format, lint, test, coverage, safety)

For developing custom parsers/detectors, Java stubs can be generated:

make stubs       # Generate Java stubs for Apache Tika interfaces

Note: Generated stubs are git-ignored but provide IDE support and type hints when implementing custom parsers/detectors.

Common Problems

  • Verify Java installation and JAVA_HOME environment variable
  • Ensure Tesseract and required language packs are installed
  • Check file permissions and paths
  • Monitor memory usage when processing large files
  • Use streaming output for large documents
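The first two checks can be automated with a small diagnostic helper run before constructing Tika. This sketch uses only the standard library; the environment_issues name is invented:

```python
import os
import shutil


def environment_issues() -> list[str]:
    """Collect likely causes of startup failures before constructing Tika."""
    issues: list[str] = []
    if not os.environ.get("JAVA_HOME"):
        issues.append("JAVA_HOME is not set")
    if shutil.which("java") is None:
        issues.append("no 'java' executable on PATH")
    if shutil.which("tesseract") is None:
        issues.append("tesseract not found (image OCR will be degraded)")
    return issues


for issue in environment_issues():
    print(f"WARNING: {issue}")
```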

📚 Reference

See API Documentation for complete details.

📄 License

Apache License 2.0 - See LICENSE for details.
