The metadata and text content extractor for almost every file type.


Tikara



🚀 Overview

Tikara is a modern, type-hinted Python wrapper for Apache Tika, supporting over 1600 file formats for content extraction, metadata analysis, and language detection. It provides direct JNI integration through JPype for optimal performance.

from tikara import Tika

tika = Tika()
content, metadata = tika.parse("document.pdf")

⚡️ Key Features

  • Modern Python 3.12+ with complete type hints
  • Direct JVM integration via JPype (no HTTP server required)
  • Streaming support for large files
  • Recursive document unpacking
  • Language detection
  • MIME type detection
  • Custom parser and detector support
  • Comprehensive metadata extraction
  • Ships with an embedded Tika JAR: works in air-gapped networks, with no external libraries to manage.
  • Opinionated Pydantic wrapper over Tika's metadata model, with access to the raw metadata.

📦 Supported Formats

🌈 1682 supported media types and counting!

🛠️ Installation

pip install tikara

System Dependencies

Required Dependencies

  • Python 3.12+
  • Java Development Kit 11+ (OpenJDK recommended)

Optional Dependencies

Image and PDF OCR Enhancements (recommended)
  • Tesseract OCR (strongly recommended if you process images)

    # Ubuntu
    apt-get install tesseract-ocr
    

    Additional language packs for Tesseract (optional):

    # Ubuntu
    apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita tesseract-ocr-spa
    
  • ImageMagick for advanced image processing

    # Ubuntu
    apt-get install imagemagick
    
Multimedia Enhancements (recommended)
  • FFmpeg for enhanced multimedia file support

    # Ubuntu
    apt-get install ffmpeg
    
Enhanced PDF Support (recommended)
  • PDFBox for enhanced PDF support

Metadata Enhancements (recommended)
  • EXIFTool for metadata extraction from images

    # Ubuntu
    apt-get install libimage-exiftool-perl
    
Geospatial Enhancements
  • GDAL for geospatial file support

    # Ubuntu
    apt-get install gdal-bin
    
Additional Font Support (recommended)
  • MSCore Fonts for enhanced Office file handling

    # Ubuntu
    apt-get install xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer
    

For more OS dependency information including MSCore fonts setup and additional configuration, see the official Apache Tika Dockerfile.

📖 Usage

Example Jupyter Notebooks 📔

Basic Content Extraction

from tikara import Tika
from pathlib import Path

tika = Tika()

# Basic string output
content, metadata = tika.parse("document.pdf")

# Stream large files
stream, metadata = tika.parse(
    "large.pdf",
    output_stream=True,
    output_format="txt"
)

# Save to file
output_path, metadata = tika.parse(
    "input.docx",
    output_file=Path("output.txt"),
    output_format="txt"
)

Language Detection

from tikara import Tika

tika = Tika()
result = tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso")
print(f"Language: {result.language}, Confidence: {result.confidence}")

MIME Type Detection

from tikara import Tika

tika = Tika()
mime_type = tika.detect_mime_type("unknown_file")
print(f"Detected type: {mime_type}")
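For contrast, the standard library's extension-based mimetypes module shows why byte-level detection is useful: it can only guess from the file name, so a file without an extension defeats it entirely, whereas Tika inspects the content:

```python
import mimetypes

# Extension-based guessing works only when the name carries a hint.
print(mimetypes.guess_type("report.pdf")[0])    # application/pdf
print(mimetypes.guess_type("unknown_file")[0])  # None - no extension, no guess
```

Tika's detector reads magic bytes from the file itself, which is why the earlier example can classify a file literally named "unknown_file".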

Recursive Document Unpacking

from tikara import Tika
from pathlib import Path

tika = Tika()
results = tika.unpack(
    "container.docx",
    output_dir=Path("extracted"),
    max_depth=3
)

for item in results:
    print(f"Extracted {item.metadata['Content-Type']} to {item.file_path}")
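Once unpack has written the embedded files under output_dir, ordinary pathlib code can post-process them. A small sketch; the group_by_suffix helper is hypothetical, and a temporary directory with hand-made files stands in for real unpack output:

```python
import tempfile
from pathlib import Path


def group_by_suffix(output_dir: Path) -> dict[str, list[Path]]:
    """Bucket every extracted file by its suffix for later routing."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(output_dir.rglob("*")):
        if path.is_file():
            groups.setdefault(path.suffix or "<none>", []).append(path)
    return groups


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fake "extracted" files standing in for unpack() output.
    (root / "embedded.png").write_bytes(b"\x89PNG")
    (root / "notes.txt").write_text("hello")
    groups = group_by_suffix(root)
    print(sorted(groups))  # ['.png', '.txt']
```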

🔧 Development

Environment Setup

  1. Ensure that you have the system dependencies installed

  2. Install uv:

    pip install uv
    
  3. Install Python dependencies and create the virtual environment: uv sync

Common Tasks

make ruff        # Format and lint code
make test        # Run test suite
make docs        # Generate documentation
make stubs       # Generate Java stubs
make prepush     # Run all checks (ruff, test, coverage, safety)

🤔 When to Use Tikara

Ideal Use Cases

  • Python applications needing document processing
  • Microservices and containerized environments
  • Data processing pipelines (Ray, Dask, Prefect)
  • Applications requiring direct Tika integration without HTTP overhead

Advanced Usage

For detailed documentation on:

  • Custom parser implementation
  • Custom detector creation
  • MIME type handling

See the Example Jupyter Notebooks 📔

🎯 Inspiration

Tikara builds on the shoulders of giants:

  • Apache Tika - The powerful content detection and extraction toolkit
  • tika-python - The original Python Tika wrapper using HTTP that inspired this project
  • JPype - The bridge between Python and Java

Considerations

  • Process isolation: Tika runs in-process, so a JVM crash can take down the host application
  • Memory management: large documents require careful handling
  • JVM startup: the first operation pays a one-time JVM startup cost
  • Custom implementations: writing parsers or detectors requires familiarity with Tika's Java interfaces

📊 Performance Considerations

Memory Management

  • Use streaming for large files
  • Monitor JVM heap usage
  • Consider process isolation for critical applications
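The streaming advice amounts to consuming output in bounded chunks rather than materializing one large string. A minimal sketch, with io.StringIO standing in for the stream returned by tika.parse(..., output_stream=True) and count_chars as an invented consumer:

```python
import io


def count_chars(stream, chunk_size: int = 64 * 1024) -> int:
    """Consume a text stream in fixed-size chunks so peak memory stays bounded."""
    total = 0
    while chunk := stream.read(chunk_size):
        total += len(chunk)
    return total


# io.StringIO stands in for the stream a large parse would return.
fake_stream = io.StringIO("x" * 150_000)
print(count_chars(fake_stream))  # 150000
```

The same pattern applies to any sink: write each chunk to a file or pipeline stage instead of accumulating, and memory use stays proportional to chunk_size rather than document size.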

Optimization Tips

  • Reuse Tika instances
  • Use appropriate output formats
  • Implement custom parsers for specific needs
  • Configure JVM parameters for your use case

🔐 Security Considerations

  • Input validation
  • Resource limits
  • Secure file handling
  • Access control for extracted content
  • Careful handling of custom parsers
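Input validation and resource limits can be applied before a file ever reaches the parser. A hypothetical pre-parse guard; the size cap, suffix allow-list, and validate_input name are illustrative choices, not Tikara defaults:

```python
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024            # illustrative per-file cap
ALLOWED_SUFFIXES = {".pdf", ".docx", ".txt"}  # illustrative allow-list


def validate_input(path: Path) -> None:
    """Reject a file before handing it to the parser."""
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"disallowed file type: {path.suffix!r}")
    if path.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} byte limit")
```

Calling validate_input(path) ahead of tika.parse(path) turns the first two checklist items into a single fail-fast step.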

🤝 Contributing

Contributions welcome! The project uses Make for development tasks:

make prepush     # Run all checks (format, lint, test, coverage, safety)

For developing custom parsers/detectors, Java stubs can be generated:

make stubs       # Generate Java stubs for Apache Tika interfaces

Note: Generated stubs are git-ignored but provide IDE support and type hints when implementing custom parsers/detectors.

Common Problems

  • Verify Java installation and JAVA_HOME environment variable
  • Ensure Tesseract and required language packs are installed
  • Check file permissions and paths
  • Monitor memory usage when processing large files
  • Use streaming output for large documents
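The first two checks can be automated with a small diagnostic helper run before constructing Tika. This sketch uses only the standard library; the environment_issues name is invented:

```python
import os
import shutil


def environment_issues() -> list[str]:
    """Collect likely causes of startup failures before constructing Tika."""
    issues: list[str] = []
    if not os.environ.get("JAVA_HOME"):
        issues.append("JAVA_HOME is not set")
    if shutil.which("java") is None:
        issues.append("no 'java' executable on PATH")
    if shutil.which("tesseract") is None:
        issues.append("tesseract not found (image OCR will be degraded)")
    return issues


for issue in environment_issues():
    print(f"WARNING: {issue}")
```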

📚 Reference

See API Documentation for complete details.

📄 License

Apache License 2.0 - See LICENSE for details.
