The metadata and text content extractor for almost every file type.
Tikara
🚀 Overview
Tikara is a modern, type-hinted Python wrapper for Apache Tika, supporting over 1600 file formats for content extraction, metadata analysis, and language detection. It provides direct JNI integration through JPype for optimal performance.
from tikara import Tika
tika = Tika()
content, metadata = tika.parse("document.pdf")
⚡️ Key Features
- Modern Python 3.12+ with complete type hints
- Direct JVM integration via JPype (no HTTP server required)
- Streaming support for large files
- Recursive document unpacking
- Language detection
- MIME type detection
- Custom parser and detector support
- Comprehensive metadata extraction
- Ships with embedded Tika JAR: works in air-gapped networks. No need to manage libraries.
- Opinionated Pydantic wrapper over Tika's metadata model, with access to the raw metadata.
📦 Supported Formats
🌈 1682 supported media types and counting!
🛠️ Installation
pip install tikara
System Dependencies
Required Dependencies
- Python 3.12+
- Java Development Kit 11+ (OpenJDK recommended)
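One way to confirm the Java requirement before first use is to parse the output of `java -version`. The sketch below is not part of Tikara; `parse_java_major` and `installed_java_major` are hypothetical helper names, and the check only covers the `java` binary found on `PATH`.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, if any."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as "1.x" (for example "1.8.0_392").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> Optional[int]:
    """Run `java -version` and parse the result; None if Java cannot be run."""
    try:
        proc = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    # `java -version` writes to stderr, so look at both streams.
    return parse_java_major(proc.stderr + proc.stdout)
```

A caller could compare the result against 11 and fail early with a clear message instead of waiting for a JVM startup error.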
Optional Dependencies
Image and PDF OCR Enhancements (recommended)
- Tesseract OCR (strongly recommended if you process images) (Reference ⇗)
# Ubuntu
apt-get install tesseract-ocr
Additional language packs for Tesseract (optional):
# Ubuntu
apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita tesseract-ocr-spa
- ImageMagick for advanced image processing (Reference ⇗)
# Ubuntu
apt-get install imagemagick
Multimedia Enhancements (recommended)
- FFmpeg for enhanced multimedia file support (Reference ⇗)
# Ubuntu
apt-get install ffmpeg
Enhanced PDF Support (recommended)
- PDFBox ⇗ for enhanced PDF support (Reference ⇗)
# Ubuntu
apt-get install pdfbox
Metadata Enhancements (recommended)
- ExifTool for metadata extraction from images (Reference ⇗)
# Ubuntu
apt-get install libimage-exiftool-perl
Geospatial Enhancements
- GDAL for geospatial file support (Reference ⇗)
# Ubuntu
apt-get install gdal-bin
Additional Font Support (recommended)
- MSCore Fonts for enhanced Office file handling (Reference ⇗)
# Ubuntu
apt-get install xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer
For more OS dependency information including MSCore fonts setup and additional configuration, see the official Apache Tika Dockerfile.
📖 Usage
Basic Content Extraction
from tikara import Tika
from pathlib import Path
tika = Tika()
# Basic string output
content, metadata = tika.parse("document.pdf")
# Stream large files
stream, metadata = tika.parse(
    "large.pdf",
    output_stream=True,
    output_format="txt"
)
# Save to file
output_path, metadata = tika.parse(
    "input.docx",
    output_file=Path("output.txt"),
    output_format="txt"
)
Language Detection
from tikara import Tika
tika = Tika()
result = tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso")
print(f"Language: {result.language}, Confidence: {result.confidence}")
MIME Type Detection
from tikara import Tika
tika = Tika()
mime_type = tika.detect_mime_type("unknown_file")
print(f"Detected type: {mime_type}")
Recursive Document Unpacking
from tikara import Tika
from pathlib import Path
tika = Tika()
results = tika.unpack(
    "container.docx",
    output_dir=Path("extracted"),
    max_depth=3
)
for item in results:
    print(f"Extracted {item.metadata['Content-Type']} to {item.file_path}")
🔧 Development
Environment Setup
- Ensure that you have the system dependencies installed
- Install uv:
pip install uv
- Install Python dependencies and create the virtual environment:
uv sync
Common Tasks
make ruff # Format and lint code
make test # Run test suite
make docs # Generate documentation
make stubs # Generate Java stubs
make prepush # Run all checks (ruff, test, coverage, safety)
🤔 When to Use Tikara
Ideal Use Cases
- Python applications needing document processing
- Microservices and containerized environments
- Data processing pipelines (Ray, Dask, Prefect)
- Applications requiring direct Tika integration without HTTP overhead
Advanced Usage
For detailed documentation on:
- Custom parser implementation
- Custom detector creation
- MIME type handling
See the Example Jupyter Notebooks 📔
🎯 Inspiration
Tikara builds on the shoulders of giants:
- Apache Tika - The powerful content detection and extraction toolkit
- tika-python - The original Python Tika wrapper using HTTP that inspired this project
- JPype - The bridge between Python and Java
Considerations
- Process isolation: Tika crashes will affect the host application
- Memory management: Large documents require careful handling
- JVM startup: Initial overhead for first operation
- Custom implementations: Parser/detector development requires Java interface knowledge
📊 Performance Considerations
Memory Management
- Use streaming for large files
- Monitor JVM heap usage
- Consider process isolation for critical applications
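Process isolation can be approximated with the standard library by running each extraction in a short-lived child interpreter, so that a JVM crash or runaway parse ends the child and not the host. This is a sketch under assumptions, not a Tikara feature: `extract_isolated` and `CHILD_CODE` are hypothetical names, and the child relies only on the `Tika().parse(path)` call shown in the usage section returning text as its first value.

```python
import subprocess
import sys

# Child program: parse one file with Tikara and write the text to stdout as UTF-8.
# The (content, metadata) return shape is taken from the basic usage example.
CHILD_CODE = """
import sys
from tikara import Tika
content, _metadata = Tika().parse(sys.argv[1])
sys.stdout.buffer.write(content.encode("utf-8"))
"""


def extract_isolated(
    path: str, timeout: float = 120.0, child_code: str = CHILD_CODE
) -> str:
    """Run the extraction in a separate Python process and return its text output."""
    proc = subprocess.run(
        [sys.executable, "-c", child_code, path],
        capture_output=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired and kills the child
        check=False,
    )
    if proc.returncode != 0:
        detail = proc.stderr.decode("utf-8", "replace")
        raise RuntimeError(f"extraction failed: {detail}")
    return proc.stdout.decode("utf-8")
```

The trade-off is that every call pays the JVM startup cost again, so this pattern suits untrusted or crash-prone inputs more than high-throughput batches.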
Optimization Tips
- Reuse Tika instances
- Use appropriate output formats
- Implement custom parsers for specific needs
- Configure JVM parameters for your use case
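Reusing an instance can be as simple as caching a factory function. The following is a minimal sketch, assuming only that `Tika()` can be constructed without arguments as in the examples above; `get_tika` is a hypothetical helper name, not part of the library.

```python
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=None)
def get_tika() -> Any:
    """Create the Tika wrapper on first call and return the same object afterwards."""
    # Imported lazily so the JVM startup cost is paid on first use, not at import time.
    from tikara import Tika

    return Tika()
```

Code that needs the wrapper calls `get_tika()` instead of `Tika()`, so the initial overhead is incurred once per process.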
🔐 Security Considerations
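- Validate inputs before parsing, especially files from untrusted sources
- Set resource limits (file size, memory, processing time)
- Handle input and output files securely
- Control access to extracted content
- Review custom parsers carefully before using them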
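Input validation and resource limits can start with a small pre-check that runs before a file is handed to the parser. The sketch below uses only the standard library; `check_input`, `allowed_root`, and the 100 MB default are illustrative assumptions, not Tikara settings.

```python
from pathlib import Path
from typing import Union

MAX_BYTES = 100 * 1024 * 1024  # hypothetical size limit; tune for your workload


def check_input(
    path: Union[str, Path],
    allowed_root: Union[str, Path],
    max_bytes: int = MAX_BYTES,
) -> Path:
    """Return the resolved path if it is a regular file under allowed_root and small enough."""
    # resolve() follows symlinks and collapses "..", so traversal tricks are caught below.
    resolved = Path(path).resolve(strict=True)
    root = Path(allowed_root).resolve(strict=True)
    if not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside {root}")
    if not resolved.is_file():
        raise ValueError(f"{resolved} is not a regular file")
    size = resolved.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{resolved} is {size} bytes, limit is {max_bytes}")
    return resolved
```

The returned path can then be passed to the parser; a missing file raises `FileNotFoundError` from `resolve(strict=True)`.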
🤝 Contributing
Contributions welcome! The project uses Make for development tasks:
make prepush # Run all checks (format, lint, test, coverage, safety)
For developing custom parsers/detectors, Java stubs can be generated:
make stubs # Generate Java stubs for Apache Tika interfaces
Note: Generated stubs are git-ignored but provide IDE support and type hints when implementing custom parsers/detectors.
Common Problems
- Verify Java installation and JAVA_HOME environment variable
- Ensure Tesseract and required language packs are installed
- Check file permissions and paths
- Monitor memory usage when processing large files
- Use streaming output for large documents
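The first two checks in the list above can be scripted. This is a rough diagnostic sketch using only the standard library; `check_environment` is a hypothetical helper, and it reports whether the tools are visible to the current process, not whether their versions are suitable.

```python
import os
import shutil
from typing import Callable, Dict, Mapping, Optional


def check_environment(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report whether Java and Tesseract look reachable from this process."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    return {
        "java_on_path": which("java") is not None,
        "java_home_set": bool(java_home),
        "java_home_is_dir": bool(java_home and os.path.isdir(java_home)),
        "tesseract_on_path": which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The `env` and `which` parameters exist only so the function can be exercised without touching the real environment.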
📚 Reference
See API Documentation for complete details.
📄 License
Apache License 2.0 - See LICENSE for details.