LlamaIndex Legacy Office Reader: loads legacy .doc files using Apache Tika
LlamaIndex Legacy Office Reader
Overview
The Legacy Office Reader allows loading data from legacy Office documents (like Word 97 .doc files) using Apache Tika. It runs the Tika server locally to avoid remote server calls.
Installation
You can install the Legacy Office Reader via pip:
pip install llama-index-readers-legacy-office
Usage
Basic Usage
from llama_index.readers.legacy_office import LegacyOfficeReader

# Initialize LegacyOfficeReader
reader = LegacyOfficeReader(
    tika_server_jar_path="path/to/tika-server.jar",  # Optional: path to the Tika server JAR
)

# Load data from a legacy Office document
documents = reader.load_data(
    file="path/to/document.doc",  # Path to the legacy Office document
)
Using with SimpleDirectoryReader
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.legacy_office import LegacyOfficeReader

# Route every .doc file in the directory through LegacyOfficeReader
reader = SimpleDirectoryReader(
    input_dir="path/to/directory/",
    file_extractor={".doc": LegacyOfficeReader()},
)

documents = reader.load_data()
Features
- Parses legacy Office documents (.doc) using Apache Tika
- By default, runs the Tika server locally to avoid remote server calls and dependencies
- Extracts both content and metadata from documents
- Supports batch processing of multiple documents
- Seamless integration with SimpleDirectoryReader
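As a rough illustration of the batch-processing point above, the sketch below walks a directory and hands each `.doc` file to a reader. The helper names `find_doc_files` and `load_doc_files` are hypothetical and not part of this package; the only assumption about the reader is that it exposes `load_data(file=...)` and returns a list, as in Basic Usage.

```python
from pathlib import Path
from typing import Any, List


def find_doc_files(directory: str) -> List[Path]:
    """Return all .doc files under `directory`, sorted for a stable order."""
    return sorted(
        p
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".doc"
    )


def load_doc_files(reader: Any, directory: str) -> List[Any]:
    """Load every .doc file with `reader` and return one flat list.

    `reader` is assumed to behave like LegacyOfficeReader, that is,
    `reader.load_data(file=...)` returns a list of documents.
    """
    documents: List[Any] = []
    for path in find_doc_files(directory):
        documents.extend(reader.load_data(file=path))
    return documents
```

In most cases the SimpleDirectoryReader integration shown earlier covers the same need; a hand-written loop like this is mainly useful when you want to filter or log files individually.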
Requirements
- Java Runtime Environment (JRE) 11 or higher (required for Apache Tika 3.x)
- Python 3.8 or higher
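One way to check the Java requirement before using the reader is sketched below. It is not part of this package: the function names are hypothetical, and it assumes the `java` executable on your PATH prints a conventional `version "..."` string when called with `-version`.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_392").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` conventionally writes to stderr; check both streams.
    return parse_java_major(result.stderr + result.stdout)
```

If `java_major_version()` returns `None` or a value below 11, the requirement above is not met on this machine.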
Notes
- The first time you use the reader, it downloads the Tika server JAR file if you have not provided one
- The Tika server runs locally on port 9998
- All document metadata is preserved in the Document objects
- Make sure you have Java 11+ installed and available in your system PATH
- The reader uses Apache Tika 3.x
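Because the notes above say the local Tika server uses port 9998, it can help to see whether something is already listening there. The following is a minimal sketch using only the standard library; `is_port_in_use` is a hypothetical helper, and the host `127.0.0.1` is an assumption about where the local server binds.

```python
import socket


def is_port_in_use(port: int = 9998, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

A `True` result only shows that some process is listening on the port; it does not confirm that the process is a Tika server.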
Credits
This reader is built on top of:
- Apache Tika - Content analysis toolkit
- tika-python - Python bindings for Apache Tika
File details
Details for the file llama_index_readers_legacy_office-0.1.1.tar.gz.
File metadata
- Download URL: llama_index_readers_legacy_office-0.1.1.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ac8be3372e11004ecd780b164e597456cf9d4c9f8edc567683adb5851fddc25 |
| MD5 | f87e6b9f73ff439b284f9cbe4c4ebe9d |
| BLAKE2b-256 | 4162eccef9043a4e490979c1849a01c527a907ff452c2fe75ee70b9fc65f66b7 |
File details
Details for the file llama_index_readers_legacy_office-0.1.1-py3-none-any.whl.
File metadata
- Download URL: llama_index_readers_legacy_office-0.1.1-py3-none-any.whl
- Upload date:
- Size: 5.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b629b88a3ace6b457d3a6ec868f60d1bb977996ef284e8e844cd6e87448de83a |
| MD5 | f12a16cd032a6826863a8a87412f4e72 |
| BLAKE2b-256 | 7473047434bce6cfbb5c9f5e27460e9f0370ba811f10784656e027644bc9dab0 |
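If you download one of the files manually, the SHA256 digests listed above can be checked with a few lines of standard-library Python. This is only a sketch: `sha256_of_file` and `matches_sha256` are hypothetical helper names, and the path you pass is wherever you saved the download.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("llama_index_readers_legacy_office-0.1.1.tar.gz", "<SHA256 from the table>")` should return `True` for an intact source distribution.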