LlamaIndex Legacy Office Reader: loads legacy .doc files using Apache Tika

LlamaIndex Legacy Office Reader

Overview

The Legacy Office Reader allows loading data from legacy Office documents (like Word 97 .doc files) using Apache Tika. It runs the Tika server locally to avoid remote server calls.

Installation

You can install the Legacy Office Reader via pip:

pip install llama-index-readers-legacy-office

Usage

Basic Usage

from llama_index.readers.legacy_office import LegacyOfficeReader

# Initialize LegacyOfficeReader
reader = LegacyOfficeReader(
    tika_server_jar_path="path/to/tika-server.jar",  # Optional: Path to Tika server JAR
)

# Load data from a legacy Office document
documents = reader.load_data(
    file="path/to/document.doc",  # Path to the legacy Office document
)
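
To check what the reader returned, it can help to summarize each document's text length and metadata keys. The sketch below assumes each returned object exposes `text` and `metadata` attributes, as LlamaIndex `Document` objects do. `summarize_documents` is a hypothetical helper, not part of this package, and the stand-in objects exist only so the example runs without Java or Tika.

```python
from types import SimpleNamespace


def summarize_documents(documents):
    """Return one small summary dict per document (hypothetical helper)."""
    summaries = []
    for doc in documents:
        summaries.append(
            {
                "chars": len(doc.text),  # length of the extracted text
                "metadata_keys": sorted(doc.metadata),  # metadata fields present
            }
        )
    return summaries


# Stand-in object with the same two attributes; in real use, pass the
# list returned by reader.load_data(...) instead.
sample = [SimpleNamespace(text="Hello world", metadata={"title": "Demo", "author": "A"})]
print(summarize_documents(sample))
```

The metadata keys that Tika actually emits depend on the document, so the helper lists whatever is present rather than assuming specific field names.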

Using with SimpleDirectoryReader

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.legacy_office import LegacyOfficeReader

reader = SimpleDirectoryReader(
    input_dir="path/to/directory/",
    file_extractor={".doc": LegacyOfficeReader()},
)
documents = reader.load_data()
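
Before loading a directory, it may be useful to see which files carry a .doc suffix. This is a minimal sketch using only the standard library; `find_doc_files` is a hypothetical helper, and its case-insensitive suffix match and `recursive` flag are assumptions of the sketch that may not mirror how SimpleDirectoryReader selects files.

```python
import tempfile
from pathlib import Path


def find_doc_files(input_dir, recursive=False):
    """Return sorted paths of .doc files under input_dir (hypothetical helper)."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Compare the lowercased suffix so that "OLD.DOC" is also listed.
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".doc")


# Demonstration on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.doc").touch()
    (Path(tmp) / "OLD.DOC").touch()
    (Path(tmp) / "notes.txt").touch()
    found_names = [p.name for p in find_doc_files(tmp)]
    print(found_names)
```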

Features

  • Parses legacy Office documents (.doc) using Apache Tika
  • Runs the Tika server locally by default, avoiding remote server calls and dependencies
  • Extracts both content and metadata from documents
  • Supports batch processing of multiple documents
  • Seamless integration with SimpleDirectoryReader

Requirements

  • Java Runtime Environment (JRE) 11 or higher (required for Apache Tika 3.x)
  • Python 3.8 or higher
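
The Java requirement can be checked from Python before the reader is used. The following is a rough sketch, not part of the package: `parse_java_major` and `installed_java_major` are hypothetical helpers, and the parsing assumes the common `version "..."` format printed by `java -version`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major version from `java -version` output, or None if not found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_381").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # `java -version` usually writes to stderr; check both streams to be safe.
    return parse_java_major(result.stderr + result.stdout)


major = installed_java_major()
if major is None:
    print("No usable Java found on PATH")
elif major < 11:
    print(f"Java {major} found; 11 or higher is required")
else:
    print(f"Java {major} found")
```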

Notes

  • The first time you use the reader, it downloads the Tika server JAR file unless you provide a path to one
  • The Tika server will run locally on port 9998
  • All document metadata is preserved in the Document objects
  • Make sure Java 11+ is installed and available on your system PATH
  • The reader uses Apache Tika 3.x
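
Because the Tika server listens on local port 9998, another process already bound to that port could interfere. The sketch below is one way to check whether something is listening there; `is_port_in_use` is a hypothetical helper, and the host and timeout values are assumptions.

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing reachable is listening.
        return False


if is_port_in_use(9998):
    print("Something is already listening on port 9998")
else:
    print("Nothing is listening on port 9998")
```

A positive result does not necessarily indicate a problem: it may simply be a Tika server left running by an earlier session.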

Credits

This reader is built on top of Apache Tika.
