LlamaIndex Legacy Office Reader: loads legacy .doc files with Apache Tika

LlamaIndex Legacy Office Reader

Overview

The Legacy Office Reader loads data from legacy Office documents (such as Word 97 .doc files) using Apache Tika. By default, it runs the Tika server locally to avoid remote server calls.

Installation

You can install the Legacy Office Reader via pip:

pip install llama-index-readers-legacy-office

Usage

Basic Usage

from llama_index.readers.legacy_office import LegacyOfficeReader

# Initialize LegacyOfficeReader
reader = LegacyOfficeReader(
    tika_server_jar_path="path/to/tika-server.jar",  # Optional: Path to Tika server JAR
)

# Load data from a legacy Office document
documents = reader.load_data(
    file="path/to/document.doc",  # Path to the legacy Office document
)

Using with SimpleDirectoryReader

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.legacy_office import LegacyOfficeReader

reader = SimpleDirectoryReader(
    input_dir="path/to/directory/",
    file_extractor={".doc": LegacyOfficeReader()},
)
documents = reader.load_data()

Features

  • Parses legacy Office documents (.doc) using Apache Tika
  • Runs the Tika server locally by default, avoiding remote server calls and dependencies
  • Extracts both content and metadata from documents
  • Supports batch processing of multiple documents
  • Seamless integration with SimpleDirectoryReader
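
The batch-processing feature above can be approximated without SimpleDirectoryReader by walking a directory and reusing one reader instance. This is a minimal sketch, not part of the package: `find_doc_files` and `load_all` are hypothetical helper names, and the only reader call assumed is `load_data(file=...)` with a string path, as shown in Basic Usage.

```python
from pathlib import Path
from typing import Any, Iterable, List


def find_doc_files(root: str) -> List[Path]:
    """Return all .doc files under root (searched recursively), sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".doc"
    )


def load_all(reader: Any, paths: Iterable[Path]) -> List[Any]:
    """Load every path with the given reader and flatten the results.

    `reader` is any object exposing load_data(file=...), for example a
    LegacyOfficeReader instance (assumed usage, mirroring Basic Usage).
    """
    documents: List[Any] = []
    for path in paths:
        # Pass a string path, as in the Basic Usage example.
        documents.extend(reader.load_data(file=str(path)))
    return documents
```

Reusing a single reader for all files, for example `load_all(LegacyOfficeReader(), find_doc_files("path/to/directory/"))`, keeps the per-file work down to the `load_data` call itself.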

Requirements

  • Java Runtime Environment (JRE) 11 or higher (required for Apache Tika 3.x)
  • Python 3.8 or higher
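
Because the reader depends on a Java runtime, it can help to check the installed version before loading documents. The following is a minimal sketch using only the Python standard library; `parse_java_major` and `installed_java_major` are hypothetical helper names, not part of the package, and the parsing assumes the usual `version "..."` line printed by `java -version`.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_292" -> 8) and the modern scheme
    ("11.0.2" -> 11, "17" -> 17). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        # Legacy numbering: "1.8" means Java 8.
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if absent."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` usually prints to stderr; read both streams to be safe.
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False
    )
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None or major < 11:
        print("Java 11 or higher was not found on PATH")
    else:
        print(f"Found Java {major}")
```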

Notes

  • On first use, the reader downloads the Tika server JAR file if you do not provide one
  • The Tika server runs locally on port 9998
  • All document metadata is preserved in the Document objects
  • Make sure you have Java 11+ installed and available in your system PATH
  • The reader uses Apache Tika 3.x
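
Since the local Tika server uses port 9998, a quick check for an existing listener on that port can help diagnose startup problems. This is a minimal sketch with the standard library only; `is_port_in_use` is a hypothetical helper, and a positive result only shows that some process accepts connections there, not that it is Tika.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if is_port_in_use(9998):
        print("Port 9998 already has a listener")
    else:
        print("Port 9998 is free")
```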

Credits

This reader is built on top of:
