LlamaIndex Legacy Office Reader: loads legacy .doc files with Apache Tika

LlamaIndex Legacy Office Reader

Overview

The Legacy Office Reader allows loading data from legacy Office documents (like Word 97 .doc files) using Apache Tika. It runs the Tika server locally to avoid remote server calls.

Installation

You can install the Legacy Office Reader via pip:

pip install llama-index-readers-legacy-office

Usage

Basic Usage

from llama_index.readers.legacy_office import LegacyOfficeReader

# Initialize LegacyOfficeReader
reader = LegacyOfficeReader(
    tika_server_jar_path="path/to/tika-server.jar",  # Optional: Path to Tika server JAR
)

# Load data from a legacy Office document
documents = reader.load_data(
    file="path/to/document.doc",  # Path to the legacy Office document
)

Using with SimpleDirectoryReader

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.legacy_office import LegacyOfficeReader

reader = SimpleDirectoryReader(
    input_dir="path/to/directory/",
    file_extractor={".doc": LegacyOfficeReader()},
)
documents = reader.load_data()

Features

  • Parses legacy Office documents (.doc) using Apache Tika
  • Runs the Tika server locally by default, avoiding remote server calls and dependencies
  • Extracts both content and metadata from documents
  • Supports batch processing of multiple documents
  • Seamless integration with SimpleDirectoryReader
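
For batch processing without SimpleDirectoryReader, a loop over a directory is one option. The sketch below is illustrative only: `load_doc_files` is a hypothetical helper, not part of the package, and it assumes only what the Basic Usage example shows, namely that the reader exposes `load_data(file=...)` and accepts a string path.

```python
from pathlib import Path
from typing import Any, Dict, List


def load_doc_files(reader: Any, directory: str) -> Dict[str, List[Any]]:
    """Call reader.load_data(file=...) for every .doc file under directory.

    `reader` is any object with a load_data(file=...) method, for example
    a LegacyOfficeReader instance. Hypothetical helper for illustration.
    """
    results: Dict[str, List[Any]] = {}
    for path in sorted(Path(directory).rglob("*")):
        # Match the extension case-insensitively so "REPORT.DOC" is included.
        if path.is_file() and path.suffix.lower() == ".doc":
            results[str(path)] = reader.load_data(file=str(path))
    return results
```

Passing `LegacyOfficeReader()` as `reader` should return a mapping from each file path to the documents loaded from it. Reusing one reader instance for the whole loop is a design choice made here on the assumption that a single instance can serve repeated calls.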

Requirements

  • Java Runtime Environment (JRE) 11 or higher (required for Apache Tika 3.x)
  • Python 3.8 or higher
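
The Java requirement can be checked before the reader is created. The following is a minimal sketch using only the Python standard library. The helper names `parse_java_major` and `installed_java_major` are hypothetical, and the parser assumes the usual `version "..."` banner printed by `java -version`, which may differ on some JVM distributions.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8.0_292";
    # in that scheme the second field is the real major version.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    proc = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False
    )
    # The version banner is normally written to stderr, so check both streams.
    return parse_java_major(proc.stderr + proc.stdout)
```

A caller might compare the result against 11 and report a clear error when it is `None` or lower, rather than waiting for the Tika server to fail at startup.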

Notes

  • On first use, the reader downloads the Tika server JAR file if no JAR path is provided
  • The Tika server will run locally on port 9998
  • All document metadata is preserved in the Document objects
  • Make sure you have Java 11+ installed and available in your system PATH
  • The reader uses Apache Tika 3.x
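
Because the Tika server uses port 9998, it can help to know whether something is already listening there. The sketch below uses only the standard library. `is_port_open` is a hypothetical helper, and a `True` result only shows that some process accepts connections on that port, not that the process is Tika.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 9998,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise,
        # so a refused connection does not raise an exception.
        return sock.connect_ex((host, port)) == 0
```

Calling `is_port_open()` before initializing the reader gives an early hint of a port conflict with another service.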

Credits

This reader is built on top of:
