LlamaIndex Legacy Office Reader: loads legacy .doc files with Apache Tika

LlamaIndex Legacy Office Reader

Overview

The Legacy Office Reader loads data from legacy Office documents (such as Word 97 .doc files) using Apache Tika. By default it runs the Tika server locally, so no remote server calls are needed.

Installation

You can install the Legacy Office Reader via pip:

pip install llama-index-readers-legacy-office

Usage

Basic Usage

from llama_index.readers.legacy_office import LegacyOfficeReader

# Initialize LegacyOfficeReader
reader = LegacyOfficeReader(
    tika_server_jar_path="path/to/tika-server.jar",  # Optional: Path to Tika server JAR
)

# Load data from a legacy Office document
documents = reader.load_data(
    file="path/to/document.doc",  # Path to the legacy Office document
)

Using with SimpleDirectoryReader

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.legacy_office import LegacyOfficeReader

reader = SimpleDirectoryReader(
    input_dir="path/to/directory/",
    file_extractor={".doc": LegacyOfficeReader()},
)
documents = reader.load_data()

Features

  • Parses legacy Office documents (.doc) using Apache Tika
  • Runs the Tika server locally by default, avoiding remote server calls and dependencies
  • Extracts both content and metadata from documents
  • Supports batch processing of multiple documents
  • Seamless integration with SimpleDirectoryReader
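
As a rough illustration of batch processing without SimpleDirectoryReader, the sketch below collects .doc files from a directory and passes them to a reader one at a time. The helper names find_doc_files and load_many are hypothetical and not part of the package. The sketch also assumes that load_data(file=...) returns a list of documents, as the Basic Usage example suggests.

```python
from pathlib import Path
from typing import Any, Iterable, List


def find_doc_files(directory: str) -> List[Path]:
    """Return all .doc files under `directory`, sorted for a stable order."""
    return sorted(
        p
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".doc"
    )


def load_many(reader: Any, paths: Iterable[Path]) -> List[Any]:
    """Call reader.load_data(file=...) for each path and flatten the results.

    Assumption: `reader` is any object whose load_data(file=...) returns a
    list, as LegacyOfficeReader appears to do in the Basic Usage example.
    """
    documents: List[Any] = []
    for path in paths:
        documents.extend(reader.load_data(file=str(path)))
    return documents
```

With the package installed, load_many(LegacyOfficeReader(), find_doc_files("path/to/directory/")) would be one way to call it. Reusing a single reader instance across files is a design choice made in this sketch, not a documented requirement.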

Requirements

  • Java Runtime Environment (JRE) 11 or higher (required for Apache Tika 3.x)
  • Python 3.8 or higher
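
To check the Java requirement before using the reader, a small preflight script along these lines may help. It is a sketch using only the Python standard library. The function names parse_java_major and java_major_version are hypothetical, and the parsing assumes the usual `java -version` output format (for example `openjdk version "17.0.9"`), which can vary between JVM vendors.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report versions like "1.8.0_391", so the real
    # major version is the second component.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` typically writes to stderr, so inspect both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

If java_major_version() returns None or a number below 11, the Java requirement is probably not satisfied on this machine.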

Notes

  • The first time you use the reader, it downloads the Tika server JAR file if you have not provided one
  • The Tika server will run locally on port 9998
  • All document metadata is preserved in the Document objects
  • Make sure you have Java 11+ installed and available in your system PATH
  • The reader uses Apache Tika 3.x
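
Because the Tika server listens on port 9998, it can be useful to see whether something is already bound to that port. The helper below is a minimal sketch using only the standard library. The name is_port_in_use is hypothetical, and the sketch assumes the server is reachable on 127.0.0.1.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

Calling is_port_in_use(9998) before the first load shows whether the port is occupied, either by a previously started Tika server or by an unrelated process.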

Credits

This reader is built on top of:
