Skip to main content

llama-index readers file integration

Project description

LlamaIndex Readers Integration: File

pip install llama-index-readers-file

This is the default integration for different loaders that are used within SimpleDirectoryReader.

Provides support for the following loaders:

  • DocxReader
  • HWPReader
  • PDFReader
  • EpubReader
  • FlatReader
  • HTMLTagReader
  • ImageCaptionReader
  • ImageReader
  • ImageVisionLLMReader
  • IPYNBReader
  • MarkdownReader
  • MboxReader
  • PptxReader
  • PandasCSVReader
  • VideoAudioReader
  • UnstructuredReader
  • PyMuPDFReader
  • ImageTabularChartReader
  • XMLReader
  • PagedCSVReader
  • CSVReader
  • RTFReader

Installation

pip install llama-index-readers-file

Usage

Once installed, You can import any of the loader. Here's an example usage of one of the loader.

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import (
    DocxReader,
    HWPReader,
    PDFReader,
    EpubReader,
    FlatReader,
    HTMLTagReader,
    ImageCaptionReader,
    ImageReader,
    ImageVisionLLMReader,
    IPYNBReader,
    MarkdownReader,
    MboxReader,
    PptxReader,
    PandasCSVReader,
    VideoAudioReader,
    UnstructuredReader,
    PyMuPDFReader,
    ImageTabularChartReader,
    XMLReader,
    PagedCSVReader,
    CSVReader,
    RTFReader,
)

# PDF Reader with `SimpleDirectoryReader`
parser = PDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Docx Reader example
parser = DocxReader()
file_extractor = {".docx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# HWP Reader example
parser = HWPReader()
file_extractor = {".hwp": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Epub Reader example
parser = EpubReader()
file_extractor = {".epub": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Flat Reader example
parser = FlatReader()
file_extractor = {".txt": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# HTML Tag Reader example
parser = HTMLTagReader()
file_extractor = {".html": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Image Reader example
parser = ImageReader()
file_extractor = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
}  # Add other image formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# IPYNB Reader example
parser = IPYNBReader()
file_extractor = {".ipynb": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Markdown Reader example
parser = MarkdownReader()
file_extractor = {".md": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Mbox Reader example
parser = MboxReader()
file_extractor = {".mbox": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Pptx Reader example
# Basic usage - extracts text, tables, charts, and speaker notes
parser = PptxReader()

# Advanced usage - control parsing behavior
parser = PptxReader(
    extract_images=True,  # Enable image captioning
    context_consolidation_with_llm=True,  # Use LLM for content synthesis
    num_workers=4,  # Parallel processing
    batch_size=10,  # Slides processed per worker batch
    raise_on_error=True,  # Raise value error if file_parsing is not successful
)

file_extractor = {".pptx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()


# Pandas CSV Reader example
parser = PandasCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# PyMuPDF Reader example
parser = PyMuPDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# XML Reader example
parser = XMLReader()
file_extractor = {".xml": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Paged CSV Reader example
parser = PagedCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# CSV Reader example
parser = CSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

This loader is designed to be used as a way to load data into LlamaIndex.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llama_index_readers_file-0.6.0.tar.gz (32.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llama_index_readers_file-0.6.0-py3-none-any.whl (51.8 kB view details)

Uploaded Python 3

File details

Details for the file llama_index_readers_file-0.6.0.tar.gz.

File metadata

  • Download URL: llama_index_readers_file-0.6.0.tar.gz
  • Upload date:
  • Size: 32.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for llama_index_readers_file-0.6.0.tar.gz
Algorithm Hash digest
SHA256 ff366d6ff5ecb7119275ac859310d8b672d8b6b3261afae02f4084fce9076bd0
MD5 c5518d181b4d5b38eb98030dee53d215
BLAKE2b-256 29b95d5ecb81febd544a04b38f4ef7d3a69ec0625204f3c0f3d8200b70f16c2d

See more details on using hashes here.

File details

Details for the file llama_index_readers_file-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: llama_index_readers_file-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 51.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.10.9 {"installer":{"name":"uv","version":"0.10.9","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for llama_index_readers_file-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1026d94f2d5902152373bc2c3b7caa7e216d956620b22d510e516850b6a7440d
MD5 f2eee285a88bd7b728543b8c54570683
BLAKE2b-256 7eabff4b2e14a63708abc0115ccb13e41b324972c0346d9b82d1fb8dbb77873a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page