llama-index readers preprocess integration

Preprocess Loader

pip install llama-index-readers-preprocess

Preprocess is an API service that splits any kind of document into optimal chunks of text for use in language model tasks. Given input documents, Preprocess splits them into chunks of text that respect the layout and semantics of the original document. It splits the content by taking into account sections, paragraphs, lists, images, data tables, text tables, and slides, and by following the content semantics for long texts. Supported formats are PDFs, Microsoft Office documents (Word, PowerPoint, Excel), OpenOffice documents (ods, odt, odp), HTML content (web pages, articles, emails), and plain text.

This loader integrates with the Preprocess API library to convert and chunk documents, or to load already chunked files, inside LlamaIndex.

Requirements

Install the Python Preprocess library if it is not already present:

pip install pypreprocess

Usage

To use this loader, you need a Preprocess API key, which you pass when initializing PreprocessReader. If you don't have one yet, request it at support@preprocess.co. Without an API key, the loader raises an error.
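One way to avoid hard-coding the key is to read it from the environment and fail early when it is missing. The sketch below is only an illustration: the helper read_api_key and the variable name PREPROCESS_API_KEY are assumptions made for this example and are not part of the loader.

```python
import os


def read_api_key(env=None, var_name="PREPROCESS_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    Both this helper and the variable name are hypothetical; the loader
    itself only requires that a key is passed as `api_key`.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise ValueError(f"Set {var_name} before creating the reader")
    return key
```

The returned value can then be passed as api_key when initializing PreprocessReader.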

To chunk a file, pass a valid filepath and the reader will convert and chunk it. Preprocess chunks your files with its own internal Splitter. For this reason, you should not parse the resulting documents into nodes with another Splitter, nor include a Splitter among the transformations of your IngestionPipeline.
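To see why a second splitting pass is discouraged, consider this toy sketch. It does not use Preprocess or LlamaIndex at all: fixed_size_split is a hypothetical naive splitter and the two strings stand in for chunks that already follow the document layout.

```python
def fixed_size_split(text, size):
    """Naive splitter: cut `text` every `size` characters, ignoring layout."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Stand-ins for chunks that already respect section boundaries.
chunks = [
    "Section 1. Setup steps for the device.",
    "Section 2. Safety warnings and limits.",
]

# Re-splitting cuts each chunk at an arbitrary character offset.
resplit = [piece for chunk in chunks for piece in fixed_size_split(chunk, 25)]

for piece in resplit:
    print(repr(piece))
```

Each layout-aware chunk ends up cut in the middle of a sentence, which is the kind of fragmentation the chunks were meant to prevent.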

If you want to handle the nodes directly:

from llama_index.core import VectorStoreIndex

from llama_index.readers.preprocess import PreprocessReader

# pass a filepath and get the chunks as nodes
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
nodes = loader.get_nodes()

# import the nodes in a Vector Store with your configuration
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()

By default, load_data() returns one document per chunk. Remember not to apply any further splitting to these documents:

from llama_index.core import VectorStoreIndex

from llama_index.readers.preprocess import PreprocessReader

# pass a filepath and get the chunks as documents (one per chunk)
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
documents = loader.load_data()

# don't apply any Splitter parser to the documents
# if you have an ingestion pipeline, don't include a Splitter in its transformations
# import the documents into a Vector Store; if you pass custom settings
# (such as the service_context parameter), make sure they don't include a splitter
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

If you want only the extracted text, so that you can handle it with a custom pipeline, set return_whole_document=True:

# pass a filepath and get the whole extracted text instead of separate chunks
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
document = loader.load_data(return_whole_document=True)
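As a rough idea of what a custom step could look like once you have the whole text, the sketch below splits a string on blank lines. It is a plain-Python illustration only: split_paragraphs is a hypothetical helper, and how you obtain the text from the returned value is not covered here.

```python
def split_paragraphs(text):
    """Split `text` on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


sample = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

Any other custom logic can take the place of this function; the point is that the splitting is then entirely under your control.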

If you want to load files that have already been chunked, pass the process_id to the reader:

# pass a process_id obtained from a previous instance to load its result
loader = PreprocessReader(api_key="your-api-key", process_id="your-process-id")
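To reuse a process_id across runs you need to keep it somewhere. The following is a minimal sketch of a JSON file cache keyed by file path. The helper names and the cache layout are assumptions made for this example, and how the process_id is obtained from a previous instance is deliberately left out.

```python
import json
from pathlib import Path


def remember_process_id(cache_file, filepath, process_id):
    """Store the process_id for `filepath` in a small JSON cache file."""
    cache_file = Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    cache[str(filepath)] = process_id
    cache_file.write_text(json.dumps(cache, indent=2))


def lookup_process_id(cache_file, filepath):
    """Return the cached process_id for `filepath`, or None if unknown."""
    cache_file = Path(cache_file)
    if not cache_file.exists():
        return None
    return json.loads(cache_file.read_text()).get(str(filepath))
```

When lookup_process_id returns a value, it can be passed as process_id to the reader; otherwise pass the filepath so the file is processed.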

This loader is designed to be used as a way to load data into LlamaIndex.

Other info

PreprocessReader is built on pypreprocess, the Preprocess Python library. For more information or other integration needs, please check the documentation.

Release history

0.6.0 (Mar 12, 2026)
0.5.0 (Feb 20, 2026)
0.4.1 (Sep 8, 2025)
0.4.0 (Jul 30, 2025), this version
0.3.0 (Nov 18, 2024)
0.2.0 (Aug 22, 2024)
0.1.4 (Aug 18, 2024)
0.1.3 (Feb 21, 2024)
0.1.2 (Feb 13, 2024)
0.1.1 (Feb 12, 2024)
0.1.0 (Feb 10, 2024)
0.0.1 (Feb 4, 2024)

