
Preprocess SDK


Preprocess SDK v1.6 (MIT License)

Preprocess is an API service that splits various types of documents into optimal chunks of text for use in language model tasks. It divides documents into chunks that respect the layout and semantics of the original content, accounting for sections, paragraphs, lists, images, data tables, text tables, and slides.

We support the following formats:

  • PDFs
  • Microsoft Office documents (Word, PowerPoint, Excel)
  • OpenOffice documents (ODS, ODT, ODP)
  • HTML content (web pages, articles, emails)
  • Plain text
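
Before sending a file, you may want to verify that its extension matches one of the formats listed above. A minimal sketch; the extension set below is inferred from this list and the helper is illustrative, not part of the SDK:

```python
from pathlib import Path

# Extensions inferred from the supported-format list above
# (illustrative, not an official SDK constant).
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx",
    ".ods", ".odt", ".odp",
    ".html", ".htm",
    ".txt",
}

def is_supported(filepath: str) -> bool:
    """Return True if the file extension appears in the supported set."""
    return Path(filepath).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("report.PDF"))   # True
print(is_supported("archive.zip"))  # False
```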

Installation

To install the Python Preprocess library, use:

pip install pypreprocess

Alternatively, to add it as a dependency with Poetry:

poetry add pypreprocess
poetry install

Note: You need a Preprocess API Key to use the SDK. To obtain one, please contact support@preprocess.co.

Getting Started

Retrieve chunks from a file for use in your language model tasks:

from pypreprocess import Preprocess

# Initialize the SDK with a file
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")

# Chunk the file
preprocess.chunk()
preprocess.wait()

# Get the result
result = preprocess.result()
for chunk in result.data['chunks']:
    print(chunk)  # each chunk is a ready-to-use piece of text

Initialization Options

You can initialize the SDK in three different ways:

1- Passing a local filepath:

Use this when you want to chunk a local file:

from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")

2- Passing a process_id:

When the chunking process starts, Preprocess generates a process_id that can be used to initialize the SDK later:

from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, process_id="id_of_the_process")

3- Passing a PreprocessResponse Object:

When you need to store and reload the result of a chunking process later, you can use the PreprocessResponse object:

import json
from pypreprocess import Preprocess, PreprocessResponse
# saved_json is the JSON string returned by a previous chunking process
response = PreprocessResponse(**json.loads(saved_json))
preprocess = Preprocess(api_key=YOUR_API_KEY, process=response)

Chunking Options

Preprocess offers several configuration options to tailor the chunking process to your needs.

Note: Preprocess aims to output chunks of fewer than 512 tokens. Longer chunks may occasionally be produced to preserve content integrity. We are currently working to allow user-defined chunk lengths.

  • merge (bool, default False): If True, small paragraphs are merged to maximize chunk length.
  • repeat_title (bool, default False): If True, each chunk starts with the title of the section it belongs to.
  • repeat_table_header (bool, default False): If True, each chunk that contains part of a table includes the table header.
  • table_output_format (enum: 'text' | 'markdown' | 'html', default 'text'): Output format for tables.
  • keep_header (bool, default True): If False, header content is removed. Headers may include page numbers, document titles, section titles, paragraph titles, and fixed layout elements.
  • smart_header (bool, default True): If True, only relevant headers (those that belong in the body of the page as section or paragraph titles) are included in the chunks, and other header content is removed. If False, only the keep_header parameter is considered. Ignored when keep_header is False.
  • keep_footer (bool, default False): If True, footer content is included in the chunks. Footers may include page numbers, footnotes, and fixed layout elements.
  • image_text (bool, default False): If True, text contained in images is added to the chunks.
  • boundary_boxes (bool, default False): If True, bounding box coordinates (top, left, height, width) are returned for each chunk.
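
Since chunks target 512 tokens but may run longer to preserve content integrity, it can be useful to flag unusually long chunks after retrieval. A rough sketch that uses a whitespace word count as a stand-in for a real tokenizer (an assumption; actual token counts depend on your model's tokenizer):

```python
def flag_long_chunks(chunks, limit=512):
    """Return (index, approx_token_count) pairs for chunks whose rough
    whitespace-based token estimate exceeds the limit."""
    flagged = []
    for i, chunk in enumerate(chunks):
        approx = len(chunk.split())  # crude stand-in for a real tokenizer
        if approx > limit:
            flagged.append((i, approx))
    return flagged

chunks = ["a short chunk", "word " * 600]
print(flag_long_chunks(chunks))  # [(1, 600)]
```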

You can pass these parameters during SDK initialization:

preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file", merge=True, repeat_title=True, ...)
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file", options={"merge": True, "repeat_title": True, ...})

Or set them later with the set_options method, passing a dict, keyword arguments, or both:

preprocess.set_options({"merge": True, "repeat_title": True, ...})
preprocess.set_options(merge=True, repeat_title=True, ...)

Note: if a parameter appears both in the options dictionary and as a keyword argument, the value from the dictionary takes precedence.
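
This precedence rule can be illustrated with a plain-Python sketch of how such a merge might work (an illustration of the documented behavior, not the SDK's actual implementation):

```python
def merge_options(options=None, **kwargs):
    """Combine keyword arguments with an options dict; dict values win
    on conflict, mirroring the precedence rule described above."""
    merged = dict(kwargs)
    merged.update(options or {})
    return merged

# The dict's value for "merge" overrides the keyword argument.
print(merge_options({"merge": True}, merge=False, repeat_title=True))
# {'merge': True, 'repeat_title': True}
```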

Chunking Files

After initializing the SDK with a filepath, use the chunk() method to start chunking the file:

from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
response = preprocess.chunk()

The response contains the process_id and details about the API call's success.

Retrieving Results

The chunking process may take some time. You can wait for completion using the wait() method:

result = preprocess.wait()
print(result.data['chunks'])

In more complex workflows, store the process_id and retrieve the result later:

# Start chunking process
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
preprocess.chunk()
process_id = preprocess.get_process_id()

# In a different flow
preprocess = Preprocess(api_key=YOUR_API_KEY, process_id=process_id)
result = preprocess.wait()
print(result.data['chunks'])
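
One way to carry the process_id between the two flows is to persist it, for example to a small JSON file. A minimal sketch; the file name and helper functions are illustrative, not part of the SDK:

```python
import json
from pathlib import Path

STATE_FILE = Path("preprocess_state.json")  # illustrative location

def save_process_id(process_id: str) -> None:
    """Persist the process_id so a later flow can resume the job."""
    STATE_FILE.write_text(json.dumps({"process_id": process_id}))

def load_process_id() -> str:
    """Reload the process_id saved by an earlier flow."""
    return json.loads(STATE_FILE.read_text())["process_id"]

save_process_id("abc-123")
print(load_process_id())  # abc-123
```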

Alternatively, use the result() method to check if the process is complete:

result = preprocess.result()
if result.data['process']['status'] == "FINISHED": 
    print(result.data['chunks'])
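
If you prefer not to block on wait(), you can poll result() on an interval. A generic polling helper, sketched with the status check factored out so it can be reused; the helper name, interval, and attempt cap are illustrative:

```python
import time

def poll_until(fetch, is_done, interval=5.0, max_attempts=60):
    """Call fetch() every `interval` seconds until is_done(result)
    returns True; return the final result or raise TimeoutError."""
    for _ in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        time.sleep(interval)
    raise TimeoutError("process did not finish in time")

# With the SDK, this could be used as:
# result = poll_until(
#     preprocess.result,
#     lambda r: r.data["process"]["status"] == "FINISHED",
# )
```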

Other Useful Methods

Here are additional methods available in the SDK:

  • set_filepath(path): Set the file path after initialization.
  • set_process_id(id): Set the process_id directly from an ID string.
  • set_process(PreprocessResponse): Set the process_id using a PreprocessResponse object.
  • set_options(dict): Set chunking options using a dictionary.
  • to_json(): Return a JSON string representing the current object.
  • get_process_id(): Retrieve the current process_id.
  • get_filepath(): Retrieve the file path.
  • get_options(): Retrieve the current chunking options.
