Streamlines the process of preparing documents for LLMs.

Easily chunk complex documents the same way a human would.

Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.

Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.

Highlights

  • 🔍 Visually-Driven: Open Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.
  • ✍️ Markdown Support: Basic markdown support for parsing headings, bold and italics.
  • 📊 High-Precision Table Support: Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.
  • 🛠️ Extensible: Easily implement your own post-processing steps.
  • 💡 Intuitive: Great editor support. Completion everywhere. Less time debugging.
  • 🎯 Easy: Designed to be easy to use and learn. Less time reading docs.
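
To make the contrast in the first highlight concrete, here is a minimal sketch of what naive text splitting looks like. It does not use Open Parse at all; `naive_chunks` and the sample table are hypothetical, chosen only to show how fixed-size splitting can cut through a table row.

```python
from typing import List


def naive_chunks(text: str, size: int) -> List[str]:
    """Split text into fixed-size character windows, ignoring layout."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample: a small Markdown table from a manual.
table = (
    "| Part | Torque |\n"
    "| Bolt A | 40 Nm |\n"
    "| Bolt B | 55 Nm |\n"
)

chunks = naive_chunks(table, size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends partway through the "Bolt A" row, so its torque value is split across two chunks. A layout-aware parser aims to keep a table like this together in one node.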

Example

import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)

📓 Try the sample notebook here

Requirements

Python 3.8+

Dealing with PDFs:

Extracting Tables:

  • PyMuPDF has some table detection functionality. Please see their license.
  • Table Transformer is a deep learning approach.
  • unitable is a more recent deep learning approach that seems promising (coming soon).

Installation

1. Core Library

pip install openparse

Enabling OCR Support:

PyMuPDF already contains all the logic needed to support OCR. It still needs Tesseract’s language support data, however, so Tesseract-OCR must be installed.

The location of the language support folder must be provided either through the environment variable TESSDATA_PREFIX or as a parameter in the applicable functions.

For working OCR, make sure to complete this checklist:

  1. Install Tesseract.

  2. Locate Tesseract’s language support folder. Typically you will find it here:

    • Windows: C:/Program Files/Tesseract-OCR/tessdata

    • Unix systems: /usr/share/tesseract-ocr/5/tessdata

  3. Set the environment variable TESSDATA_PREFIX

    • Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"

    • Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
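
Before running OCR, it can help to confirm that the checklist above actually took effect. The following is a minimal sketch using only the standard library; `find_tesseract_languages` is a hypothetical helper, not part of Open Parse or PyMuPDF. It only reads the variable and lists the language files it finds, assuming Tesseract's usual layout of one `<lang>.traineddata` file per language.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_tesseract_languages(prefix: Optional[str] = None) -> List[str]:
    """Return the language codes found in the tessdata folder.

    `prefix` defaults to the TESSDATA_PREFIX environment variable.
    Raises RuntimeError if the variable is unset or the folder is missing.
    """
    prefix = prefix or os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    folder = Path(prefix)
    if not folder.is_dir():
        raise RuntimeError(f"tessdata folder not found: {folder}")
    # Assumption: one <lang>.traineddata file per installed language.
    return sorted(p.stem for p in folder.glob("*.traineddata"))
```

Calling `find_tesseract_languages()` with no argument checks the environment the script was started with. An empty list means the folder exists but contains no language data.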

2. ML Table Detection (Optional)

This repository provides an optional feature to parse content from tables using the state-of-the-art Table Transformer (DETR) model. The Table Transformer model, introduced in the paper "PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents" by Smock et al., achieves best-in-class results for table extraction.

pip install "openparse[ml]"
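
Once the extra is installed, the Table Transformer can be selected when constructing the parser. The sketch below assumes, based on the project documentation, that `DocumentParser` accepts a `table_args` dictionary with `parsing_algorithm` and `min_table_confidence` keys. Treat those names as assumptions and check the documentation for your installed version. The two helper functions are hypothetical wrappers added for illustration.

```python
from typing import Any, Dict


def table_transformer_args(min_table_confidence: float = 0.8) -> Dict[str, Any]:
    """Build the table settings (key names assumed from the project docs)."""
    if not 0.0 <= min_table_confidence <= 1.0:
        raise ValueError("min_table_confidence must be between 0 and 1")
    return {
        "parsing_algorithm": "table-transformers",
        "min_table_confidence": min_table_confidence,
    }


def parse_with_table_transformer(pdf_path: str):
    """Parse a PDF with ML table detection enabled."""
    # Imported lazily: this needs `pip install "openparse[ml]"`.
    import openparse

    parser = openparse.DocumentParser(table_args=table_transformer_args())
    return parser.parse(pdf_path)
```

Raising the confidence threshold should yield fewer but more reliable table detections; a suitable value depends on your documents.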

Documentation

https://filimoa.github.io/open-parse/
