Streamlines the process of preparing documents for LLMs.
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.
How is this different from other layout parsers?
✂️ Text Splitting
Text splitting converts a file to raw text and slices it up.
- You lose the ability to easily overlay the chunk on the original PDF.
- You ignore the underlying semantic structure of the file: headings, sections, and bullets carry valuable information.
- No support for tables, images or markdown.
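To make the limitation concrete, here is a minimal sketch of naive fixed-size text splitting. It is not part of Open Parse; the function name `split_fixed` and the sample text are made up for illustration. Note how a heading ends up in the middle of a chunk, detached from the section it introduces.

```python
from typing import List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Slice text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample: two short sections, each with a heading.
sample = "# Safety\nTurn off the gas.\n# Setup\nLevel the home."

chunks = split_fixed(sample, 20)
for chunk in chunks:
    # repr() makes the newlines visible so the cut points are easy to see.
    print(repr(chunk))
```

The splitter only sees characters, so the "# Setup" heading lands in the same chunk as the tail of the previous section, and the position of each chunk on the original page is lost.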
🤖 ML Layout Parsers
There are some fantastic libraries, like layout-parser.
- They can identify various elements like text blocks, images, and tables, but they are not built to group related content effectively.
- They strictly focus on layout parsing - you will need to add another model to extract markdown from the images, parse tables, group nodes, etc.
- We've found performance to be sub-optimal on many documents while also being computationally heavy.
💼 Commercial Solutions
Highlights
- 🔍 Visually-Driven: Open-Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.
- ✍️ Markdown Support: Basic markdown support for parsing headings, bold and italics.
- 📊 High-Precision Table Support: Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.
- 🛠️ Extensible: Easily implement your own post-processing steps.
- 💡 Intuitive: Great editor support. Completion everywhere. Less time debugging.
- 🎯 Easy: Designed to be easy to use and learn. Less time reading docs.
Example
import openparse
basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)
for node in parsed_basic_doc.nodes:
    print(node)
📓 Try the sample notebook here
Requirements
Python 3.8+
Dealing with PDFs:
- pdfminer.six: fully open source.
Extracting Tables:
- PyMuPDF has some table detection functionality. Please see their license.
- Table Transformer is a deep learning approach.
- unitable is another transformer-based approach with state-of-the-art performance.
Installation
1. Core Library
pip install openparse
Enabling OCR Support:
PyMuPDF already contains all the logic needed for OCR, but it also needs Tesseract’s language support data, so Tesseract-OCR must still be installed.
Provide the location of the language support folder either through the environment variable "TESSDATA_PREFIX" or as a parameter to the applicable functions.
For OCR to work, complete this checklist:
- Install Tesseract.
- Locate Tesseract’s language support folder. Typically you will find it here:
  - Windows: C:/Program Files/Tesseract-OCR/tessdata
  - Unix systems: /usr/share/tesseract-ocr/5/tessdata
- Set the environment variable TESSDATA_PREFIX:
  - Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
  - Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
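Before running OCR, it can help to confirm that the variable is actually visible to your script. The following is a minimal sketch using only the standard library; the helper name `find_tessdata` is hypothetical and not part of Open Parse or PyMuPDF. It only reads the environment and does not try to set it, in line with the note above.

```python
import os
from typing import Mapping, Optional


def find_tessdata(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the TESSDATA_PREFIX directory if it is set and exists, else None."""
    env = os.environ if env is None else env
    path = env.get("TESSDATA_PREFIX", "").strip()
    if path and os.path.isdir(path):
        return path
    return None


tessdata = find_tessdata()
if tessdata is None:
    print("TESSDATA_PREFIX is unset or does not point to a directory.")
else:
    print(f"Using Tesseract language data from {tessdata}")
```

If the check reports that the variable is unset, set it in the shell (or with setx on Windows) and start a new session before rerunning the script.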
2. ML Table Detection (Optional)
This repository provides an optional feature to parse content from tables using the state-of-the-art Table Transformer (DETR) model. The Table Transformer model, introduced in the paper "PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents" by Smock et al., achieves best-in-class results for table extraction.
pip install "openparse[ml]"
Then download the model weights with:
openparse-download
Documentation