Streamlines the process of preparing documents for LLMs.
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.
Highlights
- 🔍 Visually-Driven: Open Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.
- ✍️ Markdown Support: Basic markdown support for parsing headings, bold and italics.
- 📊 High-Precision Table Support: Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.
- 🛠️ Extensible: Easily implement your own post-processing steps.
- 💡 Intuitive: Great editor support. Completion everywhere. Less time debugging.
- 🎯 Easy: Designed to be easy to use and learn. Less time reading docs.
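For contrast, the "naive text splitting" mentioned above can be sketched as fixed-size character chunking. This is not how Open Parse works; it is only a minimal illustration of the baseline, and the helper name `naive_split` and the sample table text are hypothetical.

```python
from typing import List


def naive_split(text: str, chunk_size: int) -> List[str]:
    """Split text into fixed-size character chunks, ignoring layout entirely."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


# A tiny Markdown-style table used as hypothetical sample input.
sample = "| Part | Qty |\n| Bolt | 4 |\n| Nut | 4 |"

# The splitter has no notion of rows or cells, so a chunk boundary
# can fall in the middle of a table cell.
chunks = naive_split(sample, 20)
for chunk in chunks:
    print(repr(chunk))
```

Because the boundaries depend only on character counts, a table row or heading can end up divided between two chunks, which is the failure mode a layout-aware parser aims to avoid.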
Example
import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"

# Parse the PDF into nodes.
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

# Print each parsed node.
for node in parsed_basic_doc.nodes:
    print(node)
📓 Try the sample notebook here
Requirements
Python 3.8+
Dealing with PDFs:
- pdfminer.six: fully open source.
Extracting Tables:
- PyMuPDF has some table detection functionality. Please see their license.
- Table Transformer is a deep learning approach.
- unitable is a more recent deep learning approach that seems promising (coming soon).
Installation
1. Core Library
pip install openparse
Enabling OCR Support:
PyMuPDF already contains all the logic needed to support OCR functions, but it also needs Tesseract’s language support data, so Tesseract-OCR must still be installed.
The location of the language support folder must be communicated either by storing it in the environment variable "TESSDATA_PREFIX" or by passing it as a parameter to the applicable functions.
For working OCR functionality, complete this checklist:
- Install Tesseract.
- Locate Tesseract’s language support folder. Typically you will find it here:
  - Windows: C:/Program Files/Tesseract-OCR/tessdata
  - Unix systems: /usr/share/tesseract-ocr/5/tessdata
- Set the environment variable TESSDATA_PREFIX:
  - Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
  - Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
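Before running OCR, it may help to confirm from Python that the variable is visible and points at a real folder. The following is a minimal sketch that only reads the environment (it does not set anything, in line with the note above); the helper name `tessdata_problem` is hypothetical and not part of Open Parse or PyMuPDF.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def tessdata_problem(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a description of the problem, or None if TESSDATA_PREFIX looks usable."""
    # Default to the real process environment; a mapping can be passed for testing.
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX", "").strip()
    if not prefix:
        return "TESSDATA_PREFIX is not set"
    if not Path(prefix).is_dir():
        return f"TESSDATA_PREFIX points to a missing folder: {prefix}"
    return None


problem = tessdata_problem()
print(problem or "TESSDATA_PREFIX looks usable")
```

This check only verifies that the folder exists; it does not confirm that the language data files inside it are valid.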
2. ML Table Detection (Optional)
This repository provides an optional feature to parse content from tables using the state-of-the-art Table Transformer (DETR) model. The Table Transformer model, introduced in the paper "PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents" by Smock et al., achieves best-in-class results for table extraction.
pip install "openparse[ml]"
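After installing the extra, one way to confirm that the optional dependencies are importable is a small check like the sketch below. The helper name `missing_modules` is hypothetical, and the module names `torch` and `transformers` are assumptions about what the `ml` extra installs; adjust them to match the actual dependency list.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec locates a module without importing it, so this stays cheap.
    return [name for name in names if find_spec(name) is None]


# Assumed module names, not taken from the Open Parse documentation.
ASSUMED_ML_MODULES = ["torch", "transformers"]

missing = missing_modules(ASSUMED_ML_MODULES)
if missing:
    print("Missing optional modules:", ", ".join(missing))
else:
    print("All assumed ML modules were found")
```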
Documentation
Hashes for openparse-0.4.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9fec60455042f9fbab5bf86197dcc665894a78494464cef88dc74b5e5451853d
MD5 | 6f937d012c1b35da24bcdaf6da5b96bb
BLAKE2b-256 | bd29849b8ea38c2e52f15625035b44d7dddb6155c6c9a55ad82edba345d49484