openparse

Streamlines the process of preparing documents for LLMs.

Project description

Easily chunk complex documents the same way a human would.

Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.

Open Parse is designed to fill this gap by providing a flexible, easy-to-use library capable of visually discerning document layouts and chunking them effectively.

How is this different from other layout parsers?

✂️ Text Splitting

Text splitting converts a file to raw text and slices it up.

You lose the ability to easily overlay the chunk on the original PDF.
You ignore the underlying semantic structure of the file: headings, sections, and bullets represent valuable information.
There is no support for tables, images, or markdown.
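
For contrast, here is a minimal sketch of the naive approach in plain Python. `naive_split` and its parameters are hypothetical names used only for illustration; they are not part of Open Parse.

```python
from typing import List


def naive_split(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Slice raw text into fixed-size, overlapping character windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up text standing in for the raw text extracted from a PDF.
raw = (
    "# Warranty\n- Covers parts for 12 months\n- Excludes water damage\n"
    "# Maintenance\nInspect the roof yearly."
)
chunks = naive_split(raw)
for chunk in chunks:
    print(repr(chunk))
```

The windows cut through headings and bullets wherever the character count happens to fall, and the resulting strings carry no page or position information that could map them back onto the source file.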

🤖 ML Layout Parsers

There are some fantastic libraries like layout-parser.

While they can identify various elements like text blocks, images, and tables, they are not built to group related content effectively.
They strictly focus on layout parsing - you will need to add another model to extract markdown from the images, parse tables, group nodes, etc.
We've found performance to be sub-optimal on many documents while also being computationally heavy.

💼 Commercial Solutions

Typically priced at ≈ $10 / 1k pages.
Requires sharing your data with a vendor.

Highlights

🔍 Visually-Driven: Open Parse visually analyzes documents for superior LLM input, going beyond naive text splitting.
✍️ Markdown Support: Basic markdown support for parsing headings, bold and italics.
📊 High-Precision Table Support: Extract tables into clean Markdown formats with accuracy that surpasses traditional tools.

🛠️ Extensible: Easily implement your own post-processing steps.
💡 Intuitive: Great editor support. Completion everywhere. Less time debugging.
🎯 Easy: Designed to be easy to use and learn. Less time reading docs.

Example

Basic Example

import openparse

basic_doc_path = "./sample-docs/mobile-home-manual.pdf"
parser = openparse.DocumentParser()
parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node)

📓 Try the sample notebook here

Semantic Processing Example

Chunking documents is fundamentally about grouping similar semantic nodes together. By embedding the text of each node, we can then cluster them together based on their similarity.

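import os

from openparse import processing, DocumentParser

# Assumes the OpenAI key is stored in the OPENAI_API_KEY environment variable.
OPEN_AI_KEY = os.environ["OPENAI_API_KEY"]
basic_doc_path = "./sample-docs/mobile-home-manual.pdf"  # same file as the basic example

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=OPEN_AI_KEY,
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
parsed_content = parser.parse(basic_doc_path)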

📓 Sample notebook here
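
The idea of grouping nodes by embedding similarity can be sketched without any external service. The example below is not Open Parse's implementation: the toy 3-dimensional vectors stand in for real embeddings, and `cosine`, `group_adjacent`, and the 0.8 threshold are assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_adjacent(
    nodes: List[Tuple[str, Sequence[float]]], threshold: float = 0.8
) -> List[List[str]]:
    """Greedily append each node to the current group when it is similar
    enough to the node just before it; otherwise start a new group."""
    groups: List[List[str]] = []
    prev_vec: Optional[Sequence[float]] = None
    for text, vec in nodes:
        if prev_vec is not None and cosine(prev_vec, vec) >= threshold:
            groups[-1].append(text)
        else:
            groups.append([text])
        prev_vec = vec
    return groups


# Made-up node texts with hand-written "embeddings".
nodes = [
    ("Warranty terms", (0.9, 0.1, 0.0)),
    ("Warranty exclusions", (0.8, 0.2, 0.1)),
    ("Roof maintenance", (0.1, 0.9, 0.2)),
    ("Gutter cleaning", (0.0, 0.8, 0.3)),
]
groups = group_adjacent(nodes)
print(groups)
```

With these toy vectors the two warranty nodes end up in one group and the two maintenance nodes in another. A real pipeline would presumably also enforce size limits such as the `min_tokens` and `max_tokens` settings, which this sketch leaves out.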

Serializing Results

Open Parse uses pydantic under the hood, so you can serialize results with:

parsed_content.dict()

# or serialize to a JSON string
parsed_content.json()
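
Since `.json()` on a pydantic model returns a JSON string, the result can be written to disk and read back with the standard library. This is a small sketch: `save_json_text` is a hypothetical helper, and the literal string below is a placeholder standing in for real parser output, whose actual fields are not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_json_text(json_text: str, out_path: Path) -> Any:
    """Validate a JSON string, write it pretty-printed, and return the decoded object."""
    data = json.loads(json_text)  # raises json.JSONDecodeError on invalid input
    out_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data


# Placeholder payload; in practice pass parsed_content.json() instead.
example_text = '{"nodes": [{"text": "Hello"}]}'
out_file = Path(tempfile.gettempdir()) / "parsed_content_example.json"
saved = save_json_text(example_text, out_file)
reloaded = json.loads(out_file.read_text(encoding="utf-8"))
```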

Requirements

Python 3.8+

Dealing with PDFs:

pdfminer.six: fully open source.

Extracting Tables:

PyMuPDF has some table detection functionality. Please see their license.
Table Transformer is a deep learning approach.
unitable is another transformer-based approach with state-of-the-art performance.

Installation

1. Core Library

pip install openparse

Enabling OCR Support:

PyMuPDF already contains all the logic needed to support OCR, but it also needs Tesseract’s language support data, so Tesseract-OCR must still be installed.

The location of the language support folder must be provided either through the environment variable "TESSDATA_PREFIX" or as a parameter to the applicable functions.

For working OCR, complete this checklist:

1. Install Tesseract.
2. Locate Tesseract’s language support folder. Typically you will find it here:
   - Windows: C:/Program Files/Tesseract-OCR/tessdata
   - Unix systems: /usr/share/tesseract-ocr/5/tessdata
   - macOS (installed via Homebrew):
     - Standard installation: /opt/homebrew/share/tessdata
     - Version-specific installation: /opt/homebrew/Cellar/tesseract/<version>/share/tessdata/
3. Set the environment variable TESSDATA_PREFIX:
   - Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
   - Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
   - macOS (installed via Homebrew): export TESSDATA_PREFIX=$(brew --prefix tesseract)/share/tessdata

Note: On Windows systems, this must happen outside Python – before starting your script. Just manipulating os.environ will not work!
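
Before running OCR, it can help to confirm that the variable is visible to Python and points at a folder holding language data. The sketch below uses only the standard library; `find_tessdata_problems` is a hypothetical helper, and it only reads the environment rather than modifying it.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_tessdata_problems(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems with TESSDATA_PREFIX (empty list: looks fine)."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return ["TESSDATA_PREFIX is not set"]
    folder = Path(prefix)
    if not folder.is_dir():
        return [f"{folder} is not a directory"]
    # Tesseract language files are named like eng.traineddata.
    if not any(folder.glob("*.traineddata")):
        return [f"{folder} contains no *.traineddata language files"]
    return []


problems = find_tessdata_problems()
for problem in problems:
    print("OCR setup issue:", problem)
```

An empty result only means the folder looks plausible; it does not prove that Tesseract itself is installed correctly.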

2. ML Table Detection (Optional)

This repository provides an optional feature to parse content from tables using a variety of deep learning models.

pip install "openparse[ml]"

Then download the model weights with

openparse-download

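You can then run the parser with:

import openparse

pdf_path = "./sample-docs/mobile-home-manual.pdf"  # any PDF; this reuses the path from the basic example

parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.8,
    },
)
parsed_nodes = parser.parse(pdf_path)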

Note: we currently use table-transformers for all table detection, and we find its performance to be subpar. This negatively affects the downstream results of unitable. If you're aware of a better model, please open an issue. The unitable team mentioned they might add this soon too.

Cookbooks

https://github.com/Filimoa/open-parse/tree/main/src/cookbooks

Documentation

https://filimoa.github.io/open-parse/

Project details

Maintainers

filimoa

Release history

- 0.7.0 (this version): Nov 13, 2024
- 0.6.1: Nov 7, 2024
- 0.6.0: Sep 24, 2024
- 0.5.8: Sep 23, 2024
- 0.5.7: Jun 13, 2024
- 0.5.6: May 2, 2024
- 0.5.5: Apr 28, 2024
- 0.5.4: Apr 24, 2024
- 0.5.3: Apr 22, 2024
- 0.5.2: Apr 11, 2024
- 0.5.1: Apr 8, 2024
- 0.5.0: Apr 8, 2024
- 0.4.1: Apr 5, 2024
- 0.4.0: Apr 5, 2024
- 0.3.1: Apr 1, 2024
- 0.3.0: Mar 31, 2024
- 0.2: Mar 27, 2024

Download files

Source Distribution

openparse-0.7.0.tar.gz (83.8 kB)

Uploaded Nov 13, 2024 Source

Built Distribution

openparse-0.7.0-py3-none-any.whl (94.2 kB)

Uploaded Nov 13, 2024 Python 3

File details

Details for the file openparse-0.7.0.tar.gz.

File metadata

Download URL: openparse-0.7.0.tar.gz
Upload date: Nov 13, 2024
Size: 83.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for openparse-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`965a84ebed051063516c7e0e6e3bc7352c216d6bb54bf2786c369481689554fa`
MD5	`1ef245e357faca830d033a2fef29317e`
BLAKE2b-256	`d3d7e38a0a8fa1762a0196e0d7828d7a2269a058019c5ef455cbfc0dd4e86035`
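
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest listed above for openparse-0.7.0.tar.gz.
EXPECTED_SDIST_SHA256 = "965a84ebed051063516c7e0e6e3bc7352c216d6bb54bf2786c369481689554fa"

sdist = Path("openparse-0.7.0.tar.gz")  # assumed download location
if sdist.exists():
    print("hash ok:", matches_published_hash(sdist, EXPECTED_SDIST_SHA256))
```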

Provenance

The following attestation bundles were made for openparse-0.7.0.tar.gz:

Publisher: publish.yml on Filimoa/open-parse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openparse-0.7.0.tar.gz
- Subject digest: 965a84ebed051063516c7e0e6e3bc7352c216d6bb54bf2786c369481689554fa
- Sigstore transparency entry: 148538785
- Sigstore integration time: Nov 13, 2024
Source repository:
- Permalink: Filimoa/open-parse@8125c35dd5acc0c1e161e101b96b0f5ad048ac2b
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/Filimoa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8125c35dd5acc0c1e161e101b96b0f5ad048ac2b
- Trigger Event: release

File details

Details for the file openparse-0.7.0-py3-none-any.whl.

File metadata

Download URL: openparse-0.7.0-py3-none-any.whl
Upload date: Nov 13, 2024
Size: 94.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for openparse-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`70a07d944a5a99aa628367ede81fe7cc3b16a38fd18314e6a410c4991ba23fb6`
MD5	`9c6affd424b3547ff881eaa0f4e0c3e8`
BLAKE2b-256	`c7411c2cf979a3b2f5ac6300dc27bbcbe8ae213692c0750ac9177e74b6083305`

Provenance

The following attestation bundles were made for openparse-0.7.0-py3-none-any.whl:

Publisher: publish.yml on Filimoa/open-parse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openparse-0.7.0-py3-none-any.whl
- Subject digest: 70a07d944a5a99aa628367ede81fe7cc3b16a38fd18314e6a410c4991ba23fb6
- Sigstore transparency entry: 148538786
- Sigstore integration time: Nov 13, 2024
Source repository:
- Permalink: Filimoa/open-parse@8125c35dd5acc0c1e161e101b96b0f5ad048ac2b
- Branch / Tag: refs/tags/v0.7.0
- Owner: https://github.com/Filimoa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8125c35dd5acc0c1e161e101b96b0f5ad048ac2b
- Trigger Event: release
