
dsParse

dsParse is a sub-module of dsRAG that does multimodal file parsing, semantic sectioning, and chunking. You provide a file path (and some config params) and receive nice clean chunks.

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "gemini",
            "model": "gemini-1.5-pro-002",
        }
    },
    file_path="path/to/file.pdf",
)

dsParse can be used on its own, as shown above, or in conjunction with a dsRAG knowledge base. To use it with dsRAG, call the add_document method as you normally would, but set use_vlm to True in the file_parsing_config dictionary and include a vlm_config.

import os

from dsrag.knowledge_base import KnowledgeBase

kb = KnowledgeBase(kb_id="mck_energy_test")
kb.add_document(
    doc_id="mck_energy_report",
    file_path=file_path,
    document_title="McKinsey Energy Report",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "vertex_ai",
            "model": "gemini-1.5-pro-002",
            "project_id": os.environ["VERTEX_PROJECT_ID"],
            "location": "us-central1",
        }
    }
)

Installation

If you want to use dsParse on its own, without installing the full dsrag package, you can install the standalone dsparse package with pip install dsparse. If you already have dsrag installed, you do not need to install dsparse separately.

To use the VLM file parsing functionality, you'll need to install one external dependency: poppler. This is used to convert PDFs to page images. On a Mac you can install it with brew install poppler.

Multimodal file parsing

dsParse uses a vision language model (VLM) to parse documents. This has a few advantages:

  • It can provide descriptions for visual elements, like images and figures.
  • It can parse documents that don't have extractable text (i.e. those that require OCR).
  • It can accurately parse documents with complex structures.
  • It can accurately categorize page content into element types.

When it comes across an element on the page that can't be accurately represented with text alone, like an image or figure (chart, graph, diagram, etc.), it provides a text description of it. This can then be used in the embedding and retrieval pipeline.

The default model, gemini-1.5-flash-002, is a fast and cost-effective option. gemini-1.5-pro-002 is also supported, and works extremely well, but at a higher cost. These models can be accessed through either the Gemini API or the Vertex API.

Element types

Page content is categorized into the following eight categories by default:

  • NarrativeText
  • Figure
  • Image
  • Table
  • Header
  • Footnote
  • Footer
  • Equation

You can also choose to define your own categories and the VLM will be prompted accordingly.

You can choose to exclude certain element types. By default, Header and Footer elements are excluded, as they rarely contain valuable information and they break up the flow between pages. For example, to exclude footnotes in addition to headers and footers, you would set exclude_elements = ["Header", "Footer", "Footnote"].
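
For illustration, here is roughly what custom element types and exclusions might look like in the file_parsing_config. The element_types entries (name/instructions/is_visual keys) follow the shape described in the dsRAG docs, but treat the exact schema as an assumption rather than a guarantee:

file_parsing_config = {
    "use_vlm": True,
    "vlm_config": {
        "provider": "gemini",
        "model": "gemini-1.5-flash-002",
        # Exclude footnotes in addition to the default headers and footers
        "exclude_elements": ["Header", "Footer", "Footnote"],
        # Custom categories; the VLM is prompted with these definitions
        "element_types": [
            {
                "name": "NarrativeText",
                "instructions": "Body text: paragraphs and prose.",
                "is_visual": False,
            },
            {
                "name": "Chart",
                "instructions": "Charts and graphs; describe axes, series, and key takeaways.",
                "is_visual": True,
            },
        ],
    },
}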

Using page images for full multimodal RAG functionality

While modern VLMs, like Gemini and Claude 3.5, are now better than traditional OCR and bounding box extraction methods at converting visual elements on a page to text or bounding boxes, they still aren’t perfect. For fully visual elements, like images or charts, getting an accurate bounding box that includes all necessary surrounding context, like legends and axis titles, is only about 90% reliable with even the best VLMs. For semi-visual content, like tables and equations, conversion to plain text is also not quite perfect yet. The problem with errors at the file parsing stage is that they propagate all the way to the generation stage.

For all of these element types, it’s more reliable to just send the original page images to the generative model as context. That ensures no context is lost, and that OCR and other parsing errors don’t propagate to the final response generated by the model. Images are generally no more expensive to process than extracted text (with the exception of a few models, like GPT-4o Mini, that have unusual image input pricing). In fact, for pages with dense text, a full page image might actually be cheaper than using the text itself.

Semantic sectioning and chunking

Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM also generates descriptive titles for each section. When using dsParse with a dsRAG knowledge base, these section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval.

The default model for semantic sectioning is gpt-4o-mini, but similarly strong models like gemini-1.5-flash-002 will also work well.
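
Here is a minimal sketch of how the sectioning model might be configured when calling parse_and_chunk. The semantic_sectioning_config keys shown (use_semantic_sectioning, llm_provider, model) mirror the dsRAG docs; treat the exact names and supported providers as assumptions:

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_path="path/to/file.pdf",
    semantic_sectioning_config={
        "use_semantic_sectioning": True,  # set False to skip straight to fixed-size chunking
        "llm_provider": "openai",         # assumed provider name
        "model": "gpt-4o-mini",           # the default sectioning model
    },
)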

Cost and latency/throughput estimation

VLM file parsing

An obvious concern with using a large model like gemini-1.5-pro-002 to parse documents is the cost. Let's run the numbers:

VLM file parsing cost calculation (gemini-1.5-pro-002)

  • Image input: 1 image x $0.00032875 per image = $0.00032875
  • Text input (prompt): 400 tokens x $1.25 per 1M tokens = $0.0005
  • Text output: 600 tokens x $5.00 per 1M tokens = $0.003
  • Total: $0.00382875 per page, or $3.83 per 1000 pages

This is actually cheaper than most commercially available PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages.

What about gemini-1.5-flash-002? Running the same calculation as above with the Gemini 1.5 Flash pricing gives a cost of $0.23 per 1000 pages. This is far cheaper than any commercially available OCR/PDF parsing service.
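
As a sanity check, the arithmetic behind both numbers fits in a few lines of Python. The gemini-1.5-pro-002 prices come from the calculation above; the gemini-1.5-flash-002 prices ($0.00002 per image, $0.075 per 1M input tokens, $0.30 per 1M output tokens) are the public list prices at the time of writing and should be treated as assumptions:

# Per-page cost: one page image in, a prompt in, and a structured description out
def cost_per_page(image_price, input_per_m, output_per_m, input_tokens=400, output_tokens=600):
    return image_price + input_tokens * input_per_m / 1e6 + output_tokens * output_per_m / 1e6

pro = cost_per_page(0.00032875, 1.25, 5.00)   # gemini-1.5-pro-002
flash = cost_per_page(0.00002, 0.075, 0.30)   # gemini-1.5-flash-002 (assumed prices)

print(f"pro:   ${pro * 1000:.2f} per 1000 pages")    # ~$3.83
print(f"flash: ${flash * 1000:.2f} per 1000 pages")  # ~$0.23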

What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor then is the rate limits imposed by the VLM provider. The current rate limit for gemini-1.5-pro-002 is 1000 requests per minute. Since dsParse uses one request per page, that means the limit is 1000 pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
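
Because each page is an independent request, a simple thread pool gets you most of the way to the rate limit. A minimal sketch, where process_page is a hypothetical stand-in for one VLM request per page (not part of the dsParse API):

from concurrent.futures import ThreadPoolExecutor

def process_page(page_image):
    # Hypothetical: send one page image to the VLM and return parsed elements
    ...

def parse_pages(page_images, max_workers=50):
    # At 1000 requests/minute and ~15-20s per request, steady-state concurrency
    # tops out around 250-300 in-flight requests; stay comfortably below that.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_page, page_images))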

Semantic sectioning

Semantic sectioning uses a much cheaper model, and it also uses far fewer output tokens, so it ends up being far cheaper than the file parsing step.

Semantic sectioning cost calculation (gpt-4o-mini)

  • Input: 800 tokens x $0.15 per 1M tokens = $0.00012
  • Output: 50 tokens x $0.60 per 1M tokens = $0.00003
  • Total: $0.00015 per page, or $0.15 per 1000 pages

Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. These mega-chunks have to be processed sequentially for each document. Processing each mega-chunk takes only a couple of seconds, though, so even a large document of a few hundred pages will only take 20-60 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
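
To make that concrete, here is a rough timing estimate under stated assumptions (about 500 tokens per page, and the ~2 seconds per mega-chunk figure from above):

pages = 300
tokens_per_page = 500       # assumption: typical dense-text page
mega_chunk_tokens = 5000    # the ~5000 token mega-chunk size
seconds_per_chunk = 2       # the "couple seconds" figure above

mega_chunks = pages * tokens_per_page / mega_chunk_tokens      # 30 mega-chunks
print(f"~{mega_chunks * seconds_per_chunk:.0f}s sequential")   # ~60 seconds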

