
dsParse

dsParse is a sub-module of dsRAG that does multimodal file parsing, semantic sectioning, and chunking. You provide a file path (and some config params) and receive nice clean chunks.

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "gemini",
            "model": "gemini-1.5-pro-002",
        }
    },
    file_path="path/to/file.pdf",
)

dsParse can be used on its own, as shown above, or in conjunction with a dsRAG knowledge base. To use it with dsRAG, call the add_document method as you normally would, but set use_vlm to True in the file_parsing_config dictionary and include a vlm_config.

import os

from dsrag.knowledge_base import KnowledgeBase

kb = KnowledgeBase(kb_id="mck_energy_test")
kb.add_document(
    doc_id="mck_energy_report",
    file_path=file_path,
    document_title="McKinsey Energy Report",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "vertex_ai",
            "model": "gemini-1.5-pro-002",
            "project_id": os.environ["VERTEX_PROJECT_ID"],
            "location": "us-central1",
        }
    }
)

Installation

If you want to use dsParse on its own, without installing the full dsrag package, you can install the standalone dsparse package with pip install dsparse. If you already have dsrag installed, you do not need to install dsparse separately.

To use the VLM file parsing functionality, you'll need to install one external dependency: poppler. This is used to convert PDFs to page images. On a Mac you can install it with brew install poppler.

Multimodal file parsing

dsParse uses a vision language model (VLM) to parse documents. This has a few advantages:

  • It can provide descriptions for visual elements, like images and figures.
  • It can parse documents that don't have extractable text (i.e. those that require OCR).
  • It can accurately parse documents with complex structures.
  • It can accurately categorize page content into element types.

When it comes across an element on the page that can't be accurately represented with text alone, like an image or figure (chart, graph, diagram, etc.), it provides a text description of it. This can then be used in the embedding and retrieval pipeline.

The default model, gemini-1.5-flash-002, is a fast and cost-effective option. gemini-1.5-pro-002 is also supported, and works extremely well, but at a higher cost. These models can be accessed through either the Gemini API or the Vertex API.

Element types

Page content is categorized into the following eight categories by default:

  • NarrativeText
  • Figure
  • Image
  • Table
  • Header
  • Footnote
  • Footer
  • Equation

You can also choose to define your own categories and the VLM will be prompted accordingly.

You can choose to exclude certain element types. By default, Header and Footer elements are excluded, as they rarely contain valuable information and they break up the flow between pages. For example, to exclude footnotes in addition to headers and footers, you would set exclude_elements = ["Header", "Footer", "Footnote"].
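
For illustration, here is roughly what custom element types and exclusions might look like in the file_parsing_config. The element_types entries (name/instructions/is_visual keys) follow the shape described in the dsRAG docs, but treat the exact schema as an assumption rather than a guarantee:

file_parsing_config = {
    "use_vlm": True,
    "vlm_config": {
        "provider": "gemini",
        "model": "gemini-1.5-flash-002",
        # Exclude footnotes in addition to the default headers and footers
        "exclude_elements": ["Header", "Footer", "Footnote"],
        # Custom categories; the VLM is prompted with these definitions
        "element_types": [
            {
                "name": "NarrativeText",
                "instructions": "Body text: paragraphs and prose.",
                "is_visual": False,
            },
            {
                "name": "Chart",
                "instructions": "Charts and graphs; describe axes, series, and key takeaways.",
                "is_visual": True,
            },
        ],
    },
}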

Using page images for full multimodal RAG functionality

While modern VLMs, like Gemini and Claude 3.5, are now better than traditional OCR and bounding box extraction methods at converting visual elements on a page to text or bounding boxes, they still aren’t perfect. For fully visual elements, like images or charts, getting an accurate bounding box that includes all necessary surrounding context, like legends and axis titles, is only about 90% reliable with even the best VLMs. For semi-visual content, like tables and equations, conversion to plain text is also not quite perfect yet. The problem with errors at the file parsing stage is that they propagate all the way to the generation stage.

For all of these element types, it’s more reliable to just send the original page images to the generative model as context. That ensures no context is lost, and that OCR and other parsing errors don’t propagate to the final response generated by the model. Images are generally no more expensive to process than extracted text (with the exception of a few models, like GPT-4o Mini, that have unusual image input pricing). In fact, for pages with dense text, a full page image might actually be cheaper than using the text itself.

Semantic sectioning and chunking

Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM also generates descriptive titles for each section. When using dsParse with a dsRAG knowledge base, these section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval.

The default model for semantic sectioning is gpt-4o-mini, but similarly strong models like gemini-1.5-flash-002 will also work well.
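
Here is a minimal sketch of how the sectioning model might be configured when calling parse_and_chunk. The semantic_sectioning_config keys shown (use_semantic_sectioning, llm_provider, model) mirror the dsRAG docs; treat the exact names and supported providers as assumptions:

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_path="path/to/file.pdf",
    semantic_sectioning_config={
        "use_semantic_sectioning": True,  # set False to skip straight to fixed-size chunking
        "llm_provider": "openai",         # assumed provider name
        "model": "gpt-4o-mini",           # the default sectioning model
    },
)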

Cost and latency/throughput estimation

VLM file parsing

An obvious concern with using a large model like gemini-1.5-pro-002 to parse documents is the cost. Let's run the numbers:

VLM file parsing cost calculation (gemini-1.5-pro-002)

  • Image input: 1 image x $0.00032875 per image = $0.00032875
  • Text input (prompt): 400 tokens x $1.25 per 1M tokens = $0.0005
  • Text output: 600 tokens x $5.00 per 1M tokens = $0.003
  • Total: $0.00382875 per page, or $3.83 per 1000 pages

This is actually cheaper than most commercially available PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages.

What about gemini-1.5-flash-002? Running the same calculation as above with the Gemini 1.5 Flash pricing gives a cost of $0.23 per 1000 pages. This is far cheaper than any commercially available OCR/PDF parsing service.
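
As a sanity check, the arithmetic behind both numbers fits in a few lines of Python. The gemini-1.5-pro-002 prices come from the calculation above; the gemini-1.5-flash-002 prices ($0.00002 per image, $0.075 per 1M input tokens, $0.30 per 1M output tokens) are the public list prices at the time of writing and should be treated as assumptions:

# Per-page cost: one page image in, a prompt in, and a structured description out
def cost_per_page(image_price, input_per_m, output_per_m, input_tokens=400, output_tokens=600):
    return image_price + input_tokens * input_per_m / 1e6 + output_tokens * output_per_m / 1e6

pro = cost_per_page(0.00032875, 1.25, 5.00)   # gemini-1.5-pro-002
flash = cost_per_page(0.00002, 0.075, 0.30)   # gemini-1.5-flash-002 (assumed prices)

print(f"pro:   ${pro * 1000:.2f} per 1000 pages")    # ~$3.83
print(f"flash: ${flash * 1000:.2f} per 1000 pages")  # ~$0.23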

What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor then is the rate limits imposed by the VLM provider. The current rate limit for gemini-1.5-pro-002 is 1000 requests per minute. Since dsParse uses one request per page, that means the limit is 1000 pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
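
Because each page is an independent request, a simple thread pool gets you most of the way to the rate limit. A minimal sketch, where process_page is a hypothetical stand-in for one VLM request per page (not part of the dsParse API):

from concurrent.futures import ThreadPoolExecutor

def process_page(page_image):
    # Hypothetical: send one page image to the VLM and return parsed elements
    ...

def parse_pages(page_images, max_workers=50):
    # At 1000 requests/minute and ~15-20s per request, steady-state concurrency
    # tops out around 250-300 in-flight requests; stay comfortably below that.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_page, page_images))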

Semantic sectioning

Semantic sectioning uses a much cheaper model, and it also uses far fewer output tokens, so it ends up being far cheaper than the file parsing step.

Semantic sectioning cost calculation (gpt-4o-mini)

  • Input: 800 tokens x $0.15 per 1M tokens = $0.00012
  • Output: 50 tokens x $0.60 per 1M tokens = $0.00003
  • Total: $0.00015 per page, or $0.15 per 1000 pages

Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. These mega-chunks have to be processed sequentially for each document. Processing each mega-chunk takes only a couple of seconds, though, so even a large document of a few hundred pages will only take 20-60 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
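
To make that concrete, here is a rough timing estimate under stated assumptions (about 500 tokens per page, and the ~2 seconds per mega-chunk figure from above):

pages = 300
tokens_per_page = 500       # assumption: typical dense-text page
mega_chunk_tokens = 5000    # the ~5000 token mega-chunk size
seconds_per_chunk = 2       # the "couple seconds" figure above

mega_chunks = pages * tokens_per_page / mega_chunk_tokens      # 30 mega-chunks
print(f"~{mega_chunks * seconds_per_chunk:.0f}s sequential")   # ~60 seconds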

