Multi-modal file parsing and chunking
dsParse
dsParse is a sub-module of dsRAG that does multimodal file parsing, semantic sectioning, and chunking. You provide a file path (and some config params) and receive nice clean chunks.
```python
# Import path for the standalone dsparse package; if you're using the full
# dsRAG package instead, the function lives under the dsrag.dsparse subpackage.
from dsparse import parse_and_chunk

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "gemini",
            "model": "gemini-1.5-pro-002",
        }
    },
    file_path="path/to/file.pdf",
)
```
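The `sections` list holds the larger semantically cohesive units, and `chunks` holds the smaller retrieval units they get broken into. A quick way to get a feel for the output (the exact fields on each item may vary by version, so inspect one directly):

```python
# Sections are the larger units; chunks are the smaller retrieval units.
print(f"Parsed {len(sections)} sections and {len(chunks)} chunks")
print(sections[0])  # look at one section to see its exact fields
```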
dsParse can be used on its own, as shown above, or in conjunction with a dsRAG knowledge base. To use it with dsRAG, you just use the `add_document` function like normal, but set `use_vlm` to `True` in the `file_parsing_config` dictionary and include a `vlm_config`.
```python
import os

from dsrag.knowledge_base import KnowledgeBase

kb = KnowledgeBase(kb_id="mck_energy_test")
kb.add_document(
    doc_id="mck_energy_report",
    file_path=file_path,  # path to the PDF to ingest
    document_title="McKinsey Energy Report",
    file_parsing_config={
        "use_vlm": True,
        "vlm_config": {
            "provider": "vertex_ai",
            "model": "gemini-1.5-pro-002",
            "project_id": os.environ["VERTEX_PROJECT_ID"],
            "location": "us-central1",
        }
    }
)
```
Installation
If you want to use dsParse on its own, without installing the full `dsrag` package, there is a standalone Python package available for dsParse, which can be installed with `pip install dsparse`. If you already have `dsrag` installed, you do NOT need to separately install `dsparse`.

To use the VLM file parsing functionality, you'll need to install one external dependency: poppler. This is used to convert PDFs to page images. On a Mac you can install it with `brew install poppler`.
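If you want to verify that poppler is installed correctly before running a full parse, you can convert a page yourself. The snippet below uses `pdf2image`, a common Python wrapper around poppler; whether dsParse uses this exact library internally is an assumption, but the check exercises the same poppler binaries either way:

```python
# Quick check that poppler is installed and on your PATH.
# pdf2image is a thin wrapper around poppler's pdftoppm/pdftocairo tools.
from pdf2image import convert_from_path

pages = convert_from_path("path/to/file.pdf", dpi=150)  # one PIL image per page
print(f"Converted {len(pages)} pages to images")
```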
Multimodal file parsing
dsParse uses a vision language model (VLM) to parse documents. This has a few advantages:
- It can provide descriptions for visual elements, like images and figures.
- It can parse documents that don't have extractable text (i.e. those that require OCR).
- It can accurately parse documents with complex structures.
- It can accurately categorize page content into element types.
When it comes across an element on the page that can't be accurately represented with text alone, like an image or figure (chart, graph, diagram, etc.), it provides a text description of it. This can then be used in the embedding and retrieval pipeline.
The default model, `gemini-1.5-flash-002`, is a fast and cost-effective option. `gemini-1.5-pro-002` is also supported, and works extremely well, but at a higher cost. These models can be accessed through either the Gemini API or the Vertex API.
Element types
Page content is categorized into the following eight categories by default:
- NarrativeText
- Figure
- Image
- Table
- Header
- Footnote
- Footer
- Equation
You can also choose to define your own categories, and the VLM will be prompted accordingly.

You can choose to exclude certain element types. By default, Header and Footer elements are excluded, as they rarely contain valuable information and they break up the flow between pages. For example, if you wanted to exclude footnotes in addition to headers and footers, you would use `exclude_elements = ["Header", "Footer", "Footnote"]`, as in the sketch below.
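Here is a minimal sketch of how that option might be passed. Placing `exclude_elements` inside `vlm_config` is an assumption modeled on the config examples above; check the dsParse docs for the exact key placement in your version:

```python
# Configuration sketch: the placement of exclude_elements inside vlm_config
# is an assumption, not a verified API reference.
file_parsing_config = {
    "use_vlm": True,
    "vlm_config": {
        "provider": "gemini",
        "model": "gemini-1.5-flash-002",
        # drop footnotes as well as the default headers/footers
        "exclude_elements": ["Header", "Footer", "Footnote"],
    }
}
```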
Using page images for full multimodal RAG functionality
While modern VLMs, like Gemini and Claude 3.5, are now better than traditional OCR and bounding box extraction methods at converting visual elements on a page to text or bounding boxes, they still aren’t perfect. For fully visual elements, like images or charts, getting an accurate bounding box that includes all necessary surrounding context, like legends and axis titles, is only about 90% reliable with even the best VLM models. For semi-visual content, like tables and equations, converting to plain text is also not quite perfect yet. The problem with errors at the file parsing stage is that they propagate all the way to the generation stage.
For all of these element types, it’s more reliable to just send in the original page images to the generative model as context. That ensures that no context is lost, and that OCR and other parsing errors don’t propagate to the final response generated by the model. Images are no more expensive to process than extracted text (with the exception of a few models, like GPT-4o Mini, with weird image input pricing). In fact, for pages with dense text, a full page image might actually be cheaper than using the text itself.
Semantic sectioning and chunking
Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting lines for each “semantically cohesive section.” These sections should be anywhere from a few paragraphs to a few pages long. The sections then get broken into smaller chunks if needed. The LLM also generates descriptive titles for each section. When using dsParse with a dsRAG knowledge base, these section titles get used in the contextual chunk headers created by AutoContext, which provides additional context to the ranking models (embeddings and reranker), enabling better retrieval.
The default model for semantic sectioning is `gpt-4o-mini`, but similarly strong models like `gemini-1.5-flash-002` will also work well.
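If you want to swap in a different sectioning model, it is configured alongside the file parsing options. The config below is a sketch: the `semantic_sectioning_config` parameter and its key names are assumptions modeled on dsRAG's config style, so confirm them against the docs:

```python
# Sketch only: semantic_sectioning_config and its keys are assumptions
# modeled on dsRAG's config conventions; check the docs for exact names.
semantic_sectioning_config = {
    "use_semantic_sectioning": True,
    "llm_provider": "openai",
    "model": "gpt-4o-mini",
}

sections, chunks = parse_and_chunk(
    kb_id="sample_kb",
    doc_id="sample_doc",
    file_path="path/to/file.pdf",
    semantic_sectioning_config=semantic_sectioning_config,
)
```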
Cost and latency/throughput estimation
VLM file parsing
An obvious concern with using a large model like `gemini-1.5-pro-002` to parse documents is the cost. Let's run the numbers:

VLM file parsing cost calculation (`gemini-1.5-pro-002`)
- Image input: 1 image x $0.00032875 per image = $0.00032875
- Text input (prompt): 400 tokens x $1.25/10^6 per token = $0.000500
- Text output: 600 tokens x $5.00/10^6 per token = $0.003000
- Total: $0.00382875/page or $3.83 per 1000 pages
This is actually cheaper than most commercially available PDF parsing services. Unstructured and Azure Document Intelligence, for example, both cost $10 per 1000 pages.
What about `gemini-1.5-flash-002`? Running the same calculation as above with the Gemini 1.5 Flash pricing gives a cost of $0.23 per 1000 pages. This is far cheaper than any commercially available OCR/PDF parsing service. The arithmetic for both models is sketched below.
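For reference, here is the arithmetic for both models in one place. The per-page token counts (400 in, 600 out) come from the estimate above; the Flash prices are assumed rates consistent with the $0.23-per-1000-pages total, and all prices should be re-checked against current provider pricing:

```python
# Per-page VLM parsing cost, using the figures from the text.
# Flash prices are assumptions consistent with the stated $0.23/1000-page total.
def cost_per_page(image_cost, in_price_per_m, out_price_per_m,
                  in_tokens=400, out_tokens=600):
    return (image_cost
            + in_tokens * in_price_per_m / 1e6
            + out_tokens * out_price_per_m / 1e6)

pro = cost_per_page(0.00032875, 1.25, 5.00)   # gemini-1.5-pro-002
flash = cost_per_page(0.00002, 0.075, 0.30)   # gemini-1.5-flash-002 (assumed)
print(f"pro:   ${pro:.8f}/page  (~${pro * 1000:.2f} per 1000 pages)")
print(f"flash: ${flash:.8f}/page (~${flash * 1000:.2f} per 1000 pages)")
```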
What about latency and throughput? Since each page is processed independently, this is a highly parallelizable problem. The main limiting factor, then, is the rate limits imposed by the VLM provider. The current rate limit for `gemini-1.5-pro-002` is 1000 requests per minute. Since dsParse uses one request per page, that means the limit is 1000 pages per minute. Processing a single page takes around 15-20 seconds, so that's the minimum latency for processing a document.
Semantic sectioning
Semantic sectioning uses a much cheaper model, and it also uses far fewer output tokens, so it ends up being far cheaper than the file parsing step.
Semantic sectioning cost calculation (`gpt-4o-mini`)
- Input: 800 tokens x $0.15/10^6 per token = $0.00012
- Output: 50 tokens x $0.60/10^6 per token = $0.00003
- Total: $0.00015/page or $0.15 per 1000 pages
Document text is processed in ~5000 token mega-chunks, which is roughly ten pages on average. But these mega-chunks have to be processed sequentially for each document. Processing each mega-chunk only takes a couple seconds, though, so even a large document of a few hundred pages will only take 20-60 seconds. Rate limits for the OpenAI API are heavily dependent on the usage tier you're in.
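The same back-of-the-envelope arithmetic, for both the sectioning cost and the sequential latency estimate (the ~2-seconds-per-mega-chunk figure is the rough estimate from the text):

```python
# Semantic sectioning cost per page, using the figures above.
sectioning_cost = 800 * 0.15 / 1e6 + 50 * 0.60 / 1e6  # input + output tokens
print(f"${sectioning_cost:.5f}/page (~${sectioning_cost * 1000:.2f} per 1000 pages)")

# Sequential latency for a 300-page document: ~10 pages per mega-chunk,
# ~2 seconds per mega-chunk (rough estimates from the text).
pages = 300
mega_chunks = pages / 10
print(f"~{mega_chunks * 2:.0f} seconds of sequential sectioning")
```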