Builds a searchable sqlite knowledge base out of pdf data
https://github.com/user-attachments/assets/f803902d-509e-44ca-a293-f5ff021b7e6d
pdf2sqlite
pdf2sqlite lets you convert an arbitrarily large set of PDFs into a single sqlite database.
Why? Because you might want an LLM to be able to search and query within these documents. If you have a large set of documents, you may not be able to fit them all into the model's context window, or you may find that doing so degrades performance.
In order to make information in the PDFs more discoverable, pdf2sqlite provides a variety of labeling mechanisms.
- You can extract a short searchable "gist" of each page (using any text completion model supported by litellm) and an "abstract" for the PDF.
- Figures and tabular data are identified within the PDF, and tabular data is extracted using gmft.
- Figures and tables can be described by a vision model.
- PDF sections can be labeled with doc2vec-style embedding vectors to support semantic search. These are stored in the database using sqlite-vec.
Usage
usage: pdf2sqlite [-h] -p PDFS [PDFS ...] -d DATABASE [-s SUMMARIZER]
                  [-a ABSTRACTER] [-e EMBEDDER] [-v VISION_MODEL] [-t] [-o]
                  [-l LOWER_PIXEL_BOUND] [-z DECOMPRESSION_LIMIT]
convert pdfs into an easy-to-query sqlite DB
options:
-h, --help show this help message and exit
-p, --pdfs PDFS [PDFS ...]
pdfs to add to DB
-d, --database DATABASE
database where PDF will be added
-s, --summarizer SUMMARIZER
an LLM to summarize pdf pages (litellm naming conventions)
-a, --abstracter ABSTRACTER
an LLM to produce an abstract (litellm naming conventions)
-e, --embedder EMBEDDER
an embedding model to generate vector embeddings (litellm naming conventions)
-v, --vision_model VISION_MODEL
a vision model to describe images (litellm naming conventions)
-t, --tables use gmft to analyze tables (will also use a vision model if available)
-o, --offline offline mode for gmft (blocks hugging face telemetry, solves VPN issues)
-l, --lower_pixel_bound LOWER_PIXEL_BOUND
lower bound on pixel size for images
-z, --decompression_limit DECOMPRESSION_LIMIT
upper bound on size for decompressed images. default 75,000,000. zero disables
Invocation
You can run the latest version easily with uvx or uv tool. Here's an
example invocation (assuming you have Bedrock credentials in your
environment):
uvx pdf2sqlite --offline -p ../data/*.pdf -d data.db -a "bedrock/amazon.nova-lite-v1:0" -s "bedrock/amazon.nova-lite-v1:0" -t
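If you drive pdf2sqlite from a script rather than a shell, the same invocation can be assembled as an argument list. The following is a minimal sketch, not part of pdf2sqlite: `build_command` is a hypothetical helper, and it only uses flags that appear in the usage text above. It expands the glob itself because no shell is involved.

```python
import glob


def build_command(pdf_glob, database, summarizer=None, abstracter=None,
                  tables=False, offline=False):
    """Assemble a `uvx pdf2sqlite` argument list (hypothetical helper).

    Only flags shown in the usage text are emitted.
    """
    # Expand the pattern here, since no shell will do it for us.
    pdfs = sorted(glob.glob(pdf_glob))
    if not pdfs:
        raise FileNotFoundError(f"no PDFs match {pdf_glob!r}")

    cmd = ["uvx", "pdf2sqlite"]
    if offline:
        cmd.append("--offline")
    cmd += ["-p", *pdfs, "-d", database]
    if abstracter:
        cmd += ["-a", abstracter]
    if summarizer:
        cmd += ["-s", summarizer]
    if tables:
        cmd.append("-t")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with model names and file paths.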
Integration with an LLM
Some design guidelines:
- Pass the database schema to the LLM. The schema contains comments that describe the different columns.
- To get the most out of the database, you will probably want to write a tool that your LLM can call to convert binary PDF and image data stored in the database into images and PDF pages. A good design is to let the LLM pass in a table name, row id, and column name, and receive the relevant content as a response. The LLM will generally be able to discern the necessary inputs from the schema, so the tool will be robust against future schema changes.
- A backend that supports returning PDFs as the result of a tool call (e.g. Amazon Bedrock) may be helpful, although it will probably also work to return the PDF as a separate content block alongside a tool call result that just says "success, PDF will be delivered" or something similar.
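The first two guidelines can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions, not pdf2sqlite's own API: `read_schema` and `fetch_cell` are hypothetical names, and `fetch_cell` assumes the target table is an ordinary rowid table. No pdf2sqlite table or column names are hard-coded, since the LLM is expected to read them from the schema.

```python
import sqlite3
from contextlib import closing


def read_schema(db_path):
    """Return the CREATE statements stored in the database as one string.

    SQLite keeps the text of each CREATE statement in sqlite_master,
    so this is what you would paste into the LLM's prompt.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
        ).fetchall()
    return "\n".join(f"{sql};" for (sql,) in rows)


def _quote(identifier):
    # Double-quote an identifier, escaping embedded double quotes.
    return '"' + identifier.replace('"', '""') + '"'


def fetch_cell(db_path, table, column, row_id):
    """Return one cell (for example a PDF or image blob) for a tool call.

    Table and column names cannot be bound as SQL parameters, so they are
    checked against the live schema before being placed in the query.
    Assumes the table has a rowid (hypothetical simplification).
    """
    with closing(sqlite3.connect(db_path)) as conn:
        tables = {name for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        if table not in tables:
            raise ValueError(f"unknown table: {table!r}")

        columns = {name for (name,) in conn.execute(
            "SELECT name FROM pragma_table_info(?)", (table,))}
        if column not in columns:
            raise ValueError(f"unknown column: {column!r}")

        row = conn.execute(
            f"SELECT {_quote(column)} FROM {_quote(table)} WHERE rowid = ?",
            (row_id,),
        ).fetchone()

    if row is None:
        raise LookupError(f"no row {row_id} in {table!r}")
    return row[0]
```

Because the tool validates names against whatever schema it finds at call time, it keeps working if later pdf2sqlite versions rename or add tables. The returned bytes can then be wrapped in whatever image or document content block your LLM backend expects.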