Builds a searchable sqlite knowledge base out of pdf data

Maintainers

gleachkr

Project description

https://github.com/user-attachments/assets/f803902d-509e-44ca-a293-f5ff021b7e6d

pdf2sqlite

pdf2sqlite lets you convert an arbitrarily large set of PDFs into a single sqlite database.

Why? Because you might want an LLM to be able to search and query within these documents. If you have a large set of documents, you may not be able to fit them all into the model's context window, or you may find that doing so degrades performance.

In order to make information in the PDFs more discoverable, pdf2sqlite provides a variety of labeling mechanisms.

You can extract a short searchable "gist" of each page (using any text completion model supported by litellm) and an "abstract" for the PDF.
Figures and tabular data are identified within the PDF, and tabular data is extracted using gmft.
Figures and tables can be described by a vision model.
PDF sections can be labeled with doc2vec-style embedding vectors to support semantic search. These are stored in the database using sqlite-vec.

Usage

usage: pdf2sqlite [-h] -p PDFS [PDFS ...] -d DATABASE [-s SUMMARIZER] [-a ABSTRACTER] [-e EMBEDDER] [-v VISION_MODEL] [-t]
                  [-o] [-l LOWER_PIXEL_BOUND] [-z DECOMPRESSION_LIMIT]

convert pdfs into an easy-to-query sqlite DB

options:
  -h, --help            show this help message and exit
  -p, --pdfs PDFS [PDFS ...]
                        pdfs to add to DB
  -d, --database DATABASE
                        database where PDF will be added
  -s, --summarizer SUMMARIZER
                        an LLM to summarize pdf pages (litellm naming conventions)
  -a, --abstracter ABSTRACTER
                        an LLM to produce an abstract (litellm naming conventions)
  -e, --embedder EMBEDDER
                        an embedding model to generate vector embeddings (litellm naming conventions)
  -v, --vision_model VISION_MODEL
                        a vision model to describe images (litellm naming conventions)
  -t, --tables          use gmft to analyze tables (will also use a vision model if available)
  -o, --offline         offline mode for gmft (blocks hugging face telemetry, solves VPN issues)
  -l, --lower_pixel_bound LOWER_PIXEL_BOUND
                        lower bound on pixel size for images
  -z, --decompression_limit DECOMPRESSION_LIMIT
                        upper bound on size for decompressed images. default 75,000,000. zero disables

Invocation

You can run the latest version easily with uvx or uv tool. Here's an example invocation (assuming you have Bedrock credentials in your environment):

uvx pdf2sqlite --offline -p ../data/*.pdf -d data.db -a "bedrock/amazon.nova-lite-v1:0" -s "bedrock/amazon.nova-lite-v1:0" -t
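
If you want to drive the same invocation from a script, a minimal sketch along these lines may help. It only assembles the argument list from the flags shown in the usage text above; the function name `build_command` and its parameters are hypothetical, and it assumes `uvx` is available on your PATH.

```python
from typing import List, Optional, Sequence


def build_command(
    pdfs: Sequence[str],
    database: str,
    abstracter: Optional[str] = None,
    summarizer: Optional[str] = None,
    tables: bool = False,
    offline: bool = False,
) -> List[str]:
    """Assemble a pdf2sqlite argument list (hypothetical helper, not part of pdf2sqlite)."""
    if not pdfs:
        raise ValueError("at least one PDF path is required")
    cmd = ["uvx", "pdf2sqlite"]
    if offline:
        cmd.append("--offline")
    # -p takes one or more paths; the following -d flag ends the list.
    cmd += ["-p", *pdfs, "-d", database]
    if abstracter:
        cmd += ["-a", abstracter]
    if summarizer:
        cmd += ["-s", summarizer]
    if tables:
        cmd.append("-t")
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` would run the tool without going through a shell, so the PDF list has to be expanded beforehand, for example with `sorted(str(p) for p in Path("../data").glob("*.pdf"))`.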

Integration with an LLM

Some design guidelines:

Pass the database schema to the LLM. The schema will contain some comments that describe the different columns.
To get the most out of the database, you will probably want to write a tool that your LLM can call to convert binary PDF and image data stored in the database into images and PDF pages. A good design is to allow the LLM to pass in a table name, row id, and column name, and receive the relevant content as a response. The LLM will generally be able to discern the necessary inputs from the schema, so the tool will be robust against future schema changes.
A backend that supports returning PDFs as the result of a tool call (Amazon Bedrock, for example) may be helpful, although it will probably also work to return the PDF as a separate content block alongside a tool call result that just says "success, PDF will be delivered" or something similar.
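
The first two guidelines can be sketched with Python's standard sqlite3 module. This is an illustration under stated assumptions, not pdf2sqlite's own API: the function names `get_schema` and `fetch_blob` are hypothetical, and the lookup assumes the target table has a rowid. Because SQL identifiers cannot be bound as parameters, the sketch checks the table and column names against the live schema before building the query.

```python
import sqlite3
from typing import Optional


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def get_schema(conn: sqlite3.Connection) -> str:
    """Return the stored CREATE statements, suitable for pasting into a prompt."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def fetch_blob(
    conn: sqlite3.Connection, table: str, column: str, rowid: int
) -> Optional[bytes]:
    """Return the value at (table, column, rowid), or None if no such row exists."""
    tables = {
        r[0]
        for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in tables:
        raise ValueError(f"unknown table: {table!r}")
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {
        r[1] for r in conn.execute(f"PRAGMA table_info({quote_ident(table)})")
    }
    if column not in columns:
        raise ValueError(f"unknown column: {column!r}")
    row = conn.execute(
        f"SELECT {quote_ident(column)} FROM {quote_ident(table)} WHERE rowid = ?",
        (rowid,),
    ).fetchone()
    return None if row is None else row[0]
```

A tool handler could call `get_schema` once when building the system prompt, then expose `fetch_blob` as the callable tool and forward the returned bytes to the model as an image or PDF content block. Raising `ValueError` for unknown names gives the model a clear error message it can use to correct its next call.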

Release history

0.0.3

Oct 23, 2025

This version

0.0.2

Oct 10, 2025

0.0.1

Oct 6, 2025

Download files

Source Distribution

pdf2sqlite-0.0.2.tar.gz (117.6 kB)

Uploaded Oct 10, 2025 Source

Built Distribution

pdf2sqlite-0.0.2-py3-none-any.whl (20.0 kB)

Uploaded Oct 10, 2025 Python 3

File details

Details for the file pdf2sqlite-0.0.2.tar.gz.

File metadata

Download URL: pdf2sqlite-0.0.2.tar.gz
Upload date: Oct 10, 2025
Size: 117.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.1

File hashes

Hashes for pdf2sqlite-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`a4ab4c9fa058b24d73b0f3dfe08cd4c2981e501b86be8d66509ea49fa43f3664`
MD5	`bd770ea4349f623389d8600d602cefde`
BLAKE2b-256	`3fe158648cc2253b937ab913a298e5644f7748f9dad0cc0551b4e8fd081bea9f`

File details

Details for the file pdf2sqlite-0.0.2-py3-none-any.whl.

File metadata

Download URL: pdf2sqlite-0.0.2-py3-none-any.whl
Upload date: Oct 10, 2025
Size: 20.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.1

File hashes

Hashes for pdf2sqlite-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`985b5aa6e7e945c8bb046f5933f760316bb15069bfed369f613deb754c262f02`
MD5	`b1b93644f2f7fa9a433849c3481b4df3`
BLAKE2b-256	`dd9368456552cd0089271579132c90954c4344221a898cd7b8fa0da19f95e3f4`
