Builds a searchable sqlite knowledge base out of pdf data

https://github.com/user-attachments/assets/f803902d-509e-44ca-a293-f5ff021b7e6d

pdf2sqlite

pdf2sqlite lets you convert an arbitrarily large set of PDFs into a single sqlite database.

Why? Because you might want an LLM to be able to search and query within these documents. If you have a large set of documents, you may not be able to fit them all into the model's context window, or you may find that doing so degrades performance.

In order to make information in the PDFs more discoverable, pdf2sqlite provides a variety of labeling mechanisms.

  1. You can extract a short searchable "gist" of each page (using any text completion model supported by litellm) and an "abstract" for the PDF.
  2. Figures and tabular data are identified within the PDF, and tabular data is extracted using gmft.
  3. Figures and tables can be described by a vision model.
  4. PDF sections can be labeled with doc2vec-style embedding vectors to support semantic search. These are stored in the database using sqlite-vec.

Usage

usage: pdf2sqlite [-h] -p PDFS [PDFS ...] -d DATABASE [-s SUMMARIZER]
                  [-a ABSTRACTER] [-e EMBEDDER] [-v VISION_MODEL] [-t]
                  [-o] [-l LOWER_PIXEL_BOUND] [-z DECOMPRESSION_LIMIT]

convert pdfs into an easy-to-query sqlite DB

options:
  -h, --help            show this help message and exit
  -p, --pdfs PDFS [PDFS ...]
                        pdfs to add to DB
  -d, --database DATABASE
                        database where PDF will be added
  -s, --summarizer SUMMARIZER
                        an LLM to summarize pdf pages (litellm naming conventions)
  -a, --abstracter ABSTRACTER
                        an LLM to produce an abstract (litellm naming conventions)
  -e, --embedder EMBEDDER
                        an embedding model to generate vector embeddings (litellm naming conventions)
  -v, --vision_model VISION_MODEL
                        a vision model to describe images (litellm naming conventions)
  -t, --tables          use gmft to analyze tables (will also use a vision model if available)
  -o, --offline         offline mode for gmft (blocks hugging face telemetry, solves VPN issues)
  -l, --lower_pixel_bound LOWER_PIXEL_BOUND
                        lower bound on pixel size for images
  -z, --decompression_limit DECOMPRESSION_LIMIT
                        upper bound on size for decompressed images. default 75,000,000. zero disables

Invocation

You can run the latest version easily with uvx or uv tool. Here's an example invocation (assuming you have Bedrock credentials in your environment):

uvx pdf2sqlite --offline -p ../data/*.pdf -d data.db -a "bedrock/amazon.nova-lite-v1:0" -s "bedrock/amazon.nova-lite-v1:0" -t

Integration with an LLM

Some design guidelines:

  1. Pass the database schema to the LLM. The schema will contain some comments that describe the different columns.

  2. To get the most out of the database, you will probably want to write a tool that your LLM can call to convert the binary PDF and image data stored in the database into PDF pages and images. A good design is to let the LLM pass in a table name, row id, and column name, and receive the relevant content in response. The LLM will generally be able to work out the necessary inputs from the schema, so the tool will be robust against future schema changes.

  3. A backend (e.g., Amazon Bedrock) that supports returning PDFs as the result of a tool call may be helpful. Otherwise, it will probably work to return the PDF as a separate content block alongside a tool call result that just says "success, PDF will be delivered" or something similar.
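
As a rough illustration of guidelines 1 and 2, here is a minimal sketch using only Python's standard sqlite3 module. The function names `get_schema` and `fetch_cell` are hypothetical and not part of pdf2sqlite. The sketch also assumes that "row id" means SQLite's built-in `rowid`, which may not match how you choose to address rows. It does not assume any particular table or column names, since those come from the schema at run time.

```python
import re
import sqlite3

# Table and column names are interpolated into SQL text, so only plain
# identifiers are accepted (an assumption; loosen this if your names differ).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def get_schema(conn: sqlite3.Connection) -> str:
    """Return the stored CREATE statements, including any inline comments."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] + ";" for row in rows)


def fetch_cell(conn: sqlite3.Connection, table: str, column: str, row_id: int) -> object:
    """Hypothetical tool body: return a single cell, such as a PDF or image blob."""
    if not (_IDENT.match(table) and _IDENT.match(column)):
        raise ValueError("table and column must be plain identifiers")

    # Confirm that the requested table and column exist before querying.
    tables = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError(f"unknown table: {table}")
    columns = {r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')}
    if column not in columns:
        raise ValueError(f"unknown column: {column}")

    # The row id is passed as a bound parameter, never interpolated.
    row = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE rowid = ?', (row_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no row {row_id} in {table}")
    return row[0]
```

In use, you would open the database with `sqlite3.connect("data.db")`, include the output of `get_schema` in the system prompt, and expose `fetch_cell` as a tool. Blob columns come back as `bytes`, which you can then wrap in whatever PDF or image content block your backend expects.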
