Skip to main content

Save OpenAI API results to a SQLite database

Project description

openai-to-sqlite

PyPI Changelog Tests License

This tool provides utilities for interacting with OpenAI APIs and storing the results in a SQLite database.

See Semantic search answers: Q&A against documentation with GPT3 + OpenAI embeddings for background on this project.

Installation

Install this tool using pip:

pip install openai-to-sqlite

Configuration

You will need an OpenAI API key to use this tool.

You can create one at https://beta.openai.com/account/api-keys

You can then either set the API key as an environment variable:

export OPENAI_API_KEY=sk-...

Or pass it to each command using the --token sk-... option.

Calling OpenAI APIs with SQL functions

The openai-to-sqlite query command can be used to execute SQL queries that call OpenAI APIs.

Functions available are:

  • chatgpt(prompt) - call the OpenAI Chat API using model gpt-3.5-turbo with the specified prompt.
  • chatgpt(prompt, system) - call that API with the prompt and the specified system prompt.

More functions are planned in the future.

Here's how to use this command to run basic sentiment analysis against content in a table:

openai-to-sqlite query database.db "
  update messages set sentiment = chatgpt(
    'Sentiment analysis for this message: ' || message ||
    ' - ONLY return a lowercase string from: positive, negative, neutral, unknown'
  )
  where sentiment not in ('positive', 'negative', 'neutral', 'unknown')
    or sentiment is null
"

This updates the sentiment column in a table called messages. It populates it with the response from the specified prompt.

The command will display a progress bar indicating how many rows are being processed.

You can add an empty sentiment column to a table using sqlite-utils like this:

sqlite-utils add-column database.db messages sentiment

Embeddings

The embeddings command can be used to calculate and store OpenAI embeddings for strings of text.

Each embedding has a cost, so be sure to familiarize yourself with the pricing for the embedding model.

The command can accept data in four different ways:

  • As a JSON file containing a list of objects
  • As a CSV file
  • As a TSV file
  • By running queries against a SQLite database

For all of these formats there should be an id column, followed by one or more text columns.

The ID will be stored as the content ID. Any other columns will be concatenated together and used as the text to be embedded.

The embeddings from the API will then be saved as binary blobs in the embeddings table of the specified SQLite database - or another table, if you pass the -t/--table option.

JSON, CSV and TSV

Given a CSV file like this:

id,content
1,This is a test
2,This is another test

Embeddings can be stored like so:

openai-to-sqlite embeddings embeddings.db data.csv

The resulting schema looks like this:

CREATE TABLE [embeddings] (
   [id] TEXT PRIMARY KEY,
   [embedding] BLOB
);

The same data can be provided as TSV data:

id    content
1     This is a test
2     This is another test

Then imported like this:

openai-to-sqlite embeddings embeddings.db data.tsv

Or as JSON data:

[
  {"id": 1, "content": "This is a test"},
  {"id": 2, "content": "This is another test"}
]

Imported like this:

openai-to-sqlite embeddings embeddings.db data.json

In each of these cases the tool automatically detects the format of the data. It does this by inspecting the data itself - it does not consider the file extension.

If the automatic detection is not working, you can pass --format json, csv or tsv to explicitly specify a format:

openai-to-sqlite embeddings embeddings.db data.tsv --format tsv

Importing data from standard input

You can use a filename of - to pipe data in to standard input:

cat data.tsv | openai-to-sqlite embeddings embeddings.db -

Data from a SQL query

The --sql option can be used to read data to be embedded from the attached SQLite database. The query must return an id column and one or more text columns to be embedded.

openai-to-sqlite embeddings content.db \
  --sql "select id, title from documents"

This will create a embeddings table in the content.db database and populate it with embeddings calculated from the title column in that query.

You can also store embeddings in one database while reading data from another database, using the --attach alias filename.db option:

openai-to-sqlite embeddings embeddings.db \
  --attach documents documents.db \
  --sql "select id, title from documents.documents"

A progress bar will be displayed when using --sql that indicates how long the embeddings are likely to take to calculate.

The CSV/TSV/JSON options do not correctly display the progress bar. You can work around this by importing your data into SQLite first (e.g. using sqlite-utils) and then running the embeddings using --sql.

Batching

Embeddings will be sent to the OpenAI embeddings API in batches of 100. If you know that your data is short strings you can increase the batch size, up to 2048, using the --batch-size option:

openai-to-sqlite embeddings embeddings.db data.csv --batch-size 2048

Working with the stored embeddings

The embedding column is a SQLite blob containing 1536 floating point numbers encoded as a sequence of 4 byte values.

You can extract them back to an array of floating point values in Python like this:

import struct

vector = struct.unpack(
    "f" * 1536, binary_embedding
)

Searching embeddings with the search command

Having saved the embeddings for content, you can run searches using the search command:

openai-to-sqlite search embeddings.db 'this is my search term'

The output will be a list of cosine similarity scores and content IDs:

% openai-to-sqlite search blog.db 'cool datasette demo'
0.843 7849
0.830 8036
0.828 8195
0.826 8098
0.818 8086
0.817 8171
0.816 8121
0.815 7860
0.815 7872
0.814 8169

Add the -t/--table option if your embeddings are stored in a different table:

openai-to-sqlite search content.db 'this is my search term' -t documents

Development

To contribute to this tool, first checkout the code. Then create a new virtual environment:

cd openai-to-sqlite
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

openai-to-sqlite-0.3.tar.gz (13.8 kB view details)

Uploaded Source

Built Distribution

openai_to_sqlite-0.3-py3-none-any.whl (12.5 kB view details)

Uploaded Python 3

File details

Details for the file openai-to-sqlite-0.3.tar.gz.

File metadata

  • Download URL: openai-to-sqlite-0.3.tar.gz
  • Upload date:
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for openai-to-sqlite-0.3.tar.gz
Algorithm Hash digest
SHA256 0e9a662871a6edd30af96fd3f73ab9b1caec06d4d6847c3e68dc9dcf10c39421
MD5 8c93184674cce210a71e0b52d8c11d16
BLAKE2b-256 fb028b78df0f240d6cea4baadf777208e6cfa82f0e34299b40fc2967a67fc279

See more details on using hashes here.

File details

Details for the file openai_to_sqlite-0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for openai_to_sqlite-0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 6addb0126fea6ea3abe7c0f55a85c41f55e47a1df3cbaa880e3449fc1586b02a
MD5 eb82efa3ab2871f254f09acb71403c4b
BLAKE2b-256 3875b70c80a8fef87b6a74bf6da17226f25de7bf97210e9b7269e1da486b97d5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page