datasette-faiss

Maintain a FAISS index for specified Datasette tables

See Semantic search answers: Q&A against documentation with GPT3 + OpenAI embeddings for background on this project.

Installation

Install this plugin in the same environment as Datasette.

datasette install datasette-faiss

Usage

This plugin creates in-memory FAISS indexes for specified tables on startup, using an IndexFlatL2 FAISS index type.

If the tables are modified after the server has started, the indexes will not (yet) pick up those changes.
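To make the index type concrete: a flat L2 index keeps every vector as-is and answers a query by comparing it against all of them, ranking by squared Euclidean distance. The following is a minimal pure-Python sketch of that idea, not the plugin's or FAISS's actual code. The `FlatL2Sketch` class and its method names are invented for illustration.

```python
class FlatL2Sketch:
    """Illustrative stand-in for an exhaustive (flat) L2 index."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, id_, vector):
        # Vectors are stored unchanged; there is no training or compression step
        self.ids.append(id_)
        self.vectors.append(list(vector))

    def search(self, query, k):
        # Compare the query against every stored vector (squared L2 distance)
        scored = [
            (sum((a - b) ** 2 for a, b in zip(vector, query)), id_)
            for id_, vector in zip(self.ids, self.vectors)
        ]
        scored.sort()
        return [(id_, distance) for distance, id_ in scored[:k]]


index = FlatL2Sketch()
index.add("a", [0.0, 0.0])
index.add("b", [3.0, 4.0])
index.add("c", [1.0, 1.0])
nearest = index.search([0.0, 0.0], k=2)
print(nearest)
```

Because the vectors live only in this in-memory structure, anything added to the underlying table later stays invisible until the structure is rebuilt, which is the limitation described above.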

Configuration

The tables to be indexed must have id and embedding columns. The embedding column must be a blob holding an array of 32-bit floating point numbers, encoded using the following Python function:

import struct

def encode(vector):
    return struct.pack("f" * len(vector), *vector)

You can import that function from this package like so:

from datasette_faiss import encode
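As a rough sketch of how a table in the required shape might be populated, the example below writes encoded vectors into an `embeddings` table using only the standard library. It redefines `encode()` locally so it runs without the plugin installed. The in-memory database and the three-dimensional sample vectors are assumptions made for illustration.

```python
import sqlite3
import struct


def encode(vector):
    # Same packing as the function shown above: one 32-bit float per value
    return struct.pack("f" * len(vector), *vector)


# In-memory database for the demo; a real setup would use a database file
conn = sqlite3.connect(":memory:")
conn.execute("create table embeddings (id integer primary key, embedding blob)")

# Made-up three-dimensional vectors; real embeddings are much longer
sample = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}
conn.executemany(
    "insert into embeddings (id, embedding) values (?, ?)",
    [(id_, encode(vector)) for id_, vector in sample.items()],
)
conn.commit()

# Each float takes 4 bytes, so a 3-value vector is stored as a 12 byte blob
stored = conn.execute(
    "select id, length(embedding) from embeddings order by id"
).fetchall()
print(stored)
```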

You can specify which tables should have indexes created for them by adding this to metadata.json:

{
    "plugins": {
        "datasette-faiss": {
            "tables": [
                ["blog", "embeddings"]
            ]
        }
    }
}

Each entry in tables is an array listing the database name followed by the table name.

If you are using metadata.yml the configuration should look like this:

plugins:
  datasette-faiss:
    tables:
    - ["blog", "embeddings"]
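
If the list of tables is produced by a script, the same structure can be built as a Python dictionary and serialized. This is only a sketch: the second table name, `comment_embeddings`, is hypothetical and included just to show more than one entry.

```python
import json

config = {
    "plugins": {
        "datasette-faiss": {
            "tables": [
                ["blog", "embeddings"],
            ]
        }
    }
}

# Hypothetical second table, added to show the [database, table] pair format
config["plugins"]["datasette-faiss"]["tables"].append(["blog", "comment_embeddings"])

metadata_json = json.dumps(config, indent=4)
print(metadata_json)
```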

SQL functions

The plugin makes six new SQL functions available within Datasette:

faiss_search(database, table, embedding, k)

Returns the k nearest neighbors to the embedding found in the specified database and table. For example:

select faiss_search('blog', 'embeddings', (select embedding from embeddings where id = 3), 5)

This will return a JSON array of the five IDs of the records in the embeddings table in the blog database that are closest to the specified embedding. The returned value looks like this:

["1", "1249", "1011", "5", "10"]

You can use the SQLite json_each() function to turn that into a table-like sequence that you can join against.

Here's an example query that does that:

with related as (
  select value from json_each(
    faiss_search(
      'blog',
      'embeddings',
      (select embedding from embeddings limit 1),
      5
    )
  )
)
select * from blog_entry, related
where id = value
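
The join pattern above can be tried without the plugin by registering a stand-in SQL function. In the sketch below, `demo_search` is a hypothetical brute-force replacement for `faiss_search()` that takes only an embedding and k. The table contents are invented, and the example assumes a SQLite build with JSON functions available.

```python
import json
import sqlite3
import struct


def encode(vector):
    return struct.pack("f" * len(vector), *vector)


def decode(blob):
    return struct.unpack("f" * (len(blob) // 4), blob)


# Invented sample data: id -> (title, embedding vector)
rows = {
    1: ("First post", [0.0, 0.0]),
    2: ("Second post", [1.0, 0.0]),
    3: ("Third post", [5.0, 5.0]),
}
vectors = {id_: vector for id_, (_, vector) in rows.items()}

conn = sqlite3.connect(":memory:")
conn.execute("create table blog_entry (id integer primary key, title text)")
conn.execute("create table embeddings (id integer primary key, embedding blob)")
for id_, (title, vector) in rows.items():
    conn.execute("insert into blog_entry values (?, ?)", (id_, title))
    conn.execute("insert into embeddings values (?, ?)", (id_, encode(vector)))


def demo_search(blob, k):
    # Hypothetical stand-in for faiss_search(): rank every vector by squared L2
    query = decode(blob)
    ranked = sorted(
        vectors,
        key=lambda id_: sum((a - b) ** 2 for a, b in zip(vectors[id_], query)),
    )
    # Return a JSON array of string IDs, the shape shown above
    return json.dumps([str(id_) for id_ in ranked[:k]])


conn.create_function("demo_search", 2, demo_search)

sql = """
with related as (
  select value from json_each(
    demo_search((select embedding from embeddings where id = 1), 2)
  )
)
select blog_entry.id, blog_entry.title
from blog_entry, related
where blog_entry.id = cast(related.value as integer)
"""
results = conn.execute(sql).fetchall()
print(results)
```

The explicit cast is a defensive choice for this sketch, since the IDs arrive as JSON strings.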

faiss_search_with_scores(database, table, embedding, k)

Takes the same arguments as above, but the return value is a JSON array of pairs, each with an ID and a score. Because the index uses L2 distance, lower scores indicate closer matches. The result looks something like this:

[
    ["1", 0.0],
    ["1249", 0.21042244136333466],
    ["1011", 0.29391372203826904],
    ["5", 0.29505783319473267],
    ["10", 0.31554925441741943]
]
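
One way to consume this result outside SQL is to parse the JSON and filter on the score. The sketch below reuses the example value above. The 0.3 threshold is an arbitrary illustration, and skipping a score of exactly 0 assumes that such a row is the query embedding itself.

```python
import json

# The example return value shown above, as a JSON string
raw = (
    '[["1", 0.0], ["1249", 0.21042244136333466], '
    '["1011", 0.29391372203826904], ["5", 0.29505783319473267], '
    '["10", 0.31554925441741943]]'
)

pairs = json.loads(raw)
threshold = 0.3  # arbitrary cut-off chosen for this illustration

# Keep rows that are close enough, skipping the zero-distance match
close_ids = [id_ for id_, score in pairs if 0 < score <= threshold]
print(close_ids)
```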

faiss_encode(json_vector)

Given a JSON array of floats, returns the binary embedding blob that can be used with the other functions:

select faiss_encode('[2.4, 4.1, 1.8]')
-- Returns a 12 byte blob
select hex(faiss_encode('[2.4, 4.1, 1.8]'))
-- Returns 9A991940333383406666E63F

faiss_decode(vector_blob)

The opposite of faiss_encode().

select faiss_decode(X'9A991940333383406666E63F')

Returns:

[2.4000000953674316, 4.099999904632568, 1.7999999523162842]

Note that the values are stored as 32-bit floats, so decoding returns the nearest representable single-precision numbers rather than the exact decimal values that were originally encoded.
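
The same round trip can be reproduced in plain Python with the `struct` module. This sketch forces little-endian byte order with `<` so the bytes match the hex shown above on any machine, whereas the `encode()` function shown earlier uses the machine's native order.

```python
import struct

values = [2.4, 4.1, 1.8]

# Pack as little-endian 32-bit floats: 4 bytes per value
blob = struct.pack("<" + "f" * len(values), *values)
print(blob.hex().upper())

# Unpack back to Python floats (which are 64-bit, hence the extra digits)
decoded = list(struct.unpack("<" + "f" * (len(blob) // 4), blob))
print(decoded)

# Re-encoding the decoded values reproduces the same bytes exactly
reencoded = struct.pack("<" + "f" * len(decoded), *decoded)
```

The decimal values change once, when they are first narrowed to 32 bits. After that, encoding and decoding are lossless.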

faiss_agg(id, embedding, compare_embedding, k)

This aggregate function finds the k nearest neighbors to compare_embedding among the rows passed to it, identifying each row by its id. For example:

select faiss_agg(
    id, embedding, (select embedding from embeddings where id = 3), 5
) from embeddings

Unlike the faiss_search() function, this does not depend on the per-table index that the plugin creates when it first starts running. Instead, an index is built every time the aggregation function is run.

This means that it should only be used on smaller sets of values. Once you get above 10,000 or so rows, this function is likely to become prohibitively slow.

The function returns a JSON array of IDs representing the k rows with the closest distance scores, like this:

[1324, 344, 5562, 553, 2534]

You can use the json_each() function to turn that into a table-like sequence that you can join against.
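
To illustrate why the aggregate does its work from scratch on every call, here is a sketch of an aggregate with the same argument shape, registered through Python's `sqlite3` module. `demo_agg` and the `DemoAgg` class are hypothetical stand-ins that use brute force instead of FAISS, and the sample rows are invented.

```python
import json
import sqlite3
import struct


def encode(vector):
    return struct.pack("f" * len(vector), *vector)


def decode(blob):
    return struct.unpack("f" * (len(blob) // 4), blob)


class DemoAgg:
    """Hypothetical stand-in for faiss_agg(), using brute force instead of FAISS."""

    def __init__(self):
        self.rows = []
        self.query = None
        self.k = 0

    def step(self, id_, embedding, compare_embedding, k):
        # Called once per row: nothing is pre-indexed, so collect every vector
        self.rows.append((id_, decode(embedding)))
        self.query = decode(compare_embedding)
        self.k = k

    def finalize(self):
        # Called once at the end: rank all collected rows by squared L2 distance
        if self.query is None:
            return json.dumps([])
        ranked = sorted(
            self.rows,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[1], self.query)),
        )
        return json.dumps([id_ for id_, _ in ranked[: self.k]])


conn = sqlite3.connect(":memory:")
conn.create_aggregate("demo_agg", 4, DemoAgg)
conn.execute("create table embeddings (id integer primary key, embedding blob)")
for id_, vector in [(1, [0.0, 0.0]), (2, [1.0, 0.0]), (3, [5.0, 5.0])]:
    conn.execute("insert into embeddings values (?, ?)", (id_, encode(vector)))

result = conn.execute(
    "select demo_agg(id, embedding, "
    "(select embedding from embeddings where id = 1), 2) from embeddings"
).fetchone()[0]
print(result)
```

Since the aggregate only sees the rows the query feeds it, adding a where clause to the outer query narrows the search to that subset.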

Try an example faiss_agg() query.

faiss_agg_with_scores(id, embedding, compare_embedding, k)

This is similar to the faiss_agg() aggregate function but it returns a list of pairs, each with an ID and the corresponding score - something that looks like this (if k was 2):

[[2412, 0.25], [1245, 24.25]]

Try an example faiss_agg_with_scores() query.

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd datasette-faiss
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
