Maintain a FAISS index for specified Datasette tables
datasette-faiss
Installation
Install this plugin in the same environment as Datasette.
datasette install datasette-faiss
Usage
This plugin creates in-memory FAISS indexes for specified tables on startup.
If the tables are modified after the server has started the indexes will not (yet) pick up those changes.
Configuration
The tables to be indexed must have id and embedding columns. The embedding column must be a blob containing an array of floating point numbers, encoded using the following Python function:
import struct

def encode(vector):
    return struct.pack("f" * len(vector), *vector)
You can import that function from this package like so:
from datasette_faiss import encode
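As a minimal sketch of how a table meeting these requirements could be populated, the following uses Python's standard sqlite3 module. The in-memory database, the sample vectors and the local copy of encode() (repeated here so the block runs without the plugin installed) are assumptions for illustration only.

```python
import sqlite3
import struct


def encode(vector):
    # Same packing as the function shown above: one 32-bit float per value.
    return struct.pack("f" * len(vector), *vector)


# Hypothetical sample data: (id, vector) pairs with three dimensions each.
rows = [
    (1, [0.1, 0.2, 0.3]),
    (2, [0.2, 0.1, 0.4]),
    (3, [0.9, 0.8, 0.7]),
]

# An in-memory database stands in for a real database file.
conn = sqlite3.connect(":memory:")
conn.execute("create table embeddings (id integer primary key, embedding blob)")
conn.executemany(
    "insert into embeddings (id, embedding) values (?, ?)",
    [(row_id, encode(vector)) for row_id, vector in rows],
)
conn.commit()

# Each blob holds 4 bytes per dimension.
sizes = [len(blob) for (blob,) in conn.execute("select embedding from embeddings")]
print(sizes)
```

Every vector in the table should have the same number of dimensions, since a nearest neighbor search compares vectors element by element.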
You can specify which tables should have indexes created for them by adding this to metadata.json:
{
    "plugins": {
        "datasette-faiss": {
            "tables": [
                ["blog", "embeddings"]
            ]
        }
    }
}
Each table is an array listing the database name and the table name.
If you are using metadata.yml the configuration should look like this:
plugins:
  datasette-faiss:
    tables:
    - ["blog", "embeddings"]
SQL functions
The plugin makes four new SQL functions available within Datasette:
faiss_search(database, table, embedding, k)
Returns the k nearest neighbors to the embedding found in the specified database and table. For example:
select faiss_search('blog', 'embeddings', (select embedding from embeddings where id = 3), 5)
This will return a JSON array of the five IDs of the records in the embeddings table in the blog database that are closest to the specified embedding. The returned value looks like this:
["1", "1249", "1011", "5", "10"]
You can use the SQLite json_each() function to turn that into a table-like sequence that you can join against.
Here's an example query that does that:
with related as (
  select value from json_each(
    faiss_search(
      'blog',
      'embeddings',
      (select embedding from embeddings limit 1),
      5
    )
  )
)
select * from blog_entry, related
where id = value
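The join pattern can be tried without the plugin by substituting a literal JSON array for the faiss_search() call. The sketch below does that with Python's sqlite3 module; the blog_entry rows and the stand-in array are made up for illustration, and it assumes the SQLite build bundled with Python includes the JSON functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table blog_entry (id integer primary key, title text)")
conn.executemany(
    "insert into blog_entry (id, title) values (?, ?)",
    [(1, "First"), (5, "Fifth"), (10, "Tenth"), (42, "Unrelated")],
)

# Literal JSON array standing in for the value faiss_search() would return.
fake_result = '["1", "5", "10"]'

matches = conn.execute(
    """
    with related as (
        select value from json_each(?)
    )
    select blog_entry.id, blog_entry.title
    from blog_entry, related
    -- The IDs arrive as JSON strings, so cast them before comparing.
    where blog_entry.id = cast(related.value as integer)
    order by blog_entry.id
    """,
    [fake_result],
).fetchall()
print(matches)
```

The explicit cast is a defensive choice in this sketch rather than something the query above requires. Note also that a join like this does not by itself preserve the nearest-first order of the JSON array.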
faiss_search_with_scores(database, table, embedding, k)
Takes the same arguments as above, but the return value is a JSON array of pairs, each with an ID and a score - something like this:
[
["1", 0.0],
["1249", 0.21042244136333466],
["1011", 0.29391372203826904],
["5", 0.29505783319473267],
["10", 0.31554925441741943]
]
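To make the shape of that result concrete, here is a brute-force sketch in plain Python that produces ID and score pairs in the same format. It is not the plugin's implementation: the plugin delegates the search to FAISS. The sketch assumes the score is a squared Euclidean distance where smaller means closer, which is consistent with the 0.0 score in the output above but is not confirmed by it. The function name and sample vectors are hypothetical.

```python
import json


def brute_force_search_with_scores(vectors, query, k):
    """Return a JSON array of [id, score] pairs for the k closest vectors.

    vectors maps an ID to a list of floats. The score is the squared
    Euclidean distance (an assumption of this sketch), so 0.0 is an
    exact match.
    """
    scored = []
    for row_id, vector in vectors.items():
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, str(row_id)))
    # Sort by distance so the closest vectors come first.
    scored.sort()
    return json.dumps([[row_id, distance] for distance, row_id in scored[:k]])


vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [3.0, 4.0]}
result = json.loads(brute_force_search_with_scores(vectors, [0.0, 0.0], 2))
print(result)
```

A linear scan like this visits every row on each query, which is the cost an index is meant to avoid on larger tables.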
faiss_encode(json_vector)
Given a JSON array of floats, returns the binary embedding blob that can be used with the other functions:
select faiss_encode('[2.4, 4.1, 1.8]')
-- Returns a 12 byte blob
select hex(faiss_encode('[2.4, 4.1, 1.8]'))
-- Returns 9A991940333383406666E63F
faiss_decode(vector_blob)
The opposite of faiss_encode().
select faiss_decode(X'9A991940333383406666E63F')
Returns:
[2.4000000953674316, 4.099999904632568, 1.7999999523162842]
Note that the values are stored as 32-bit floats, so they do not round-trip to exactly the same values that were encoded.
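The same conversion can be reproduced in Python with the struct and json modules, which may help when preparing or inspecting blobs outside of SQL. The helper names encode_json and decode_blob are hypothetical and are not claimed to be part of this package; they assume the blob layout produced by the encode() function shown earlier.

```python
import json
import struct


def encode_json(json_vector):
    # Hypothetical helper mirroring faiss_encode(): JSON array -> packed floats.
    vector = json.loads(json_vector)
    return struct.pack("f" * len(vector), *vector)


def decode_blob(blob):
    # Hypothetical helper mirroring faiss_decode(): packed floats -> JSON array.
    count = len(blob) // 4  # 4 bytes per 32-bit float
    return json.dumps(list(struct.unpack("f" * count, blob)))


blob = encode_json("[2.4, 4.1, 1.8]")
decoded = json.loads(decode_blob(blob))
print(len(blob))
print(decoded)
```

The decoded numbers differ slightly from the inputs because each value is narrowed to 32 bits when packed and widened back to a Python float when unpacked.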
Development
To set up this plugin locally, first check out the code. Then create a new virtual environment:
cd datasette-faiss
python3 -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest