Search through files with FTS5 and vectors, and get reranked results. Fast.
litesearch
NB: If you’re reading this in the GitHub README, I recommend reading the more nicely formatted documentation version of this tutorial.
Litesearch is a lightweight library to set up a fastlite database with FTS5 and vector search capabilities using usearch.
Litesearch uses usearch’s sqlite extensions to provide fast vector search and combines them with sqlite’s FTS5 capabilities to provide hybrid search.
- Litesearch uses fastlite, which is a lightweight wrapper around SQLite that makes SQLite database management delightful. It uses apsw rather than sqlite3 and provides best practices out of the box.
- Usearch is a cross-language package which provides vector search capabilities. We’re using its sqlite extensions here to provide fast vector search capabilities.
Litesearch provides a simple way to set up this database using the database() method. You get a store with FTS5 and vector search capabilities using the get_store() method, and you can search through its contents using the search() method.
Litesearch also provides document and code manipulation tools as part of the data module and onnx based text encoders as part of the utils module.
- litesearch extends pymupdf Document and Page classes to extract texts, images and links easily.
- litesearch provides onnx based text encoders which can be used to generate embeddings for documents and queries.
- litesearch provides a quick code parsing utility to parse python files into code chunks for ingestion.
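To give a feel for what parsing a python file into code chunks can look like, here is a minimal sketch built only on the standard-library ast module. It is not litesearch's code parser: the function name chunk_python_source and the chunk fields are hypothetical, and it only splits on top-level functions and classes.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class (hypothetical helper)."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the definition, ready to store as content.
                "content": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

for chunk in chunk_python_source(sample):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk's content could then be embedded and inserted into a store as one row, with the name and line range kept as metadata.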
Get Started
fastlite and usearch will be installed automatically with litesearch if you do not have them already.
!pip install litesearch -qq
Litesearch only adds the dependencies it needs, so you can use import * from litesearch without worrying about heavy dependencies.
> The first import will try to set up the usearch extensions and install libsqlite3 if you do not have it already. macOS also needs an extra step to add libsqlite3 to its LC_PATH. Check postfix.py for details.
from litesearch import *
database
db = database()
db.q('select sqlite_version() as sqlite_version')
[{'sqlite_version': '3.51.1'}]
Let’s try some of usearch’s distance functions
import numpy as np
embs = dict(
    v1=np.ones((100,),dtype=np.float32).tobytes(), # vector of ones
    v2=np.zeros((100,),dtype=np.float32).tobytes(), # vector of zeros
    v3=np.full((100,),0.25,dtype=np.float32).tobytes() # vector of 0.25s
)
def dist_q(metric):
    return db.q(f'''
        select
            distance_{metric}_f32(:v1,:v2) as {metric}_v1_v2,
            distance_{metric}_f32(:v1,:v3) as {metric}_v1_v3,
            distance_{metric}_f32(:v2,:v3) as {metric}_v2_v3
    ''', embs)
for fn in ['sqeuclidean', 'divergence', 'inner', 'cosine']: print(dist_q(fn))
[{'sqeuclidean_v1_v2': 100.0, 'sqeuclidean_v1_v3': 56.25, 'sqeuclidean_v2_v3': 6.25}]
[{'divergence_v1_v2': 34.657352447509766, 'divergence_v1_v3': 12.046551704406738, 'divergence_v2_v3': 8.66433334350586}]
[{'inner_v1_v2': 1.0, 'inner_v1_v3': -24.0, 'inner_v2_v3': 1.0}]
[{'cosine_v1_v2': 1.0, 'cosine_v1_v3': 0.0, 'cosine_v2_v3': 1.0}]
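The numbers above can be checked by hand. The sketch below recomputes them in pure Python. The formulas are inferred from the printed values rather than taken from usearch's source: squared Euclidean as the sum of squared differences, inner as one minus the dot product, cosine as one minus cosine similarity, and divergence as a Jensen-Shannon divergence with natural logs. Returning 1.0 for a zero-length vector in cosine is an assumption chosen to match the output shown.

```python
import math

v1 = [1.0] * 100   # vector of ones
v2 = [0.0] * 100   # vector of zeros
v3 = [0.25] * 100  # vector of 0.25s

def sqeuclidean(a, b):
    # Sum of squared component differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner(a, b):
    # One minus the dot product, so it goes negative for large dot products.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: matches the 1.0 printed for the zero vector
    return 1.0 - dot / (norm_a * norm_b)

def divergence(a, b):
    # Jensen-Shannon style divergence; zero components contribute nothing.
    total = 0.0
    for x, y in zip(a, b):
        m = (x + y) / 2
        if x > 0:
            total += x * math.log(x / m)
        if y > 0:
            total += y * math.log(y / m)
    return total / 2

for fn in (sqeuclidean, divergence, inner, cosine):
    print(fn.__name__, fn(v1, v2), fn(v1, v3), fn(v2, v3))
```

The divergence values agree with the extension's output only to a few decimal places, which is expected since this sketch works in double precision while the extension works on f32 values.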
store
A store is a table with FTS5 and vector search capabilities.
store = db.get_store()
store.schema
'CREATE TABLE [store] (\n [content] TEXT NOT NULL,\n [embedding] BLOB,\n [metadata] TEXT,\n [uploaded_at] FLOAT DEFAULT CURRENT_TIMESTAMP,\n [id] INTEGER PRIMARY KEY\n)'
Let’s use a naive embedder for testing.
> Check out FastEncode in the utils module for onnx based text encoders. Check the examples folder for usage. If you have a GPU available, you can pip install onnxruntime-gpu and use dtype=np.float16 for faster performance.
from sklearn.feature_extraction.text import TfidfVectorizer
txts, q = ['this is a text', "I'm hungry", "Let's play! shall we?"], 'playing hungry'
# This is a naive vectoriser intended to showcase litesearch. In practice, use a proper text encoder.
def embed_texts(texts):
    tfidf = TfidfVectorizer(max_features=20000, stop_words='english')
    return tfidf.fit_transform(texts).toarray().astype(np.float16)
embs = embed_texts(txts + [q]) # last one is query
embs
array([[0. , 0. , 0. , 0. , 0. , 1. ],
[1. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.577, 0.577, 0. , 0.577, 0. ],
[0.619, 0. , 0. , 0.785, 0. , 0. ]], dtype=float16)
usearch also works with JSON embeddings, but passing raw bytes lets it make good use of SIMD.
rows = [dict(content=t, embedding=e.ravel().tobytes()) for t,e in zip(txts,embs[:-1])]
store.insert_all(rows)
<Table store (content, embedding, metadata, uploaded_at, id)>
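To make the bytes-versus-JSON remark concrete, here is a small sketch comparing the two encodings of one vector. The standard-library array module stands in for NumPy so the snippet has no dependencies, and the vector values are illustrative. It shows the encodings only and is not a benchmark of usearch.

```python
import json
from array import array

vector = [0.0, 0.577, 0.577, 0.0, 0.577, 0.0]

# Packed float32 buffer: a fixed 4 bytes per value, no parsing needed to read it.
as_bytes = array("f", vector).tobytes()

# JSON text: human-readable, but has to be parsed back into floats before use.
as_json = json.dumps(vector)

# Round-trip the packed buffer back into float32 values.
decoded = array("f")
decoded.frombytes(as_bytes)

print(len(as_bytes), "bytes packed vs", len(as_json), "characters of JSON")
print(list(decoded))
```

The decoded values differ from the originals only by float32 rounding, which is the same precision trade-off you accept when storing np.float32 embeddings.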
search
You can search through the contents using the search method of the database. The results are automatically reranked. Turn this off by passing rrf=False.
These results are not very meaningful since we’re using a naive TF-IDF vectoriser. Check the examples folder for more meaningful examples with onnx based text encoders.
db.search(q, embs[-1].ravel().tobytes(), columns=['id', 'content'])
[{'id': 2, 'content': "I'm hungry"},
{'id': 1, 'content': 'this is a text'},
{'id': 3, 'content': "Let's play! shall we?"}]
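The rrf flag name suggests reciprocal rank fusion, a common way to merge a keyword ranking with a vector ranking. Below is a minimal sketch of that general technique, not litesearch's implementation. The function name and the constant k=60 (a conventional default) are assumptions, and the id lists are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit adds 1 / (k + rank) to its id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the stable sort keeps first-seen order on ties.
    return sorted(scores, key=lambda doc_id: -scores[doc_id])

fts_hits = [3, 1]     # ids ranked by a keyword search (illustrative)
vec_hits = [2, 1, 3]  # ids ranked by a vector search (illustrative)

print(reciprocal_rank_fusion([fts_hits, vec_hits]))
print(reciprocal_rank_fusion([[], vec_hits]))
```

Ids that appear in both lists collect two contributions and tend to rise to the top, while an empty list leaves the other ranking unchanged.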
Turning off reranking can help you understand where the results are coming from.
db.search(q, embs[-1].ravel().tobytes(), columns=['id', 'content'], rrf=False)
{'fts': [],
'vec': [{'id': 2, 'content': "I'm hungry"},
{'id': 1, 'content': 'this is a text'},
{'id': 3, 'content': "Let's play! shall we?"}]}
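One way an empty fts list can arise: FTS5 combines space-separated terms with an implicit AND, and its default tokenizer does not stem, so a query like playing hungry only matches rows that contain both exact tokens. The sketch below shows this with the standard-library sqlite3 module, assuming your SQLite build includes FTS5. The table name docs is hypothetical, and whether litesearch forwards the query to FTS5 unchanged is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(content)")
con.executemany(
    "INSERT INTO docs(content) VALUES (?)",
    [("this is a text",), ("I'm hungry",), ("Let's play! shall we?",)],
)

def match(query):
    # Run a full-text query and return (rowid, content) pairs, best match first.
    rows = con.execute(
        "SELECT rowid, content FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()

both_terms = match("playing hungry")       # implicit AND: both tokens required
either_term = match("playing OR hungry")   # explicit OR: one token is enough

print(both_terms)
print(either_term)
```

If you want keyword hits for queries like this, rewriting the query with OR between terms before searching is one option to try.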
Next steps
- Check out the data module for document and code parsing utilities.
- Check out the utils module for onnx based text encoders.
- Check out the examples folder for complete examples.