Skip to main content

a vectorstore supports vector & bm25 search using sqlite or postgresql

Project description

Description

A vectorstore supports vector & bm25 search using sqlite or postgresql as backend through sqlalchemy.

Features

  • Do document CRUD
  • Do vector search
  • Do bm25 search
  • Filter results by metadata when search.
  • Filter results by source tags when search. This is similar to collection of langchain-postgres, but can filter results across different tags.
  • Use customized fts tokenize with sqlite.
  • Same API to use Sqlite as embeded and using Postgres as server
  • Support sync & async methods
  • Minimal dependencies, all results are builtin List & Dict

Install

# use sync sqlite
$ pip install sqlalchemy-vectorsotres[sqlite]

# use async sqlite
# $ pip install sqlalchemy-vectorsotres[asqlite]

# use postgres either sync or async
# $ pip install sqlalchemy-vectorsotres[postgres]

Please attention:

  1. sqlite-vec 0.1.1 not work on windows, need to install >=0.1.2.alpha9
  2. postgres use the psycopg driver, not psycopg2

Usage

Here is an example using sync sqlite:

import openai
from sqlalchemy_vectorstores import SqliteDatabase, SqliteVectorStore


DB_URL = "sqlite:///:memory:"
OPENAI_BASE_URL = "http://localhost:9997/v1" # local xinference server
OPENAI_API_KEY = "E"
EMBEDDING_MODEL = "bge-large-zh-v1.5"


client = openai.Client(base_url=OPENAI_BASE_URL, api_key=OPENAI_API_KEY)
def embed_func(text: str) -> list[float]:
    return client.embeddings.create(
        input=text,
        model=EMBEDDING_MODEL,
    ).data[0].embedding

# Using sync sqlite database. you can use other 3 combinations.
db = SqliteDatabase(DB_URL, echo=False)
vs = SqliteVectorStore(db, dim=1024, embedding_func=embed_func)


query = "Alaqua Cox"
sentences = [
    "Capri-Sun is a brand of juice concentrate–based drinks manufactured by the German company Wild and regional licensees.",
    "George V was King of the United Kingdom and the British Dominions, and Emperor of India, from 6 May 1910 until his death in 1936.",
    "Alaqua Cox is a Native American (Menominee) actress.",
    "Shohei Ohtani is a Japanese professional baseball pitcher and designated hitter for the Los Angeles Dodgers of Major League Baseball.",
    "Tamarindo, also commonly known as agua de tamarindo, is a non-alcoholic beverage made of tamarind, sugar, and water.",
]


# add sources
src_id = vs.add_source(url="file1.pdf", tags=["a", "b"], metadata={"path": "path1"})

# add documents
for s in sentences:
    vs.add_document(src_id=src_id, content=s)

# search sources by url
r = vs.search_sources(vs.db.make_filter(vs.src_table.c.url, "file1.pdf"))
print(r)

# search sources by metadata
# vs.db_make_filter is a helper method to build sqlalchemy expressions.
r = vs.search_sources(vs.db.make_filter(vs.src_table.c.metadata, "path2", "dict", "$.path"))
print(r)

# search sources by tags
r = vs.get_sources_by_tags(tags_all=["b", "a"])
print(r)

r = vs.get_sources_by_tags(tags_any=["b", "a"])
print(r)

# upsert source with id - update
vs.upsert_source({"id": src_id, "metadata": {"path": "path1", "added": True}})
r = vs.get_source_by_id(src_id)
print(r)

# upsert source without id - insert
src_id3 = vs.upsert_source({"url": "file3.docx", "metadata": {"path": "path3", "added": True}})
r = vs.get_source_by_id(src_id3)
print(r)

# list documents of source file
print("list documents of source file")
r = vs.get_documents_of_source(src_id3)

# delete source and documents/tsvector/embeddings belongs to it.
r = vs.delete_source(src_id3)

# search by vector
r = vs.search_by_vector(query)
print(r)

# search by vector with filters
filters = [
    vs.db.make_filter(vs.src_table.c.url, "file1.pdf")
]
r = vs.search_by_vector(query, filters=filters)
print(r)

# search by bm25
r = vs.search_by_bm25(query)
print(r)

Go here for more examples.

Concepts

The vectore stores informations in 4 tables:

  • All files from local dist or network are stored in a file source table with columns:

    • id, url, tags, metadata
  • Splitted documents are stored in document table:

    • id, content, metadata. Same to langchain Document
    • type. Other documents besides documents loaded from source file such as summary, Q/A pairs, etc.
    • target_id. The source document which a typed document belongs to.
  • Cut Words are stored in FTS or TSVECTOR table

  • Embeddings are stored in vector table

All functions are provided by 4 classes:

Sqlite + Sqlite-vec + Sqlite-fts Postgres + Pgvector
Sync Database SqliteDatabase PostgresDatabase
Async Database AsyncSqliteDatabase AsyncPostgresDatabase

The *Database classes manage database initialization such as create tables, load extensions. The *VectorStore classes manage document CRUD and search.

Todo

  • add prebuilt sqlite fts tokenizer using jieba to suuport chinese bm25 search.
  • support customized tokenizer for postgres

Changelog

newly released 0.1.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

sqlalchemy_vectorstores-0.1.0-py3-none-any.whl (24.3 kB view details)

Uploaded Python 3

File details

Details for the file sqlalchemy_vectorstores-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for sqlalchemy_vectorstores-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 628f240fc895c58d782f749af4970873814776c44edfedeb5f4d740947d3eadc
MD5 29f9789cd520613093268ed90f131993
BLAKE2b-256 0c5b36f754986020e302cabc2a92fcf6bfbb19d0ef44e0fb70304a5b95f74c28

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page