Simple full-text search library with SQL backend

Sifts – Simple Full Text & Semantic Search

🔎 Sifts is a simple but powerful Python package for managing and querying document collections with support for both SQLite and PostgreSQL databases.

It is designed to efficiently handle full-text search and vector search, making it ideal for applications that involve large-scale text data retrieval.

Features

  • Dual Database Support: Sifts works with both SQLite and PostgreSQL, offering the simplicity of SQLite for lightweight applications and the scalability of PostgreSQL for larger, production environments.
  • Full-Text Search (FTS): Perform advanced text search queries with full-text search support.
  • Vector Search: Integrate with embedding models to perform vector-based similarity searches, perfect for applications involving natural language processing.
  • Flexible Querying: Supports complex queries with filtering, ordering, and pagination.

Background

The main idea of Sifts is to leverage the built-in full-text search capabilities in SQLite and PostgreSQL and to make them available via a unified, Pythonic API. You can use SQLite for small projects or development and trivially switch to PostgreSQL to scale your application.
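
As a rough illustration of the kind of built-in capability meant here on the SQLite side, the sketch below talks to SQLite's FTS5 extension directly through Python's standard sqlite3 module. It is not Sifts code and does not show how Sifts builds its index: the table name docs and the helper search are hypothetical, and the example assumes your SQLite build ships with FTS5 enabled.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# An FTS5 virtual table: doc_id is stored but not indexed, content is searchable.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, content)")
conn.executemany(
    "INSERT INTO docs (doc_id, content) VALUES (?, ?)",
    [("doc1", "Lorem ipsum dolor"), ("doc2", "sit amet")],
)


def search(query: str) -> list[str]:
    """Return the IDs of matching documents, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE docs MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


print(search("lorem"))          # single word, case-insensitive by default
print(search("lorem OR amet"))  # FTS5's own boolean operators are uppercase
print(search("lor*"))           # prefix wildcard
```

PostgreSQL exposes a comparable feature through different SQL, which is the sort of difference a unified API hides from the calling application.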

For vector search, cosine similarity is computed in PostgreSQL via the pgvector extension, while with SQLite similarity is calculated in memory.
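
To make the in-memory case concrete, here is a minimal sketch of ranking stored vectors by cosine similarity in plain Python. The function names cosine_similarity and rank_by_similarity and the toy two-dimensional vectors are assumptions made for illustration; this is not the code Sifts runs internally.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    limit: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Score every stored vector against the query, highest score first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy vectors standing in for real embeddings.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.0], vectors)
print([doc_id for doc_id, _ in ranked])
```

Scoring every vector like this is linear in the number of documents, which is one reason to move larger collections to PostgreSQL with pgvector.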

Sifts does not come with a server mode, as it is meant to be a library imported by other apps. The original motivation for its development was to replace Whoosh as the search backend in Gramps Web, which is based on Flask.

Installation

You can install Sifts via pip:

pip install sifts

Usage

Full-text search

import sifts

# by default, creates a new SQLite database in the working directory
collection = sifts.Collection(name="my_collection")

# Add docs to the index. Can also update and delete.
collection.add(
    documents=["Lorem ipsum dolor", "sit amet"],
    metadatas=[{"foo": "bar"}, {"foo": "baz"}],  # optional, can filter on these
    ids=["doc1", "doc2"], # unique for each doc. Uses UUIDs if omitted
)

results = collection.query(
    "Lorem",
    # limit=2,  # optionally limit the number of results
    # where={"foo": "bar"},  # optional filter
    # order_by="foo",  # sort by metadata key (rather than rank)
)

The API is inspired by chroma.

Full-text search syntax

Sifts supports the following search syntax:

  • Search for individual words
  • Search for multiple words (will match documents where all words are present)
  • and operator
  • or operator
  • * wildcard (in SQLite, supported anywhere in the search term, in PostgreSQL only at the end of the search term)

The search syntax is the same regardless of backend.

Vector search (semantic search)

Sifts can also be used as a vector store for semantic search engines or retrieval-augmented generation (RAG) with large language models (LLMs).

To enable vector storage, pass an embedding_function to the Collection factory, then set vector_search=True in the query method. For instance, using the Sentence Transformers library:

import sifts
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

def embedding_function(queries: list[str]):
    return model.encode(queries)

collection = sifts.Collection(
    db_url="sqlite:///vector_store.db",
    name="my_vector_store",
    embedding_function=embedding_function
)

# Adding vector data to the collection
collection.add(["This is a test sentence.", "Another example query."])

# Querying the collection with semantic search
results = collection.query("Find similar sentences.", vector_search=True)

PostgreSQL collections require installing and enabling the pgvector extension.
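
The embedding function in the example above takes a list of strings and returns one vector per string. For quick experiments where downloading a model is not practical, a deterministic stand-in with the same input and output shape might look like the sketch below. The names toy_embedding_function and DIM are hypothetical, the vectors carry no semantic meaning, and whether Sifts accepts plain Python lists instead of the arrays returned by model.encode is an assumption that has not been verified here.

```python
import hashlib
import math

DIM = 16  # hypothetical vector size, chosen only for illustration


def toy_embedding_function(texts: list[str]) -> list[list[float]]:
    """Map each text to a fixed-size, unit-length bag-of-words vector."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for token in text.lower().split():
            # Hash each token into one of DIM buckets (stable across runs).
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Normalise to unit length; leave an all-zero vector untouched.
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors


embeddings = toy_embedding_function(["This is a test sentence.", "Another example query."])
print(len(embeddings), len(embeddings[0]))
```

Identical inputs always produce identical vectors, which makes a stand-in like this convenient for tests but unsuitable for real semantic search.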

Updating and Deleting Documents

Documents can be updated or deleted using their IDs.

# Update a document
collection.update(ids=["document_id"], contents=["Updated content"])

# Delete a document
collection.delete(ids=["document_id"])

Contributing

Contributions are welcome! Feel free to create an issue if you encounter problems or have a suggestion for improvement, or, even better, submit a PR along with it!

License

Sifts is licensed under the MIT License. See the LICENSE file for details.


Happy Sifting! 🚀
