A library for semantic similarity search

Semsis is a library for semantic similarity search. It is designed around the following goals:

  • Simplicity: The library is deliberately small and implements only the minimum needed for semantic search.

  • Maintainability: Unit tests, docstrings, and type hints are all available.

  • Extensibility: Additional code can easily be implemented as needed.

  • Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.

REQUIREMENTS

  • faiss (see INSTALL.md)

  • The other requirements are defined in pyproject.toml and can be installed with pip install ./ from the repository root.

INSTALLATION

Via pip:

pip install semsis

From source:

git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./

From source with uv:

git clone https://github.com/de9uch1/semsis.git
cd semsis/
uv sync
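
After any of the installation routes above, one quick way to confirm that the package is visible to the current interpreter is to query its installed metadata. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of semsis. With the uv route, it would need to be run inside the project environment, for example via `uv run python`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; not part of the semsis API.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if semsis is installed, otherwise None.
print(installed_version("semsis"))
```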

USAGE

Case 1: Use semsis as an API

An example of text search is available in end2end_test.py.

Note that this example is not optimized for billion-scale index construction. For an efficient implementation, see src/semsis/cli/README.rst.

  1. Encode the sentences and store them in a key–value store.

from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU
import math
import numpy as np

TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruites, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERYS = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"

MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2

encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the kvstore.
    kvstore.new(dim)
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
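
The batch boundaries computed in the loop above can be factored into a small helper, which makes the final, shorter batch easy to verify. This is a minimal sketch; `batch_ranges` is a hypothetical name and not part of semsis.

```python
import math
from typing import Iterator, Tuple


def batch_ranges(num_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (begin, end) index pairs that cover range(num_items) in order.

    Hypothetical helper for illustration; not part of the semsis API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(math.ceil(num_items / batch_size)):
        begin = i * batch_size
        # The last batch may be shorter than batch_size.
        end = min(begin + batch_size, num_items)
        yield begin, end


# With the 5 sentences and BATCH_SIZE = 2 used above:
print(list(batch_ranges(5, 2)))  # [(0, 2), (2, 4), (4, 5)]
```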

  2. Next, read the key–value store and build the kNN index.

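with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    # Train the index on the stored key vectors, then add them with their values.
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])

# Save the index and its configuration to disk.
retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)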

  3. Query.

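# Reload the saved index and search for the nearest neighbor of each query.
retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERYS).numpy()
distances, indices = retriever.search(query_vectors, k=1)

assert indices.squeeze(1).tolist() == [2, 0, 2]
# QUERYS[2] is identical to TEXT[2], so its distance should be 0.
assert np.isclose(distances[2, 0], 0.0)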
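
To make the shapes of `distances` and `indices` concrete, the sketch below performs the same kind of search by brute force with NumPy. It assumes squared L2 distance, which is consistent with the zero-distance assertion above but is not confirmed by this README. `brute_force_search` and the toy vectors are hypothetical and not part of semsis.

```python
import numpy as np


def brute_force_search(keys, values, queries, k=1):
    """Exact k-nearest-neighbor search under squared L2 distance (an assumption).

    Returns (distances, indices), each of shape (num_queries, k).
    Hypothetical helper for illustration; not part of the semsis API.
    """
    # Pairwise squared distances, shape (num_queries, num_keys).
    diff = queries[:, None, :] - keys[None, :, :]
    dists = np.sum(diff * diff, axis=-1)
    # Positions of the k smallest distances per query, nearest first.
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # Map positions to the stored values, mirroring add(keys, values).
    return np.take_along_axis(dists, order, axis=1), values[order]


keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype=np.float32)
values = np.arange(len(keys))
queries = np.array([[0.9, 0.1], [0.0, 2.0]], dtype=np.float32)

distances, indices = brute_force_search(keys, values, queries, k=1)
print(indices.squeeze(1).tolist())  # [1, 2]
print(np.isclose(distances[1, 0], 0.0))  # True
```

An exact search like this scans every key for every query, so it only serves as a mental model for small inputs.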

Case 2: Use semsis as command-line scripts

The command-line scripts are designed to run efficiently for billion-scale search. See src/semsis/cli/README.rst.

LICENSE

This library is published under the MIT license.

