
A library for semantic similarity search

Project description

Semsis is a library for semantic similarity search. It is designed to focus on the following goals:

  • Simplicity: This library is deliberately small and implements only the minimum necessary for semantic search.

  • Maintainability: Unit tests, docstrings, and type hints are all available.

  • Extensibility: Additional functionality can easily be implemented as needed.

  • Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.

REQUIREMENTS

  • faiss (see INSTALL.md)

  • The other requirements are defined in pyproject.toml and can be installed via pip install ./.

INSTALLATION

git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./

USAGE

Case 1: Use semsis as an API

An example of text search is available in end2end_test.py.

Note that this example is not optimized for billion-scale index construction. If you are looking for an efficient implementation, see semsis/cli/README.rst.

  1. Encode the sentences and store the resulting vectors in a key–value store.

from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU
import math
import numpy as np

TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruites, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERYS = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"

MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2

encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the kvstore.
    kvstore.new(dim)
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
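
The loop above walks over TEXT in slices of BATCH_SIZE, with a shorter final slice when the number of sentences is not a multiple of the batch size. As a minimal sketch of that index arithmetic only, the helper below (the name batch_bounds is hypothetical and not part of semsis) yields the same (b, e) pairs without needing an encoder or a key–value store:

```python
import math
from typing import Iterator, Tuple


def batch_bounds(num_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (begin, end) slice bounds that cover range(num_items) in order.

    Hypothetical helper for illustration; it mirrors the arithmetic of the
    encoding loop and is not a function provided by semsis.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(math.ceil(num_items / batch_size)):
        begin = i * batch_size
        # The last batch is clipped so that it never runs past num_items.
        end = min(begin + batch_size, num_items)
        yield begin, end


# Five sentences with a batch size of two give two full batches and one short one.
bounds = list(batch_bounds(5, 2))
print(bounds)  # [(0, 2), (2, 4), (4, 5)]
```

With such a helper, the body of the loop would read TEXT[b:e] for each yielded pair, exactly as in the example.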

  2. Next, read the key–value store and build the kNN index.

with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])

retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)

  3. Query.

retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERYS).numpy()
distances, indices = retriever.search(query_vectors, k=1)

assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
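
To make the shape and meaning of the search result concrete, here is a brute-force sketch in plain NumPy. It is not how semsis or faiss performs the search, and it assumes squared L2 distance as the metric, which the text above does not state. The function name brute_force_knn and the toy vectors are made up for illustration. It returns a (num_queries, k) array of distances and a matching array of indices, and a query identical to a stored key comes back with distance 0, which is what the last assertion above checks for the third query.

```python
import numpy as np


def brute_force_knn(keys: np.ndarray, queries: np.ndarray, k: int = 1):
    """Exhaustive k-nearest-neighbor search (illustrative only).

    Assumes squared L2 distance. Returns (distances, indices), both of
    shape (num_queries, k), sorted from nearest to farthest.
    """
    # Pairwise differences, shape (num_queries, num_keys, dim).
    diff = queries[:, None, :] - keys[None, :, :]
    # Squared L2 distance for every (query, key) pair.
    sq_dists = np.einsum("qnd,qnd->qn", diff, diff)
    # Positions of the k smallest distances per query.
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


# Toy 2-dimensional "embeddings" standing in for sentence vectors.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1], [0.0, 2.0]], dtype=np.float32)

distances, indices = brute_force_knn(keys, queries, k=1)
print(indices.squeeze(1).tolist())  # [1, 2]
```

An exhaustive scan like this costs time proportional to the number of keys per query, which is one reason a trained index is used instead for large collections.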

Case 2: Use semsis via command-line scripts

The command-line scripts are designed to run efficiently for billion-scale search. See semsis/cli/README.rst.

LICENSE

This library is published under the MIT License.
