A library for semantic similarity search

Project description

Semsis is a library for semantic similarity search. It is designed to focus on the following goals:

  • Simplicity: The library is deliberately small and implements only the minimum necessary for semantic search.

  • Maintainability: Unit tests, docstrings, and type hints are all available.

  • Extensibility: Additional code can easily be added as needed.

  • Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.

REQUIREMENTS

  • faiss (see INSTALL.md)

  • The other requirements are defined in pyproject.toml and can be installed via pip install ./.

INSTALLATION

git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./
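Because faiss is installed separately from the pip step, a quick import check can confirm that the environment is complete. The following is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper, not part of semsis, and `faiss` and `semsis` are the import names assumed here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "faiss" and "semsis" are the import names assumed for this check.
    missing = missing_modules(["faiss", "semsis"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If faiss is reported missing, INSTALL.md is the place to look, as noted in the requirements.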

USAGE

Case 1: Use semsis as an API

An example of text search is available in end2end_test.py.

Note that this example is not optimized for billion-scale index construction. If you need an efficient implementation, see semsis/cli/README.rst.

  1. Encode the sentences and store the resulting vectors in a key–value store.

from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU
import math
import numpy as np

TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruites, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERYS = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"

MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2

encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the kvstore.
    kvstore.new(dim)
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
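
The loop above walks the sentences in fixed-size batches, and the final batch may be shorter than the others. The slicing arithmetic can be sketched in isolation as follows. `batch_bounds` is a hypothetical helper introduced for illustration and is not part of semsis.

```python
import math


def batch_bounds(num_items, batch_size):
    """Yield (begin, end) slice bounds covering num_items in order."""
    for i in range(math.ceil(num_items / batch_size)):
        begin = i * batch_size
        end = min(begin + batch_size, num_items)
        yield begin, end


# Five sentences with a batch size of 2 give two full batches and one short one.
bounds = list(batch_bounds(5, 2))
print(bounds)  # [(0, 2), (2, 4), (4, 5)]
```

Clamping the end index with `min` keeps the last slice inside the list, which is why the example needs no special case when the sentence count is not a multiple of the batch size.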

  2. Next, read the key–value store and build the kNN index.

with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])

retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)

  3. Query.

retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERYS).numpy()
distances, indices = retriever.search(query_vectors, k=1)

assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
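
The two assertions rely on the shape of the search results: one row per query and `k` columns, so with `k=1` each row holds the single nearest neighbor, and a query identical to a stored sentence is expected to have distance 0. As a conceptual illustration only, the sketch below performs an exact brute-force search over plain Python lists with squared Euclidean distance. It is not how semsis or faiss implement search, the choice of metric is an assumption, and `brute_force_search` and the toy vectors are hypothetical.

```python
def brute_force_search(queries, keys, k=1):
    """Return (distances, indices), each with one row per query and k columns."""
    all_distances, all_indices = [], []
    for q in queries:
        # Squared Euclidean distance from this query to every stored key.
        scored = [
            (sum((a - b) ** 2 for a, b in zip(q, key)), idx)
            for idx, key in enumerate(keys)
        ]
        scored.sort()
        top = scored[:k]
        all_distances.append([d for d, _ in top])
        all_indices.append([i for _, i in top])
    return all_distances, all_indices


# Toy 2-dimensional vectors standing in for sentence embeddings.
keys = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
queries = [[0.9, 1.2], [1.0, 0.0]]

distances, indices = brute_force_search(queries, keys, k=1)
print(indices)       # [[2], [1]]
print(distances[1])  # [0.0]
```

The second toy query equals a stored key exactly, so its nearest-neighbor distance is 0, mirroring the third query in the example above.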

Case 2: Use semsis as command line scripts

The command-line scripts are designed to run efficiently for billion-scale search. See semsis/cli/README.rst.

LICENSE

This library is published under the MIT License.

Download files

Source Distribution

semsis-0.1.2.tar.gz (23.1 kB)

Uploaded Source

Built Distribution

semsis-0.1.2-py3-none-any.whl (31.9 kB)

Uploaded Python 3

File details

Details for the file semsis-0.1.2.tar.gz.

File metadata

  • Download URL: semsis-0.1.2.tar.gz
  • Upload date:
  • Size: 23.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.10.11 Linux/5.15.150.1-microsoft-standard-WSL2

File hashes

Hashes for semsis-0.1.2.tar.gz
Algorithm Hash digest
SHA256 d513e253feef33b052d9f28621e6e49a7e6393beed84f71a91cf181b3193a8df
MD5 0b20a2c9edd7c5fa765861847cf732c5
BLAKE2b-256 22cfcf712b2e101befb3c22a34bc5a85f7408b54d4b6edf3f60f0ebc8a0b688e

File details

Details for the file semsis-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: semsis-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 31.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.10.11 Linux/5.15.150.1-microsoft-standard-WSL2

File hashes

Hashes for semsis-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 795aab46af6bb1d8a535601d77af5d9c0071685651b6c55267868ffb741dc4f6
MD5 8b177ca9555d525c2765744fb9cc5129
BLAKE2b-256 7bc44be75ee06903fd1ba075c5ddf1fc48f649e3ca4d40d10e1baa10b5f3a775
