A library for semantic similarity search
Semsis is a library for semantic similarity search. It is designed to focus on the following goals:
Simplicity: This library is not rich or complex and implements only the minimum necessary for semantic search.
Maintainability: Unit tests, docstrings, and type hints are all available.
Extensibility: Additional functionality can be implemented easily as needed.
Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.
REQUIREMENTS
faiss (see INSTALL.md)
The other requirements are defined in pyproject.toml and can be installed via pip install ./ from the repository root.
INSTALLATION
git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./
USAGE
Case 1: Use semsis as an API
An example of text search is available in end2end_test.py.
Note that this example is not optimized for billion-scale index construction. If you are looking for an efficient implementation, see semsis/cli/README.rst.
First, encode the sentences and store them in a key–value store.
from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU
import math
import numpy as np
TEXT = [
"They listen to jazz and he likes jazz piano like Bud Powell.",
"I really like fruites, especially I love grapes.",
"I am interested in the k-nearest-neighbor search.",
"The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
"This content is restricted.",
]
QUERYS = [
"I've implemented some k-nearest-neighbor search algorithms.",
"I often listen to jazz and I have many CDs which Bud Powell played.",
"I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"
MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2
encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the kvstore.
    kvstore.new(dim)
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
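The batching arithmetic in the loop above can be looked at on its own. The following is a minimal sketch using only the standard library; the helper name iter_batches is hypothetical and is not part of semsis. It shows how the ceil-based batch count and the min-clamped end index produce a short final batch when the number of sentences is not a multiple of the batch size.

```python
import math
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items.

    Hypothetical helper for illustration; not part of the semsis API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # ceil() ensures that a trailing partial batch is still visited.
    num_batches = math.ceil(len(items) / batch_size)
    for i in range(num_batches):
        begin = i * batch_size
        # Clamp the end index so the last slice never runs past the data.
        end = min((i + 1) * batch_size, len(items))
        yield items[begin:end]


sentences = ["s0", "s1", "s2", "s3", "s4"]
batches = list(iter_batches(sentences, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

With five sentences and a batch size of 2, as in the example above, the encoder would be called three times, and the last call receives a single sentence.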
Next, read the key–value store and build the kNN index.
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])

retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)
Finally, load the index and query it.
retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERYS).numpy()
distances, indices = retriever.search(query_vectors, k=1)
assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
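To make the shape of the search result concrete, here is a minimal brute-force sketch of k-nearest-neighbor search in plain Python. It is not how semsis or faiss implement retrieval, and it assumes squared Euclidean distance, which may differ from the metric configured for the retriever. The function names squared_l2 and search_exact are hypothetical. Like the example above, it returns one row of distances and one row of indices per query, and a query that is identical to a stored vector gets distance 0.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_exact(
    queries: Sequence[Vector], keys: Sequence[Vector], k: int = 1
) -> Tuple[List[List[float]], List[List[int]]]:
    """Exhaustively compare each query with every key (illustration only).

    Assumption: squared L2 is the metric; a real index may use another one.
    """
    distances: List[List[float]] = []
    indices: List[List[int]] = []
    for query in queries:
        # Sort (distance, key index) pairs and keep the k closest keys.
        ranked = sorted((squared_l2(query, key), i) for i, key in enumerate(keys))[:k]
        distances.append([d for d, _ in ranked])
        indices.append([i for _, i in ranked])
    return distances, indices


keys = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
queries = [[0.9, 0.1], [0.0, 2.0]]
distances, indices = search_exact(queries, keys, k=1)
print(indices)  # [[1], [2]]
```

The second query equals the third key exactly, so its nearest-neighbor distance is 0.0, which mirrors the final assertion in the example above.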
Case 2: Use semsis as command-line scripts
The command-line scripts are designed to run efficiently for billion-scale search. See semsis/cli/README.rst.
LICENSE
This library is published under the MIT license.