A library for semantic similarity search
Semsis is a library for semantic similarity search. It is designed to focus on the following goals:
Simplicity: The library is deliberately small and implements only the minimum needed for semantic search.
Maintainability: Unit tests, docstrings, and type hints are provided throughout.
Extensibility: Additional functionality can be added easily as needed.
Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.
REQUIREMENTS
faiss (see INSTALL.md)
The other requirements are defined in pyproject.toml and can be installed via pip install ./.
INSTALLATION
via pip:
pip install semsis
from the source:
git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./
from the source with uv:
git clone https://github.com/de9uch1/semsis.git
cd semsis/
uv sync
USAGE
Case 1: Use semsis as an API
An example of text search is available in end2end_test.py.
Note that this example is not optimized for billion-scale index construction. For an efficient implementation, see src/semsis/cli/README.rst.
First, encode the sentences and store them in a key–value store.
from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU
import math
import numpy as np
TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruites, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERYS = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"
MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2
encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the kvstore.
    kvstore.new(dim)
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
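The batching loop above computes the start and end offsets of each batch by hand. As a minimal sketch of the same slicing logic, the snippet below steps through the list in strides of the batch size. Note that iter_batches is a hypothetical helper introduced only for illustration; it is not part of semsis.

```python
import math
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items.

    Hypothetical helper for illustration; not part of the semsis API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Slicing past the end of a sequence is safe in Python,
    # so the last batch is simply shorter when needed.
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


sentences = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(sentences, 2))
```

Each yielded batch could then be passed to encoder.encode() and kvstore.add() in the same way as TEXT[b:e] in the loop above.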
Next, read the key–value store and build the kNN index.
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])
retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)
Finally, query the index.
retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERYS).numpy()
distances, indices = retriever.search(query_vectors, k=1)
assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
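To make the shapes of distances and indices in the query step concrete, here is a minimal NumPy sketch of exact nearest-neighbor search. It assumes a squared L2 distance, which may differ from the metric the retriever is configured with, and brute_force_knn is a hypothetical helper, not a semsis function. As in the assertions above, a query identical to a stored vector comes back with distance 0.

```python
import numpy as np


def brute_force_knn(queries: np.ndarray, keys: np.ndarray, k: int = 1):
    """Exact k-NN by squared L2 distance.

    Hypothetical helper for illustration; not part of the semsis API.
    Returns (distances, indices), each of shape (num_queries, k).
    """
    # Pairwise squared distances, shape (num_queries, num_keys).
    diffs = queries[:, None, :] - keys[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)
    # Indices of the k smallest distances per query, nearest first.
    indices = np.argsort(sq_dists, axis=1)[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1], [0.0, 2.0]], dtype=np.float32)
distances, indices = brute_force_knn(queries, keys, k=1)
```

This brute-force scan costs time proportional to the number of stored vectors per query, which is why an index is trained and built beforehand for large collections.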
Case 2: Use semsis via command-line scripts
The command-line scripts are designed to run efficiently for billion-scale search. See src/semsis/cli/README.rst.
LICENSE
This library is published under the MIT License.
Download files
File details
Details for the file semsis-0.1.3.tar.gz.
File metadata
- Download URL: semsis-0.1.3.tar.gz
- Upload date:
- Size: 26.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71192067c068f35b87ffaed635939119039d7dbff56bdf21a6cc18636a092862 |
| MD5 | 56e9bc180e9e8a0e5055c71b9a818134 |
| BLAKE2b-256 | 1a763a2250d7a6be3febc453ba35cb89af15face30d17ea7d083eec4fff3037e |
File details
Details for the file semsis-0.1.3-py3-none-any.whl.
File metadata
- Download URL: semsis-0.1.3-py3-none-any.whl
- Upload date:
- Size: 33.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0837c24f02c63c93f5625e6368e1ea7ed62bc7e1dd95ae8447c17c554b6cab3c |
| MD5 | fc61fff25dcd66b70bbdd37e6ab9a88d |
| BLAKE2b-256 | e38a56927037a0e1ad6d94d44d17c98bfb3ad713d1d7d81207586f9557252eaa |