SIE integration for Weaviate
sie-weaviate
SIE integration for Weaviate v4.
Two integration paths
1. Client-side (this package, works now)
sie-weaviate provides vectorizer helpers that call SIE's encode() and return
vectors in the format Weaviate expects. You configure collections with
Configure.Vectors.self_provided() and pass vectors on insert/query.
pip install sie-weaviate
import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEVectorizer

vectorizer = SIEVectorizer(base_url="http://localhost:8080", model="BAAI/bge-m3")
client = weaviate.connect_to_local()
try:
    # Weaviate stores the vectors but does not compute them.
    collection = client.collections.create(
        "Documents",
        properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    # Embed with SIE, then attach one vector to each object on insert.
    texts = ["first doc", "second doc"]
    vectors = vectorizer.embed_documents(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties={"text": t}, vector=v)
        for t, v in zip(texts, vectors)
    ])

    # Embed the query the same way and search by vector.
    query_vec = vectorizer.embed_query("search text")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()
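For more than a handful of documents it can help to embed and insert in chunks, so that no single request to SIE or Weaviate grows too large. The following is a minimal sketch of that pattern, not part of sie-weaviate: `chunked`, `index_in_chunks`, and the default chunk size of 64 are hypothetical names and values chosen for illustration, and the embedding and insert steps are passed in as plain callables.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_chunks(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], Sequence[Vector]],
    insert: Callable[[List[Tuple[str, Vector]]], object],
    chunk_size: int = 64,  # arbitrary default, tune for your deployment
) -> int:
    """Embed and insert `texts` chunk by chunk; return how many were inserted."""
    total = 0
    for chunk in chunked(texts, chunk_size):
        vectors = embed(chunk)
        if len(vectors) != len(chunk):
            raise ValueError("embed() must return one vector per text")
        # Hand the caller (text, vector) pairs for this chunk only.
        insert(list(zip(chunk, vectors)))
        total += len(chunk)
    return total
```

With the objects from the example above, `embed` could be `vectorizer.embed_documents`, and `insert` a small function that wraps each `(text, vector)` pair in `wvc.data.DataObject` and passes the list to `collection.data.insert_many`.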
2. Server-side module (partnership, planned)
A text2vec-sie Go module for the Weaviate server is planned. It would enable
native vectorizer config (Configure.Vectorizer.text2vec_sie(...)). See
weaviate-module-spec/ for the spec and reference implementation.
Named vectors (dense + sparse)
SIE's multi-output encode produces dense and sparse vectors in one call. Weaviate's named vectors feature stores them separately:
from sie_weaviate import SIENamedVectorizer

vectorizer = SIENamedVectorizer(
    base_url="http://localhost:8080",
    model="BAAI/bge-m3",
    output_types=["dense", "sparse"],
)
collection = client.collections.create(
    "Documents",
    properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
    vector_config=[
        wvc.config.Configure.Vectors.self_provided(name="dense"),
        wvc.config.Configure.Vectors.self_provided(name="sparse"),
    ],
)
named = vectorizer.embed_documents(["hello world"])
collection.data.insert_many([
    wvc.data.DataObject(properties={"text": "hello world"}, vector=named[0])
])
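When a collection has several named vectors, a vector query has to say which one to search. In the Weaviate v4 Python client this is the `target_vector` argument of `near_vector`. The sketch below assumes the query embedding is a mapping keyed by vector name, the same shape as the entries of `named` above; whether `SIENamedVectorizer` returns exactly this shape for queries is an assumption, and `search_named` is a hypothetical helper, not part of sie-weaviate.

```python
from typing import Any, Mapping, Sequence


def search_named(
    collection: Any,
    query_vectors: Mapping[str, Sequence[float]],
    name: str = "dense",
    limit: int = 5,
) -> Any:
    """Run a near_vector query against one named vector of `collection`."""
    if name not in query_vectors:
        raise KeyError(f"no query vector named {name!r}")
    return collection.query.near_vector(
        near_vector=list(query_vectors[name]),
        target_vector=name,  # must match a name used in vector_config
        limit=limit,
    )
```

A call such as `search_named(collection, query_vectors, name="sparse")` would then search the sparse vector instead of the dense one.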
Storage note: SIE sparse vectors (SPLADE/BGE-M3) are expanded to full vocabulary length (~30K floats per document for BERT-based models) so that each vocabulary index keeps its position for similarity search. At 4 bytes per float that is roughly 120 KB per document for the sparse vector alone, which adds up quickly at large scale. If you only need keyword-style hybrid search, use Weaviate's built-in BM25 instead; it requires no extra vectors:
results = collection.query.hybrid(query="search text", alpha=0.75)
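To make the storage note concrete, here is a rough sketch of what expanding a sparse vector to vocabulary length means and what it costs. It assumes the sparse output can be viewed as a token-index to weight mapping, a vocabulary of 30,000 entries (the approximate figure above), and 4-byte floats; both function names are hypothetical, and the estimate covers raw vector data only, not index overhead.

```python
from typing import Dict, List


def expand_sparse(weights: Dict[int, float], vocab_size: int) -> List[float]:
    """Place each weight at its vocabulary index; every other slot is 0.0."""
    dense = [0.0] * vocab_size
    for index, weight in weights.items():
        if not 0 <= index < vocab_size:
            raise IndexError(f"token index {index} outside vocabulary")
        dense[index] = weight
    return dense


def expanded_storage_bytes(
    num_docs: int,
    vocab_size: int = 30_000,   # approximate, per the storage note
    bytes_per_float: int = 4,   # assumes float32 storage
) -> int:
    """Raw bytes needed to store one expanded sparse vector per document."""
    return num_docs * vocab_size * bytes_per_float


# Two non-zero weights still occupy a full vocabulary-length vector.
vector = expand_sparse({7: 0.5, 101: 1.25}, vocab_size=30_000)
per_doc = expanded_storage_bytes(1)             # 120_000 bytes
per_million = expanded_storage_bytes(1_000_000)  # 120_000_000_000 bytes
```

Under these assumptions, one million documents need about 120 GB for the expanded sparse vectors, which is the trade-off to weigh against BM25-based hybrid search.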
Testing
# Unit tests (no server needed)
pytest
# Integration tests (requires SIE + Weaviate)
pytest -m integration
File details
Details for the file sie_weaviate-0.1.8.tar.gz.
File metadata
- Download URL: sie_weaviate-0.1.8.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5c37e90c29d6411fb6b4d2801701a38a3932fdd8eec066221713e0e29228cfd |
| MD5 | bac4c86ffb9086161c595161af6351ca |
| BLAKE2b-256 | c8d2218f43b5ab1cc7e136c13cc093ccf42f3a33e09f46947445531e4bb7a3f9 |
File details
Details for the file sie_weaviate-0.1.8-py3-none-any.whl.
File metadata
- Download URL: sie_weaviate-0.1.8-py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e08793c191bdbbbc920c71f1984aa084b134cd9ac4b9025a9bb0f8f6020a850b |
| MD5 | 9068bde7c9cf4c27a459f60435d88c6d |
| BLAKE2b-256 | bb27703245149ea9ca8b8e3ca986ce4f78264d09b42eaf9c49c459deaa7c1cc2 |