SIE integration for Weaviate

sie-weaviate

SIE integration for Weaviate v4.

Two integration paths

1. Client-side (this package, works now)

sie-weaviate provides vectorizer helpers that call SIE's encode() and return vectors in the format Weaviate expects. You configure collections with Configure.Vectors.self_provided() and pass vectors on insert/query.

pip install sie-weaviate

import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEVectorizer

vectorizer = SIEVectorizer(base_url="http://localhost:8080", model="BAAI/bge-m3")

client = weaviate.connect_to_local()
try:
    collection = client.collections.create(
        "Documents",
        properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    texts = ["first doc", "second doc"]
    vectors = vectorizer.embed_documents(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties={"text": t}, vector=v)
        for t, v in zip(texts, vectors)
    ])

    query_vec = vectorizer.embed_query("search text")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()
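
For larger corpora it can help to embed in batches rather than in a single call. The following is a minimal sketch and not part of sie-weaviate: `batched_embed` and its `batch_size` default are hypothetical, and the sketch assumes only what the example above shows, namely that `embed_documents` takes a list of strings and returns one vector per string.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence

Vector = Sequence[float]


def batched_embed(
    texts: Iterable[str],
    embed: Callable[[list[str]], Sequence[Vector]],
    batch_size: int = 64,  # hypothetical default, tune for your SIE deployment
) -> Iterator[tuple[str, Vector]]:
    """Yield (text, vector) pairs, calling `embed` once per batch of texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        vectors = embed(batch)
        # Guard against a mismatch before pairing, since zip() would silently truncate.
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors than texts")
        yield from zip(batch, vectors)


# Stand-in embedder for demonstration only; in practice pass vectorizer.embed_documents.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


pairs = list(batched_embed(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With the objects from the example above, `batched_embed(texts, vectorizer.embed_documents)` would yield the same pairs as `zip(texts, vectors)`, and each pair can be wrapped in a `wvc.data.DataObject` as before.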

2. Server-side module (partnership, planned)

A text2vec-sie Go module for the Weaviate server that enables native vectorizer config (Configure.Vectorizer.text2vec_sie(...)). See weaviate-module-spec/ for the spec and reference implementation.

Named vectors (dense + sparse)

SIE's multi-output encode produces dense and sparse vectors in one call. Weaviate's named vectors feature stores them separately:

from sie_weaviate import SIENamedVectorizer

vectorizer = SIENamedVectorizer(
    base_url="http://localhost:8080",
    model="BAAI/bge-m3",
    output_types=["dense", "sparse"],
)

# Reuses the connected `client` and the `wvc` import from the client-side example above.
collection = client.collections.create(
    "Documents",
    properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
    vector_config=[
        wvc.config.Configure.Vectors.self_provided(name="dense"),
        wvc.config.Configure.Vectors.self_provided(name="sparse"),
    ],
)

named = vectorizer.embed_documents(["hello world"])
collection.data.insert_many([
    wvc.data.DataObject(properties={"text": "hello world"}, vector=named[0])
])

Storage note: SIE sparse vectors (SPLADE/BGE-M3) are expanded to full vocabulary length (~30K floats per document for BERT-based models) so that each weight keeps its vocabulary position for similarity search. Assuming 32-bit floats, that is on the order of 120 kB per document before any compression, which adds up quickly at large scale. If you only need keyword-style hybrid search, use Weaviate's built-in BM25 instead, which requires no extra vectors:

results = collection.query.hybrid(query="search text", alpha=0.75)
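
To make the storage note above concrete, here is a rough sketch of what expanding a sparse vector to vocabulary length implies. The helper names `expand_sparse` and `storage_bytes`, the `{token_id: weight}` input shape, and the 30,522-entry vocabulary (the size used by bert-base-uncased) are assumptions for illustration. sie-weaviate's internal representation may differ, and the byte count assumes uncompressed 32-bit floats.

```python
VOCAB_SIZE = 30_522  # assumption: bert-base-uncased vocabulary size


def expand_sparse(weights: dict[int, float], vocab_size: int = VOCAB_SIZE) -> list[float]:
    """Place each token weight at its vocabulary index; all other positions stay 0.0."""
    dense = [0.0] * vocab_size
    for token_id, weight in weights.items():
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} is outside the vocabulary")
        dense[token_id] = weight
    return dense


def storage_bytes(num_docs: int, dims: int = VOCAB_SIZE, bytes_per_float: int = 4) -> int:
    """Raw vector payload size, ignoring index overhead and compression."""
    return num_docs * dims * bytes_per_float


# Two non-zero token weights still occupy a full vocabulary-length vector.
vec = expand_sparse({101: 0.8, 2054: 1.3})
non_zero = sum(1 for value in vec if value != 0.0)

per_doc_bytes = storage_bytes(1)               # 30,522 * 4 bytes
million_docs_bytes = storage_bytes(1_000_000)  # same, times one million documents
print(len(vec), non_zero, per_doc_bytes, million_docs_bytes)
```

Under these assumptions a single document costs about 122 kB for the expanded sparse vector alone, so a million documents would need roughly 122 GB before any index overhead.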

Testing

# Unit tests (no server needed)
pytest

# Integration tests (requires SIE + Weaviate)
pytest -m integration
