Skip to main content

SIE integration for Weaviate

Project description

sie-weaviate

SIE integration for Weaviate v4.

Two integration paths

1. Client-side (this package, works now)

sie-weaviate provides vectorizer and enrichment helpers that call SIE's encode() and extract() and return data in the format Weaviate expects. You configure collections with Configure.Vectors.self_provided() and pass vectors on insert/query.

pip install sie-weaviate
import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEVectorizer

vectorizer = SIEVectorizer(base_url="http://localhost:8080", model="BAAI/bge-m3")

client = weaviate.connect_to_local()
try:
    collection = client.collections.create(
        "Documents",
        properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    texts = ["first doc", "second doc"]
    vectors = vectorizer.embed_documents(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties={"text": t}, vector=v)
        for t, v in zip(texts, vectors)
    ])

    query_vec = vectorizer.embed_query("search text")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()

2. Server-side module (partnership, planned)

A text2vec-sie Go module for the Weaviate server that enables native vectorizer config (Configure.Vectorizer.text2vec_sie(...)). See weaviate-module-spec/ for the spec and reference implementation.

Named vectors (dense + multivector)

SIENamedVectorizer produces multiple vector types in one SIE call. Use it with ColBERT models that output both dense and multivector (per-token) embeddings:

from sie_weaviate import SIENamedVectorizer

vectorizer = SIENamedVectorizer(
    base_url="http://localhost:8080",
    model="jinaai/jina-colbert-v2",
    output_types=["dense", "multivector"],
)

collection = client.collections.create(
    "Documents",
    properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
    vector_config=[
        wvc.config.Configure.Vectors.self_provided(name="dense"),
        wvc.config.Configure.Vectors.self_provided(name="multivector"),
    ],
)

named = vectorizer.embed_documents(["hello world"])
collection.data.insert_many([
    wvc.data.DataObject(properties={"text": "hello world"}, vector=named[0])
])

For hybrid search, Weaviate has built-in BM25 — no extra vectors needed:

results = collection.query.hybrid(query="search text", alpha=0.75)

Document enrichment for Query Agent

SIEDocumentEnricher combines SIE's embedding and entity extraction pipelines to produce documents with dense vectors and structured metadata. The extracted properties (persons, organizations, locations, categories) are exactly what Weaviate's Query Agent uses to construct filters from natural language queries.

import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEDocumentEnricher

enricher = SIEDocumentEnricher(
    base_url="http://localhost:8080",
    labels=["person", "organization", "location"],
    classify_model="knowledgator/gliclass-large-v3.0",
    classify_labels=["technical", "business", "legal"],
)

client = weaviate.connect_to_local()
try:
    collection = client.collections.create(
        "Documents",
        description="Documents with extracted entity and classification metadata.",
        properties=[
            wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(
                name="person", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="People mentioned in the document",
            ),
            wvc.config.Property(
                name="organization", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="Organizations mentioned in the document",
            ),
            wvc.config.Property(
                name="location", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="Locations mentioned in the document",
            ),
            wvc.config.Property(
                name="classification", data_type=wvc.config.DataType.TEXT,
                description="Document category: technical, business, or legal",
            ),
            wvc.config.Property(
                name="classification_score", data_type=wvc.config.DataType.NUMBER,
                description="Confidence score for the classification",
            ),
        ],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    # Embed + extract in one call
    texts = [
        "John Smith presented Google's new AI strategy in New York.",
        "The court ruling on patent law affects tech companies.",
    ]
    docs = enricher.enrich(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties=doc.properties, vector=doc.vector)
        for doc in docs
    ])

    # The Query Agent can now filter on extracted properties:
    # "find documents about Google" → organization filter + vector search
    # "show me legal documents mentioning John Smith" → classification + person filter
    query_vec = enricher.enrich_query("AI strategy announcements")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()

Testing

# Unit tests (no server needed)
pytest

# Integration tests (requires SIE + Weaviate)
pytest -m integration

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sie_weaviate-0.6.0.tar.gz (17.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sie_weaviate-0.6.0-py3-none-any.whl (10.0 kB view details)

Uploaded Python 3

File details

Details for the file sie_weaviate-0.6.0.tar.gz.

File metadata

  • Download URL: sie_weaviate-0.6.0.tar.gz
  • Upload date:
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sie_weaviate-0.6.0.tar.gz
Algorithm Hash digest
SHA256 91ec6e46c01ceafd1effa0af1e0d9887453badfa7b42d6f00a41f45d4de415f1
MD5 75bcfef7deeade3f830665db00990504
BLAKE2b-256 6851d2825de8eb72a0d401fe2a8cf81f9119cc974717006e1cd4c7d6cb35d0d7

See more details on using hashes here.

Provenance

The following attestation bundles were made for sie_weaviate-0.6.0.tar.gz:

Publisher: release-python.yml on superlinked/sie-internal

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sie_weaviate-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: sie_weaviate-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 10.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sie_weaviate-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c6e3502efbda2a067ac0379a6a853a2e764f573c4eb63bcd499baa5264602605
MD5 0eee844b949e832b41e0d8d4bed8c500
BLAKE2b-256 d4442c7b1231345baba65cf093255bb3a2ca2d11ba66b648907d90cbed86dd9c

See more details on using hashes here.

Provenance

The following attestation bundles were made for sie_weaviate-0.6.0-py3-none-any.whl:

Publisher: release-python.yml on superlinked/sie-internal

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page