sie-weaviate

SIE integration for Weaviate v4.

Two integration paths

1. Client-side (this package, works now)

sie-weaviate provides vectorizer and enrichment helpers that call SIE's encode() and extract() and return data in the format Weaviate expects. You configure collections with Configure.Vectors.self_provided() and pass vectors on insert/query.

pip install sie-weaviate

import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEVectorizer

vectorizer = SIEVectorizer(base_url="http://localhost:8080", model="BAAI/bge-m3")

client = weaviate.connect_to_local()
try:
    collection = client.collections.create(
        "Documents",
        properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    texts = ["first doc", "second doc"]
    vectors = vectorizer.embed_documents(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties={"text": t}, vector=v)
        for t, v in zip(texts, vectors)
    ])

    query_vec = vectorizer.embed_query("search text")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()
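
The insert step above relies on `texts` and `vectors` lining up one-to-one. A minimal sketch of a guard for that pairing follows; `pair_texts_with_vectors` is a hypothetical helper, not part of sie-weaviate, and plain dicts stand in for `wvc.data.DataObject` so the snippet runs without a server.

```python
from typing import Sequence


def pair_texts_with_vectors(
    texts: Sequence[str], vectors: Sequence[Sequence[float]]
) -> list[dict]:
    """Pair each text with its vector, failing loudly on mismatched input."""
    if len(texts) != len(vectors):
        raise ValueError(f"{len(texts)} texts but {len(vectors)} vectors")
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector dimensions: {sorted(dims)}")
    # Plain dicts stand in for wvc.data.DataObject(properties=..., vector=...).
    return [
        {"properties": {"text": t}, "vector": list(v)}
        for t, v in zip(texts, vectors)
    ]


rows = pair_texts_with_vectors(
    ["first doc", "second doc"],
    [[0.1, 0.2], [0.3, 0.4]],
)
```

A plain `zip` silently truncates to the shorter input, so checking lengths first turns a partial insert into an explicit error.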

2. Server-side module (partnership, planned)

A text2vec-sie Go module for the Weaviate server that enables native vectorizer config (Configure.Vectorizer.text2vec_sie(...)). See weaviate-module-spec/ for the spec and reference implementation.

Named vectors (dense + multivector)

SIENamedVectorizer produces multiple vector types in one SIE call. Use it with ColBERT models that output both dense and multivector (per-token) embeddings:

import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIENamedVectorizer

vectorizer = SIENamedVectorizer(
    base_url="http://localhost:8080",
    model="jinaai/jina-colbert-v2",
    output_types=["dense", "multivector"],
)

client = weaviate.connect_to_local()
try:
    # One self-provided named vector per output type
    collection = client.collections.create(
        "Documents",
        properties=[wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT)],
        vector_config=[
            wvc.config.Configure.Vectors.self_provided(name="dense"),
            wvc.config.Configure.Vectors.self_provided(name="multivector"),
        ],
    )

    # named[0] holds the named vectors for the first (and only) document
    named = vectorizer.embed_documents(["hello world"])
    collection.data.insert_many([
        wvc.data.DataObject(properties={"text": "hello world"}, vector=named[0])
    ])
finally:
    client.close()
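
To make the difference between the two vector types concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring over a multivector. The payload shape (a dict keyed by vector name) is an assumption inferred from `vector=named[0]` above, the numbers are made up, and the helper names are hypothetical. How Weaviate scores a given collection depends on its configuration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take the best
    dot product against any document token vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Assumed payload shape: one flat vector for "dense",
# one vector per token for "multivector".
doc = {
    "dense": [0.6, 0.8],
    "multivector": [[1.0, 0.0], [0.0, 1.0]],
}
query_multivector = [[1.0, 0.0], [0.5, 0.5]]

score = maxsim(query_multivector, doc["multivector"])
```

A dense vector compares a query and a document with one similarity computation, while the multivector compares token by token, which is why the two are stored under separate names.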

For hybrid search, Weaviate's built-in BM25 covers the keyword side, so no additional sparse vectors are needed:

results = collection.query.hybrid(query="search text", alpha=0.75)
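
In Weaviate, `alpha` weights the two halves of a hybrid query: 1 is pure vector search and 0 is pure keyword search. The sketch below illustrates the idea with a simplified relative-score fusion over made-up scores. It is not Weaviate's implementation, and `normalize` and `fuse` are hypothetical helper names.

```python
def normalize(scores):
    """Min-max scale scores to [0, 1]; identical scores all map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(vector_scores, keyword_scores, alpha):
    """Weighted sum of normalized scores; a document missing from one
    result list contributes 0 for that side."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    ids = set(v) | set(k)
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0) for i in ids}


# Made-up scores for three documents
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
keyword_scores = {"a": 2.0, "c": 6.0}

fused = fuse(vector_scores, keyword_scores, alpha=0.75)
ranking = sorted(fused, key=fused.get, reverse=True)
```

With `alpha=0.75`, document "c" has the best keyword score but still ranks last, because the vector side carries three quarters of the weight.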

Document enrichment for Query Agent

SIEDocumentEnricher combines SIE's embedding and entity extraction pipelines to produce documents with dense vectors and structured metadata. The extracted properties (persons, organizations, locations, categories) give Weaviate's Query Agent structured fields from which it can construct filters for natural language queries.

import weaviate
import weaviate.classes as wvc
from sie_weaviate import SIEDocumentEnricher

enricher = SIEDocumentEnricher(
    base_url="http://localhost:8080",
    labels=["person", "organization", "location"],
    classify_model="knowledgator/gliclass-large-v3.0",
    classify_labels=["technical", "business", "legal"],
)

client = weaviate.connect_to_local()
try:
    collection = client.collections.create(
        "Documents",
        description="Documents with extracted entity and classification metadata.",
        properties=[
            wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(
                name="person", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="People mentioned in the document",
            ),
            wvc.config.Property(
                name="organization", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="Organizations mentioned in the document",
            ),
            wvc.config.Property(
                name="location", data_type=wvc.config.DataType.TEXT_ARRAY,
                description="Locations mentioned in the document",
            ),
            wvc.config.Property(
                name="classification", data_type=wvc.config.DataType.TEXT,
                description="Document category: technical, business, or legal",
            ),
            wvc.config.Property(
                name="classification_score", data_type=wvc.config.DataType.NUMBER,
                description="Confidence score for the classification",
            ),
        ],
        vector_config=wvc.config.Configure.Vectors.self_provided(),
    )

    # Embed + extract in one call
    texts = [
        "John Smith presented Google's new AI strategy in New York.",
        "The court ruling on patent law affects tech companies.",
    ]
    docs = enricher.enrich(texts)
    collection.data.insert_many([
        wvc.data.DataObject(properties=doc.properties, vector=doc.vector)
        for doc in docs
    ])

    # The Query Agent can now filter on extracted properties:
    # "find documents about Google" → organization filter + vector search
    # "show me legal documents mentioning John Smith" → classification + person filter
    query_vec = enricher.enrich_query("AI strategy announcements")
    results = collection.query.near_vector(near_vector=query_vec, limit=5)
finally:
    client.close()
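
The schema above stores one TEXT_ARRAY property per entity label. As a rough illustration of how extracted entities could be grouped into that shape, here is a minimal sketch. The `(label, value)` tuple format and the `entities_to_properties` helper are assumptions for illustration only; the actual output of `SIEDocumentEnricher.enrich()` may be structured differently.

```python
from collections import defaultdict


def entities_to_properties(text, entities, labels):
    """Group (label, value) pairs into one de-duplicated list per label."""
    grouped = defaultdict(list)
    for label, value in entities:
        # Ignore labels outside the schema and skip repeated mentions.
        if label in labels and value not in grouped[label]:
            grouped[label].append(value)
    props = {"text": text}
    for label in labels:
        # Every schema label gets a list, empty when nothing was found.
        props[label] = grouped.get(label, [])
    return props


text = "John Smith presented Google's new AI strategy in New York."
entities = [  # hypothetical extraction output
    ("person", "John Smith"),
    ("organization", "Google"),
    ("location", "New York"),
    ("organization", "Google"),
]
props = entities_to_properties(text, entities, ["person", "organization", "location"])
```

Giving every label a list, even an empty one, keeps each object consistent with the TEXT_ARRAY properties declared in the collection.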

Testing

# Unit tests (no server needed)
pytest

# Integration tests (requires SIE + Weaviate)
pytest -m integration
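
For your own code built on this package, a server-free unit test could swap the vectorizer for a deterministic stub. The sketch below is one way to do that. `StubVectorizer` is hypothetical and only mirrors the two method names used earlier (`embed_documents`, `embed_query`); the package's own test suite is not shown here and may work differently.

```python
import hashlib


class StubVectorizer:
    """Hypothetical offline stand-in: derives a fixed-size pseudo-embedding
    from a SHA-256 digest, so results are deterministic and need no server."""

    def __init__(self, dim=8):
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed_query(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[: self.dim]]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]


def test_embed_documents_shape():
    vec = StubVectorizer(dim=8)
    out = vec.embed_documents(["first doc", "second doc"])
    assert len(out) == 2
    assert all(len(v) == 8 for v in out)


def test_embeddings_are_deterministic():
    vec = StubVectorizer()
    assert vec.embed_query("search text") == vec.embed_query("search text")
```

pytest collects the `test_` functions by name, so the file needs no pytest import and no marker.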
