
Custom sink for Azure AI Search


bytewax-azure-ai-search

A custom Bytewax sink for real-time indexing into the Azure AI Search vector database.

bytewax-azure-ai-search is commercially licensed with publicly available source code. Please see the full details in LICENSE.

Installation and import sample

To install you can run

pip install bytewax-azure-ai-search

Then import

from bytewax.bytewax_azure_ai_search import AzureSearchSink

You can then add it to your dataflow

import bytewax.operators as op
from bytewax.connectors.files import FileSource
from bytewax.dataflow import Dataflow

# service_name, api_key, safe_deserialize, and process_event are defined elsewhere.
azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

flow = Dataflow("indexing-pipeline")
input_data = op.input("input", flow, FileSource("data/news_out.jsonl"))
deserialize_data = op.map("deserialize", input_data, safe_deserialize)
extract_html = op.map("extract_html", deserialize_data, process_event)
op.output("output", extract_html, azure_sink)
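
The dataflow above calls `safe_deserialize` and `process_event` without showing them. The following is only a sketch of what such helpers might look like, assuming each input line is a JSON object with `id`, `content`, and `meta` keys. Both function bodies and the `embed` placeholder are hypothetical and are not taken from the package or its examples.

```python
import json
from typing import Any, Dict, List, Optional


def safe_deserialize(line: str) -> Optional[Dict[str, Any]]:
    """Parse one JSONL line; return None instead of raising on bad input."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Only JSON objects are useful here; lists, numbers, etc. are rejected.
    return record if isinstance(record, dict) else None


def embed(text: str) -> List[float]:
    """Placeholder embedding (hypothetical); swap in a real embedding call."""
    return [float(len(text))]


def process_event(event: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Shape a record so it carries the four fields named in the sink schema."""
    event = event or {}
    content = str(event.get("content", ""))
    return {
        "id": str(event.get("id", "")),
        "content": content,
        # The schema declares meta as a string, so nested data is serialized.
        "meta": json.dumps(event.get("meta", {})),
        "vector": embed(content),
    }
```

Note that this sketch turns an unparseable line into a record with empty fields. In a real pipeline you would probably drop those records before they reach the sink.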

Note

This installation includes the following dependencies:

azure-search-documents==11.5.1
azure-common==1.1.28
azure-core==1.30.2
openai==1.44.1

These packages are used to write the vectors to the appropriate services according to the Azure schema you provide. This README includes a schema definition that works with these versions.

Setting up Azure AI services

This assumes you have set up an Azure AI Search service in the Azure portal. For instructions, see the Azure AI Search documentation.

Optional: to generate embeddings, you can set up an Azure OpenAI service and deploy an embedding model such as text-embedding-ada-002.

Once you have set up the resources, identify and store the following information from the Azure portal:

  • Your Azure AI Search admin key
  • Your Azure AI Search service name
  • Your Azure AI Search service endpoint URL

If you deployed an embedding model through the Azure OpenAI service:

  • Your Azure OpenAI endpoint URL
  • Your Azure OpenAI API key
  • Your Azure OpenAI service name
  • Your Azure OpenAI embedding deployment name
  • Your Azure OpenAI embedding model name (e.g. text-embedding-ada-002)

Sample usage

You can find a complete example under the examples/ folder.

To run the examples, create a .env file with the following variables:

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_SERVICE=<your-azure-openai-named-service>
# Azure AI Search
AZURE_SEARCH_ADMIN_KEY=<your-azure-ai-search-admin-key>
AZURE_SEARCH_SERVICE=<your-azure-ai-search-named-service>
AZURE_SEARCH_SERVICE_ENDPOINT=<your-azure-ai-search-endpoint-url>

# Optional - if you prefer to generate embeddings with embedding models deployed on Azure
AZURE_EMBEDDING_DEPLOYMENT_NAME=<your-azure-openai-given-deployment-name>
AZURE_EMBEDDING_MODEL_NAME=<your-azure-openai-model-name>

# Optional - if you prefer to generate the embeddings with OpenAI
OPENAI_API_KEY=<your-openai-key>
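
Before running the examples, it can help to confirm that the required search variables are actually set. The sketch below assumes the values from the .env file have already been loaded into the process environment (how the examples load them is not shown here); the `load_search_settings` helper is hypothetical and not part of the package.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the .env listing above.
REQUIRED_KEYS = (
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
)


def load_search_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required search settings, failing fast if any are unset."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip stray whitespace, e.g. from a space typed after '=' in the .env file.
    return {key: env[key].strip() for key in REQUIRED_KEYS}
```

Calling `load_search_settings()` at the top of a script surfaces a missing key as one clear error rather than a failed request later on.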

Set up the connection and schema by running

python connection.py

To verify that the index was created, open it in the Azure portal and press "Search". The index should exist but be empty at this point.

Generate the embeddings and store them in Azure AI Search through the bytewax-azure-ai-search sink:

python -m bytewax.run dataflow:flow

Verify the index was populated by pressing "Search" with an empty query.
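
As an alternative to checking in the portal, the document count can be read over the Azure AI Search REST API. This is a minimal sketch using only the standard library. It assumes the default public endpoint form `https://<service>.search.windows.net` and the admin key from your .env file; the function names are hypothetical.

```python
import urllib.request


def build_count_request(
    service: str, index: str, api_version: str, admin_key: str
) -> urllib.request.Request:
    """Build a GET request for the index's document count."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs/$count?api-version={api_version}"
    )
    return urllib.request.Request(url, headers={"api-key": admin_key})


def count_documents(
    service: str, index: str, api_version: str, admin_key: str
) -> int:
    """Send the request and parse the plain-text count from the response."""
    request = build_count_request(service, index, api_version, admin_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        # utf-8-sig tolerates a byte-order mark in front of the number.
        return int(response.read().decode("utf-8-sig"))
```

For example, `count_documents(service_name, "bytewax-index", "2024-07-01", api_key)` should return 0 right after the index is created and a positive number once the dataflow has run.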

Note

In the dataflow we initialized the custom sink as follows:

from bytewax.bytewax_azure_ai_search import AzureSearchSink

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

The schema passed to the sink must match the index schema you configure through the Azure AI Search Python API. For more information, see the Azure AI Search documentation.

In this example, the index fields are defined as follows:

from azure.search.documents.indexes.models import (
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SimpleField,
)

# Embedding size (assumed here): 1536 matches text-embedding-ada-002; adjust to your model.
DIMENSIONS = 1536

# Define schema
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
        sortable=True,
        facetable=True,
        key=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SearchableField(
        name="meta",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    # Vector fields use SearchField (SimpleField drops the vector options),
    # a Collection(Single) type to match the sink's "single" item type,
    # and searchable=True.
    SearchField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        vector_search_dimensions=DIMENSIONS,
        vector_search_profile_name="myHnswProfile",
    ),
]
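
Because the sink schema and the index fields have to agree, it may be useful to check records against the schema dictionary before they are emitted. The helper below is a hypothetical pre-check written for illustration; it does not describe how the sink handles the schema internally.

```python
from typing import Any, Dict

# Same schema dictionary that is passed to AzureSearchSink above.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "id": {"type": "string", "default": None},
    "content": {"type": "string", "default": None},
    "meta": {"type": "string", "default": None},
    "vector": {"type": "collection", "item_type": "single", "default": []},
}


def conform(
    record: Dict[str, Any], schema: Dict[str, Dict[str, Any]] = SCHEMA
) -> Dict[str, Any]:
    """Keep only schema fields, fill in defaults, and check basic types."""
    out: Dict[str, Any] = {}
    for name, spec in schema.items():
        default = spec.get("default")
        # Copy list defaults so records never share one list object.
        fallback = list(default) if isinstance(default, list) else default
        value = record.get(name, fallback)
        if value is not None:
            if spec["type"] == "string" and not isinstance(value, str):
                raise TypeError(f"{name!r} must be a string")
            if spec["type"] == "collection":
                # Coerce every item to float to fit a single-precision collection.
                value = [float(item) for item in value]
        out[name] = value
    return out
```

A record with an extra key or a missing `meta` comes back with exactly the four schema fields, while a non-string `id` raises a `TypeError` before anything is sent to Azure.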

For developers - Setting up the project

Install just

We use just as a command runner for actions / recipes related to developing Bytewax. Please follow the installation instructions. There's probably a package for your OS already.

Install pyenv and Python 3.12

I suggest using pyenv to manage Python versions. Follow the pyenv installation instructions.

You can also use your OS's package manager to get access to different Python versions.

Ensure that you have Python 3.12 installed and available as a "global shim" so that it can be run anywhere. The following will make plain python run your OS-wide interpreter, but will make 3.12 available via python3.12.

$ pyenv global system 3.12

Install uv

We use uv as a virtual environment creator, package installer, and dependency pin-er. There are a few different ways to install it, but I recommend installing it through either brew on macOS or pipx.

Development

We have a just recipe that will:

  1. Set up a venv in venvs/dev/.

  2. Install all dependencies into it in a reproducible way.

Start by adding any dependencies that are needed into pyproject.toml or into requirements/dev.in if they are needed for development.

Next, generate the pinned set of dependencies with

> just venv-compile-all

Create and activate a virtual environment

Once you have compiled your dependencies, run the following:

> just get-started

Activate your development environment and run the development task:

> . venvs/dev/bin/activate
> just develop

License

bytewax-azure-ai-search is commercially licensed with publicly available source code. You are welcome to prototype using this module for free, but any use on business data requires a paid license. See https://modules.bytewax.io/ for a license. Please see the full details in LICENSE.
