bytewax-azure-ai-search

Custom sink for Azure AI Search

Development Status
- 4 - Beta
Intended Audience
- Developers

Custom sink for the Azure AI Search vector database, for real-time indexing.

bytewax-azure-ai-search is commercially licensed with publicly available source code. Please see the full details in LICENSE.

Installation and import sample

To install the package, run:

pip install bytewax-azure-ai-search

Then import the sink:

from bytewax.bytewax_azure_ai_search import AzureSearchSink

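You can then add it to your dataflow:

import bytewax.operators as op
from bytewax.connectors.files import FileSource
from bytewax.dataflow import Dataflow

# service_name and api_key hold your Azure AI Search service name and admin key.
azure_sink = AzureSearchSink(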
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

flow = Dataflow("indexing-pipeline")
input_data = op.input("input", flow, FileSource("data/news_out.jsonl"))
# safe_deserialize and process_event are your own functions; see the examples/ folder.
deserialize_data = op.map("deserialize", input_data, safe_deserialize)
extract_html = op.map("extract_html", deserialize_data, process_event)
op.output("output", extract_html, azure_sink)
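The dataflow above relies on two helper functions, safe_deserialize and process_event, that this README does not show. The sketch below is one hypothetical way to write them, assuming each input line is a JSON object and that the sink accepts dictionaries keyed by the schema field names. The input keys `text` and `url` and the `fake_embedding` helper are illustrative assumptions, not part of the package.

```python
import json
from typing import Any, Dict, List, Optional


def safe_deserialize(line: str) -> Optional[Dict[str, Any]]:
    """Parse one JSON line; return None instead of raising on bad input."""
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Only JSON objects are useful downstream.
    return parsed if isinstance(parsed, dict) else None


def fake_embedding(text: str, dimensions: int = 4) -> List[float]:
    """Placeholder for a real embedding call (hypothetical helper)."""
    return [float(len(text) % (i + 2)) for i in range(dimensions)]


def process_event(event: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Shape a parsed event into a document with the sink schema keys."""
    if event is None:
        return None
    content = str(event.get("text", ""))
    return {
        "id": str(event.get("id", "")),
        "content": content,
        "meta": json.dumps({"url": event.get("url")}),
        "vector": fake_embedding(content),
    }
```

With helpers like these, malformed lines become None. In a real dataflow you would probably drop them with op.filter_map instead of op.map so they never reach the sink.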

Note

This installation includes the following dependencies:

azure-search-documents==11.5.1
azure-common==1.1.28
azure-core==1.30.2
openai==1.44.1

These packages are used to write vectors to the appropriate services according to the Azure schema you provide. This README includes an example schema definition that works with these versions.
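Because the package pins exact versions, it can help to confirm what is installed in the active environment. The following is a minimal standard-library sketch; the `PINNED` mapping restates the list above, and `installed_version` and `check_pins` are hypothetical helper names.

```python
from importlib import metadata
from typing import Callable, Dict

# Versions copied from the dependency list above.
PINNED: Dict[str, str] = {
    "azure-search-documents": "11.5.1",
    "azure-common": "1.1.28",
    "azure-core": "1.30.2",
    "openai": "1.44.1",
}


def installed_version(package: str) -> str:
    """Return the installed version, or 'not installed' if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def check_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, str]:
    """Map each package whose installed version differs from its pin
    to the version that was actually found."""
    mismatches: Dict[str, str] = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = found
    return mismatches
```

Calling `check_pins(PINNED)` returns an empty dictionary when every pinned package is installed at the listed version.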

Setting up Azure AI services

This assumes you have set up an Azure AI Search service on the Azure portal. For more instructions, see the Azure AI Search documentation.

Optional: to generate embeddings, you can set up an Azure OpenAI service and deploy an embedding model such as text-embedding-ada-002.

Once you have set up the resources, identify and store the following information from the Azure portal:

- Your Azure AI Search admin key
- Your Azure AI Search service name
- Your Azure AI Search service endpoint URL

If you deployed an embedding model through the Azure OpenAI service:

- Your Azure OpenAI endpoint URL
- Your Azure OpenAI API key
- Your Azure OpenAI service name
- Your Azure OpenAI embedding deployment name
- Your Azure OpenAI embedding model name (e.g. text-embedding-ada-002)
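On the public Azure cloud, the endpoint URLs normally follow from the service names, which gives a quick way to sanity-check the values copied from the portal. The sketch below assumes the default public-cloud domains (`search.windows.net` and `openai.azure.com`); sovereign clouds and custom domains differ, and both function names are illustrative.

```python
def search_endpoint(service_name: str) -> str:
    """Default public-cloud endpoint for an Azure AI Search service."""
    return f"https://{service_name.strip()}.search.windows.net"


def openai_endpoint(service_name: str) -> str:
    """Default public-cloud endpoint for an Azure OpenAI resource."""
    return f"https://{service_name.strip()}.openai.azure.com/"
```

If the URL shown in the portal does not match the derived one, trust the portal value.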

Sample usage

You can find a complete example under the examples/ folder.

To run the examples, create a .env file with the following variables:

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_SERVICE=<your-azure-openai-named-service>
# Azure AI Search
AZURE_SEARCH_ADMIN_KEY=<your-azure-ai-search-admin-key>
AZURE_SEARCH_SERVICE=<your-azure-ai-search-named-service>
AZURE_SEARCH_SERVICE_ENDPOINT=<your-azure-ai-search-endpoint-url>

# Optional - if you prefer to generate embeddings with embedding models deployed on Azure
AZURE_EMBEDDING_DEPLOYMENT_NAME=<your-azure-openai-given-deployment-name>
AZURE_EMBEDDING_MODEL_NAME=<your-azure-openai-model-name>

# Optional - if you prefer to generate the embeddings with OpenAI
OPENAI_API_KEY=<your-openai-key>
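Before running the examples, it can be useful to check that the .env file defines every key you need. This is a small standard-library sketch; the examples may load the file differently (for instance with python-dotenv), and the `REQUIRED` tuple and function names are assumptions based on the Azure AI Search keys listed above.

```python
from typing import Dict, Iterable, List

# Assumed minimum: the Azure AI Search keys from the .env listing above.
REQUIRED = (
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when all three search settings are present.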

Set up the connection and schema by running:

python connection.py

You can verify that the index was created by visiting the portal. If you click on the created index and press "Search", you should see that it exists but is still empty.

Generate the embeddings and store them in Azure AI Search through the bytewax-azure-ai-search sink:

python -m bytewax.run dataflow:flow

Verify the index was populated by pressing "Search" with an empty query.
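The same check can be scripted against the Azure AI Search REST API, which returns documents for `search=*` and includes a total when `$count=true` is set. The sketch below only builds the request with the standard library, so it can be inspected without network access. `build_count_request` is a hypothetical helper, and the URL assumes the public-cloud `search.windows.net` domain.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_count_request(
    service: str,
    index: str,
    api_key: str,
    api_version: str = "2024-07-01",
) -> Request:
    """Build a GET request that matches every document and asks for a count."""
    query = urlencode(
        {"api-version": api_version, "search": "*", "$count": "true"},
        safe="$*",  # keep the OData parameter name and wildcard readable
    )
    url = f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"
    return Request(url, headers={"api-key": api_key}, method="GET")
```

Passing the request to `urllib.request.urlopen` and decoding the JSON body should give the total under the `@odata.count` key, which stays at zero until the dataflow has written documents.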

Note

In the dataflow we initialized the custom sink as follows:

from bytewax.bytewax_azure_ai_search import AzureSearchSink

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

The schema and its structure need to match the schema you configure through the Azure AI Search Python API. For more information, see the Azure AI Search documentation.
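To make the shape of the sink schema concrete, the sketch below fills a record with the declared defaults and coerces values to the declared types. It illustrates how the schema dictionary can be read and is not the sink's actual implementation; `conform` and its coercion rules are assumptions.

```python
from typing import Any, Dict

# Same schema dictionary as passed to AzureSearchSink above.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "id": {"type": "string", "default": None},
    "content": {"type": "string", "default": None},
    "meta": {"type": "string", "default": None},
    "vector": {"type": "collection", "item_type": "single", "default": []},
}


def conform(record: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a document with exactly the schema's fields.

    Missing fields take the declared default; strings are coerced with
    str() and collections are copied into a new list of floats.
    """
    doc: Dict[str, Any] = {}
    for name, spec in schema.items():
        value = record.get(name, spec.get("default"))
        if spec["type"] == "collection":
            value = [float(x) for x in (value or [])]
        elif value is not None:
            value = str(value)
        doc[name] = value
    return doc
```

Fields that are not declared in the schema are dropped, which mirrors the idea that the index only accepts the fields it defines.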

In this example:

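from azure.search.documents.indexes.models import (
    SearchableField,
    SearchFieldDataType,
    SimpleField,
)

# Embedding size; 1536 is an assumption that matches text-embedding-ada-002.
DIMENSIONS = 1536

# Define schema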
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
        sortable=True,
        facetable=True,
        key=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SearchableField(
        name="meta",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SimpleField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Double),
        searchable=False,
        filterable=False,
        sortable=False,
        facetable=False,
        vector_search_dimensions=DIMENSIONS,
        vector_search_profile_name="myHnswProfile",
    ),
]
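Since the sink schema and the index definition must agree, comparing field names can catch mismatches before any documents are written. The following is a minimal sketch, assuming you have the sink schema dictionary and the index field names at hand; `field_name_mismatches` is a hypothetical helper.

```python
from typing import Dict, Iterable, Set


def field_name_mismatches(
    sink_schema: Dict[str, dict],
    index_field_names: Iterable[str],
) -> Dict[str, Set[str]]:
    """Report field names that appear on only one side."""
    sink = set(sink_schema)
    index = set(index_field_names)
    return {
        "missing_in_index": sink - index,
        "missing_in_sink": index - sink,
    }
```

With the `fields` list above, the names can be collected with `[f.name for f in fields]`; both sets in the result should be empty when the two definitions line up.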

For developers - Setting up the project

Install `just`

We use just as a command runner for actions / recipes related to developing Bytewax. Please follow the installation instructions. There's probably a package for your OS already.

Install `pyenv` and Python 3.12

I suggest using pyenv to manage Python versions; see its installation instructions.

You can also use your OS's package manager to get access to different Python versions.

Ensure that you have Python 3.12 installed and available as a "global shim" so that it can be run anywhere. The following will make plain python run your OS-wide interpreter, but will make 3.12 available via python3.12.

$ pyenv global system 3.12

Install `uv`

We use uv as a virtual environment creator, package installer, and dependency pin-er. There are a few different ways to install it, but I recommend installing it through either brew on macOS or pipx.

Development

We have a just recipe that will:

- Set up a venv in venvs/dev/.
- Install all dependencies into it in a reproducible way.

Start by adding any dependencies that are needed into pyproject.toml or into requirements/dev.in if they are needed for development.

Next, generate the pinned set of dependencies with:

> just venv-compile-all

Create and activate a virtual environment

Once you have compiled your dependencies, run the following:

> just get-started

Activate your development environment and run the development task:

> . venvs/dev/bin/activate
> just develop

License

bytewax-azure-ai-search is commercially licensed with publicly available source code. You are welcome to prototype using this module for free, but any use on business data requires a paid license. See https://modules.bytewax.io/ for a license. Please see the full details in LICENSE.

Release history

- 0.1.2, Sep 12, 2024 (this version)
- 0.1.1, Sep 12, 2024
- 0.1, Aug 19, 2024

Download files

Source Distribution

bytewax_azure_ai_search-0.1.2.tar.gz (26.0 kB), uploaded Sep 12, 2024

Built Distribution

bytewax_azure_ai_search-0.1.2-py3-none-any.whl (22.4 kB), uploaded Sep 12, 2024, Python 3

Hashes for bytewax_azure_ai_search-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`975e99caf3c548c418d4a31e66fd85719f1bc366cb6f0029e4f3760d8871ee95`
MD5	`ed839ce6c4583398610407b05264a670`
BLAKE2b-256	`e45f31c87e615398b9e70ff14cff0977d53bf17e499a7282567f16daca1143ab`

Hashes for bytewax_azure_ai_search-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3b8e62a32ac6ea1fabe400f1bb25c2efb303f5e146ddf9a81bd141c15cbacba9`
MD5	`10136849076a65d894ac9166673010b4`
BLAKE2b-256	`14d2ea77826e7d81198c27fcc6c45ffd815c9c52a4ceb0eb81efb0b4423a8e1b`
