
LangChain integrations for Azure Storage


langchain-azure-storage

This package contains the LangChain integrations for Azure Storage. Currently, it includes the Azure Blob Storage document loader (AzureBlobStorageLoader).

Note: This package is in Public Preview. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Installation

pip install -U langchain-azure-storage

Configuration

langchain-azure-storage should work without any explicit credential configuration.

The langchain-azure-storage interface defaults to DefaultAzureCredential, which automatically retrieves Microsoft Entra ID tokens based on your current environment. For more information on using credentials with langchain-azure-storage, see the "Override default credentials" section.

Azure Blob Storage Document Loader Usage

Document loaders load data from many sources (e.g., cloud storage or web pages) and turn it into LangChain Documents, which can then be used in AI applications (e.g., RAG). This package offers the AzureBlobStorageLoader, which downloads blob content from Azure Blob Storage and parses it as UTF-8 by default. Parsing can also be customized to handle content of various file types and to control document chunking.

The AzureBlobStorageLoader replaces the current AzureBlobStorageContainerLoader and AzureBlobStorageFileLoader in the LangChain Community Document Loaders. Refer to the migration section for more details.

The following examples go over the various use cases for the document loader.

Load from container

The example below shows how to load documents from all blobs in a given container in Azure Blob Storage:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
)

for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each blob in UTF-8 encoding.
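
Because lazy_load() yields documents one at a time, downstream steps can consume them in fixed-size batches instead of holding a whole container in memory. The following is a minimal sketch of that pattern, not part of langchain-azure-storage. FakeDocument, FakeLoader, and in_batches are hypothetical stand-ins so the snippet runs without Azure access; in practice the iterator would come from AzureBlobStorageLoader.lazy_load().

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (page_content only)."""

    page_content: str


class FakeLoader:
    """Hypothetical stand-in for a loader; yields one document per 'blob'."""

    def __init__(self, contents: List[str]) -> None:
        self._contents = contents

    def lazy_load(self) -> Iterator[FakeDocument]:
        for text in self._contents:
            yield FakeDocument(page_content=text)


def in_batches(docs: Iterable[FakeDocument], size: int) -> Iterator[List[FakeDocument]]:
    """Group a lazy document stream into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


loader = FakeLoader(["alpha", "beta", "gamma"])
batches = [
    [doc.page_content for doc in batch]
    for batch in in_batches(loader.lazy_load(), size=2)
]
print(batches)  # [['alpha', 'beta'], ['gamma']]
```

Only one batch is materialized at a time, so memory use is bounded by the batch size rather than by the number of blobs.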

The example below shows how to load documents from blobs in a container with a given prefix:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="test",
)

for doc in loader.lazy_load():
    print(doc.page_content)
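
To make the effect of prefix concrete, the sketch below imitates the filter locally. It assumes the prefix is compared, case-sensitively, against the start of the full blob name, which is how the Azure List Blobs operation describes its prefix parameter. The helper names_matching_prefix and the blob names are hypothetical and only illustrate the matching rule; the loader itself is not involved.

```python
from typing import Iterable, List


def names_matching_prefix(blob_names: Iterable[str], prefix: str) -> List[str]:
    """Return the names that begin with `prefix` (plain string match)."""
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical blob names, for illustration only.
names = ["test/a.txt", "test-data.csv", "docs/test.txt", "Test/b.txt"]
matched = names_matching_prefix(names, "test")
print(matched)  # ['test/a.txt', 'test-data.csv']
```

Under that assumption, a prefix of "test" matches both a virtual directory ("test/...") and any blob whose name merely starts with "test"; use "test/" to restrict the load to the virtual directory.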

Load from container by blob name

The example below shows how to load documents from a list of blobs in Azure Blob Storage. This approach does not call the List Blobs operation and instead loads only the blobs provided:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names=["blob-1", "blob-2", "blob-3"],
)

for doc in loader.lazy_load():
    print(doc.page_content)

Override default credentials

The example below shows how to override the default credentials used by the document loader:

from azure.core.credentials import AzureSasCredential
from azure.identity import ManagedIdentityCredential
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

# Override with SAS token
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=AzureSasCredential("<sas-token>")
)

# Override with more specific token credential than the entire
# default credential chain (e.g., system-assigned managed identity)
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=ManagedIdentityCredential()
)

Customizing blob content parsing

By default, each blob is returned as a single Document object with its content decoded as UTF-8, regardless of file type. For file types that require specific parsing (e.g., PDFs or CSVs), or when you want to control the document content format, you can pass the loader_factory argument an existing document loader (e.g., PyPDFLoader or CSVLoader) or a custom loader.

This works by downloading the blob content to a temporary file. The loader_factory is then called with the path of that file, and the document loader it returns parses the file and produces the Document object(s).

The example below shows how to override the default loader and parse blobs as PDFs using the PyPDFLoader:
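
The flow described above can be sketched with the standard library alone. This is an illustration of the idea under stated assumptions, not the package's actual implementation: FakeDocument, LineLoader, and parse_with_factory are hypothetical names, and the bytes stand in for downloaded blob content.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (page_content only)."""

    page_content: str


class LineLoader:
    """Hypothetical file-based loader: one document per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[FakeDocument]:
        with open(self.file_path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield FakeDocument(page_content=line.strip())


def parse_with_factory(
    blob_bytes: bytes, loader_factory: Callable[[str], LineLoader]
) -> Iterator[FakeDocument]:
    """Write bytes to a temporary file, then parse it via the factory's loader."""
    with tempfile.TemporaryDirectory() as temp_dir:
        file_path = os.path.join(temp_dir, "blob-content")
        with open(file_path, "wb") as handle:
            handle.write(blob_bytes)
        # The factory receives only the file path and returns a loader for it.
        yield from loader_factory(file_path).lazy_load()


docs = list(parse_with_factory(b"first\n\nsecond\n", LineLoader))
print([doc.page_content for doc in docs])  # ['first', 'second']
```

Because the factory sees only a file path, any loader whose constructor accepts a path can be passed directly, and a small wrapper function covers loaders that need extra arguments.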

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=PyPDFLoader,
)

for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each page as a separate document

To provide additional configuration, you can define a callable that returns an instantiated document loader as shown below:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

def loader_factory(file_path: str) -> PyPDFLoader:
    return PyPDFLoader(
        file_path,
        mode="single",  # To return the PDF as a single document instead of extracting documents by page
    )

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=loader_factory,
)

for doc in loader.lazy_load():
    print(doc.page_content)

Migrating from LangChain Community Azure Storage Document Loaders

This section goes over the actions required to migrate from the existing community document loaders to the new Azure Blob Storage document loader:

  1. Depend on the langchain-azure-storage package instead of langchain-community.
  2. Update import statements from langchain_community.document_loaders to langchain_azure_storage.document_loaders.
  3. Change class names from AzureBlobStorageFileLoader and AzureBlobStorageContainerLoader to AzureBlobStorageLoader.
  4. Update document loader constructor calls to:
    1. Use an account URL instead of a connection string.
    2. Specify UnstructuredLoader as the loader_factory if you want to continue to use Unstructured for parsing documents.
  5. Ensure the environment has proper credentials (e.g., by running the az login command or setting up a managed identity), as the connection string would previously have contained the credentials.
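
For step 4.1, the account URL can usually be derived from the connection string you already have. The helper below is a hypothetical convenience, not part of langchain-azure-storage; it assumes the standard semicolon-separated key=value connection string format (AccountName, EndpointSuffix, and optionally BlobEndpoint) and deliberately discards the account key.

```python
from typing import Dict


def account_url_from_connection_string(conn_str: str) -> str:
    """Derive a Blob service account URL from a storage connection string."""
    settings: Dict[str, str] = {}
    for part in conn_str.split(";"):
        if not part.strip():
            continue
        # Split on the first "=" only; account keys may end with "=" padding.
        key, _, value = part.partition("=")
        settings[key.strip()] = value.strip()

    # An explicit blob endpoint, when present, takes precedence.
    if "BlobEndpoint" in settings:
        return settings["BlobEndpoint"].rstrip("/")

    account_name = settings["AccountName"]  # raises KeyError if absent
    protocol = settings.get("DefaultEndpointsProtocol", "https")
    suffix = settings.get("EndpointSuffix", "core.windows.net")
    return f"{protocol}://{account_name}.blob.{suffix}"


# Hypothetical connection string with a placeholder key.
conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=examplestorage;"
    "AccountKey=<redacted>;EndpointSuffix=core.windows.net"
)
account_url = account_url_from_connection_string(conn_str)
print(account_url)  # https://examplestorage.blob.core.windows.net
```

The resulting string is what the account_url argument expects; authentication then comes from the environment or from an explicit credential rather than from the account key.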

The examples below show the code before and after migrating to the langchain-azure-storage package:

Before migration

from langchain_community.document_loaders import AzureBlobStorageFileLoader, AzureBlobStorageContainerLoader

file_loader = AzureBlobStorageFileLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    blob_name="<my-blob-name>",
)

container_loader = AzureBlobStorageContainerLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    prefix="<prefix>",
)

After migration

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader

file_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-blob-name>",
)

container_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="<prefix>",
    loader_factory=UnstructuredLoader,
)
