
LangChain integrations for Azure Storage


langchain-azure-storage

This package contains the LangChain integrations for Azure Storage. Currently, it includes the Azure Blob Storage document loader (AzureBlobStorageLoader).

Note: This package is in Public Preview. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Installation

pip install -U langchain-azure-storage

Configuration

langchain-azure-storage should work without any explicit credential configuration.

The langchain-azure-storage interface defaults to DefaultAzureCredential, which automatically retrieves Microsoft Entra ID tokens based on your current environment. For more information on using credentials with langchain-azure-storage, see the override default credentials section.

Azure Blob Storage Document Loader Usage

Document Loaders load data from many sources (e.g., cloud storage, web pages, etc.) and turn it into LangChain Documents, which can then be used in AI applications (e.g., RAG). This package offers the AzureBlobStorageLoader, which downloads blob content from Azure Blob Storage and parses it as UTF-8 by default. Parsing can also be customized to handle content of various file types and to control document chunking.

The AzureBlobStorageLoader replaces the current AzureBlobStorageContainerLoader and AzureBlobStorageFileLoader in the LangChain Community Document Loaders. Refer to the migration section for more details.

The following examples go over the various use cases for the document loader.

Load from container

The example below shows how to load documents from all blobs in a given container in Azure Blob Storage:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
)

for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each blob in UTF-8 encoding.

The example below shows how to load documents from blobs in a container with a given prefix:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="test",
)

for doc in loader.lazy_load():
    print(doc.page_content)
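
To make the prefix behavior concrete: Azure Blob Storage matches a prefix against the start of each blob name, so the filter is a case-sensitive string match rather than a directory lookup. The sketch below imitates that matching locally with hypothetical blob names. It does not call Azure, and match_prefix is an illustrative helper, not part of the package.

```python
# Hypothetical blob names; in practice the service lists the container.
blob_names = ["test/a.txt", "testing.txt", "docs/test.txt", "Test/b.txt"]


def match_prefix(names: list[str], prefix: str) -> list[str]:
    """Return the names that start with prefix (case-sensitive)."""
    return [name for name in names if name.startswith(prefix)]


matched = match_prefix(blob_names, "test")
print(matched)  # ['test/a.txt', 'testing.txt']
```

A trailing slash (e.g., prefix="test/") narrows the match to blobs under that virtual directory.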

Load from container by blob name

The example below shows how to load documents from a list of blobs in Azure Blob Storage. This approach does not call list blobs and instead uses only the blobs provided:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names=["blob-1", "blob-2", "blob-3"],
)

for doc in loader.lazy_load():
    print(doc.page_content)

Override default credentials

The example below shows how to override the default credentials used by the document loader:

from azure.core.credentials import AzureSasCredential
from azure.identity import ManagedIdentityCredential
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

# Override with SAS token
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=AzureSasCredential("<sas-token>")
)

# Override with more specific token credential than the entire
# default credential chain (e.g., system-assigned managed identity)
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=ManagedIdentityCredential()
)

Customizing blob content parsing

By default, each blob is returned as a single Document object with its content decoded as UTF-8, regardless of the file type. For file types that require specific parsing (e.g., PDFs, CSVs, etc.), or when you want to control the document content format, you can set the loader_factory argument to an existing document loader (e.g., PyPDFLoader, CSVLoader, etc.) or to a custom loader.

This works by downloading the blob content to a temporary file. The loader_factory then gets called with the filepath to use the specified document loader to load/parse the file and return the Document object(s).
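
The flow described above can be sketched without any Azure dependency. This is a minimal illustration under stated assumptions, not the package's implementation: Document is a local stand-in for the LangChain class, LineLoader is a hypothetical custom loader, and parse_with_factory is a hypothetical helper that plays the role of the download-then-parse step.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Document:
    """Stand-in for a LangChain Document (assumption: only these fields matter here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical custom loader: yields one Document per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        with open(self.file_path, encoding="utf-8") as f:
            for number, line in enumerate(f, start=1):
                if line.strip():
                    yield Document(line.rstrip("\n"), {"line": number})


def parse_with_factory(
    content: bytes, loader_factory: Callable[[str], LineLoader]
) -> list[Document]:
    """Write content to a temporary file, then let the factory's loader parse it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        # The factory receives only the file path, as described above.
        return list(loader_factory(path).lazy_load())
    finally:
        os.remove(path)  # Remove the temporary file once parsing is done.


docs = parse_with_factory(b"first\n\nsecond\n", LineLoader)
print([d.page_content for d in docs])  # ['first', 'second']
```

Because the factory only ever sees a file path, any loader whose constructor accepts a path as its first argument fits this shape.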

The example below shows how to override the default loader and parse blobs as PDFs using the PyPDFLoader:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=PyPDFLoader,
)

for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each page as a separate document

To provide additional configuration, you can define a callable that returns an instantiated document loader as shown below:

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

def loader_factory(file_path: str) -> PyPDFLoader:
    return PyPDFLoader(
        file_path,
        mode="single",  # To return the PDF as a single document instead of extracting documents by page
    )

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=loader_factory,
)

for doc in loader.lazy_load():
    print(doc.page_content)
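
Since the loader_factory is called with the file path, functools.partial is another way to bind extra configuration without writing a wrapper function. The sketch below uses FakePDFLoader, a hypothetical stand-in whose constructor takes a file path and a mode keyword, so it runs without any PDF library installed. Substituting a real loader assumes its constructor accepts the same shape of arguments.

```python
from functools import partial


class FakePDFLoader:
    """Hypothetical stand-in for a loader taking a file path and a mode option."""

    def __init__(self, file_path: str, mode: str = "page") -> None:
        self.file_path = file_path
        self.mode = mode


# Bind the keyword argument up front; the file path is supplied later.
loader_factory = partial(FakePDFLoader, mode="single")

loader = loader_factory("/tmp/example.pdf")
print(loader.file_path, loader.mode)  # /tmp/example.pdf single
```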

Migrating from LangChain Community Azure Storage Document Loaders

This section goes over the actions required to migrate from the existing community document loaders to the new Azure Blob Storage document loader:

  1. Depend on the langchain-azure-storage package instead of langchain-community.
  2. Update import statements from langchain_community.document_loaders to langchain_azure_storage.document_loaders.
  3. Change class names from AzureBlobStorageFileLoader and AzureBlobStorageContainerLoader to AzureBlobStorageLoader.
  4. Update document loader constructor calls to:
    1. Use an account URL instead of a connection string.
    2. Specify UnstructuredLoader as the loader_factory if you want to continue using Unstructured for parsing documents.
  5. Ensure the environment has proper credentials (e.g., by running the az login command or setting up a managed identity), as the connection string previously contained the credentials.
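
For step 4.1, the account URL can usually be derived from the connection string you already have. The helper below is a hypothetical sketch, not part of the package. It assumes the standard key=value;key=value connection string layout with either a BlobEndpoint entry or an AccountName entry. The account key is deliberately ignored, since credentials are supplied separately after migration.

```python
def account_url_from_connection_string(conn_str: str) -> str:
    """Derive a Blob Storage account URL from a connection string (sketch)."""
    parts = dict(
        segment.split("=", 1) for segment in conn_str.split(";") if "=" in segment
    )
    # An explicit BlobEndpoint, when present, is already the account URL.
    if "BlobEndpoint" in parts:
        return parts["BlobEndpoint"].rstrip("/")
    protocol = parts.get("DefaultEndpointsProtocol", "https")
    suffix = parts.get("EndpointSuffix", "core.windows.net")
    return f"{protocol}://{parts['AccountName']}.blob.{suffix}"


conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;"
    "AccountKey=<redacted>;EndpointSuffix=core.windows.net"
)
account_url = account_url_from_connection_string(conn_str)
print(account_url)  # https://mystorageaccount.blob.core.windows.net
```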

The examples below show the before and after migrating to the langchain-azure-storage package:

Before migration

from langchain_community.document_loaders import AzureBlobStorageFileLoader, AzureBlobStorageContainerLoader

file_loader = AzureBlobStorageFileLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    blob_name="<my-blob-name>",
)

container_loader = AzureBlobStorageContainerLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    prefix="<prefix>",
)

After migration

from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader

file_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-blob-name>",
)

container_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="<prefix>",
    loader_factory=UnstructuredLoader,
)
