LangChain integrations for Azure Storage
langchain-azure-storage
This package contains the LangChain integrations for Azure Storage. Currently, it includes the Azure Blob Storage document loader (AzureBlobStorageLoader).
Note: This package is in Public Preview. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Installation
pip install -U langchain-azure-storage
Configuration
langchain-azure-storage should work without any explicit credential configuration.
The langchain-azure-storage interface defaults to DefaultAzureCredential
for credentials, which automatically retrieves Microsoft Entra ID tokens based on
your current environment. For more information on using credentials with
langchain-azure-storage, see the "Override default credentials" section.
Azure Blob Storage Document Loader Usage
Document loaders load data from many sources (e.g., cloud storage or web pages) and turn it into LangChain Documents, which can then be used in AI applications (e.g., RAG). This package offers the AzureBlobStorageLoader, which downloads blob content from Azure Blob Storage and, by default, parses it as UTF-8 text. Parsing can also be customized to handle other file types and to control document chunking.
The AzureBlobStorageLoader replaces the current AzureBlobStorageContainerLoader and AzureBlobStorageFileLoader in the LangChain Community Document Loaders. Refer to the migration section for more details.
The following examples go over the various use cases for the document loader.
Load from container
The example below shows how to load documents from all blobs in a given container in Azure Blob Storage:
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
)
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each blob in UTF-8 encoding.
The example below shows how to load documents from blobs in a container with a given prefix:
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="test",
)
for doc in loader.lazy_load():
    print(doc.page_content)
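As a rough illustration of what the prefix argument selects, the sketch below filters a made-up list of blob names locally. It only models the matching rule (a blob is included when its name starts with the prefix). It assumes the loader hands the prefix to the Blob Storage listing call rather than filtering client-side; select_by_prefix and the blob names are hypothetical and not part of this package.

```python
def select_by_prefix(blob_names, prefix=None):
    """Return the names that start with `prefix` (all names if prefix is None)."""
    if prefix is None:
        return list(blob_names)
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical container listing, for illustration only.
names = ["test/a.txt", "test-data.csv", "docs/test.txt", "readme.md"]
selected = select_by_prefix(names, prefix="test")
print(selected)
```

Note that a prefix is a plain string match on the blob name, not a folder boundary: "test" matches both "test/a.txt" and "test-data.csv". Use a prefix such as "test/" to restrict the match to a virtual directory.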
Load from container by blob name
The example below shows how to load documents from a list of blobs in Azure Blob Storage. This approach does not call list blobs and instead uses only the blobs provided:
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names=["blob-1", "blob-2", "blob-3"],
)
for doc in loader.lazy_load():
    print(doc.page_content)
Override default credentials
The example below shows how to override the default credentials used by the document loader:
from azure.core.credentials import AzureSasCredential
from azure.identity import ManagedIdentityCredential
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader

# Override with SAS token
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=AzureSasCredential("<sas-token>"),
)

# Override with more specific token credential than the entire
# default credential chain (e.g., system-assigned managed identity)
loader = AzureBlobStorageLoader(
    "https://<my-storage-account-name>.blob.core.windows.net",
    "<my-container-name>",
    credential=ManagedIdentityCredential(),
)
Customizing blob content parsing
By default, each blob is returned as a single Document object whose content is decoded as UTF-8, regardless of the file type. For file types that require specific parsing (e.g., PDFs or CSVs), or when you want to control the document content format, you can pass the loader_factory argument with an existing document loader (e.g., PyPDFLoader or CSVLoader) or a custom loader.
This works by downloading the blob content to a temporary file. The loader_factory is then called with the file path, and the document loader it returns parses the file and produces the Document object(s).
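The flow described above can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the package's actual implementation: LineLoader and parse_blob are hypothetical names, plain strings stand in for Document objects, and the blob content is passed in as bytes instead of being downloaded.

```python
import os
import tempfile
from typing import Callable, Iterator, List


class LineLoader:
    """Hypothetical stand-in for a document loader: one 'document' per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[str]:
        with open(self.file_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.strip()


def parse_blob(content: bytes, loader_factory: Callable[[str], LineLoader]) -> List[str]:
    """Write blob bytes to a temporary file, then let the factory's loader parse it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        file_path = os.path.join(tmp_dir, "blob")
        with open(file_path, "wb") as f:
            f.write(content)
        # The factory receives only the file path and returns a loader instance.
        loader = loader_factory(file_path)
        # Consume the loader before the temporary directory is deleted.
        return list(loader.lazy_load())


docs = parse_blob(b"first line\n\nsecond line\n", LineLoader)
print(docs)
```

Because the factory is just a callable that takes a file path, a loader class can be passed directly, or a function can wrap it to supply extra arguments.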
The example below shows how to override the default loader to parse blobs as PDFs using the PyPDFLoader:
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=PyPDFLoader,
)
for doc in loader.lazy_load():
    print(doc.page_content)  # Prints content of each page as a separate document
To provide additional configuration, you can define a callable that returns an instantiated document loader as shown below:
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_community.document_loaders import PyPDFLoader

def loader_factory(file_path: str) -> PyPDFLoader:
    return PyPDFLoader(
        file_path,
        mode="single",  # To return the PDF as a single document instead of extracting documents by page
    )

loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-pdf-file.pdf>",
    loader_factory=loader_factory,
)
for doc in loader.lazy_load():
    print(doc.page_content)
Migrating from LangChain Community Azure Storage Document Loaders
This section goes over the actions required to migrate from the existing community document loaders to the new Azure Blob Storage document loader:
- Depend on the langchain-azure-storage package instead of langchain-community.
- Update import statements from langchain_community.document_loaders to langchain_azure_storage.document_loaders.
- Change class names from AzureBlobStorageFileLoader and AzureBlobStorageContainerLoader to AzureBlobStorageLoader.
- Update document loader constructor calls to:
  - Use an account URL instead of a connection string.
  - Specify UnstructuredLoader as the loader_factory if you want to continue to use Unstructured for parsing documents.
- Ensure the environment has proper credentials (e.g., by running the az login command or setting up a managed identity), as the connection string would have previously contained the credentials.
The examples below show the code before and after migrating to the langchain-azure-storage package:
Before migration
from langchain_community.document_loaders import AzureBlobStorageFileLoader, AzureBlobStorageContainerLoader

file_loader = AzureBlobStorageFileLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    blob_name="<my-blob-name>",
)
container_loader = AzureBlobStorageContainerLoader(
    conn_str="<my-connection-string>",
    container="<my-container-name>",
    prefix="<prefix>",
)
After migration
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
from langchain_unstructured import UnstructuredLoader

file_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    blob_names="<my-blob-name>",
)
container_loader = AzureBlobStorageLoader(
    account_url="https://<my-storage-account-name>.blob.core.windows.net",
    container_name="<my-container-name>",
    prefix="<prefix>",
    loader_factory=UnstructuredLoader,
)
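When migrating, the account URL can usually be read off the old connection string. The helper below is a minimal sketch of that step, assuming the common "Key=Value;Key=Value" connection string layout with AccountName, EndpointSuffix, and optionally BlobEndpoint fields. account_url_from_connection_string is a hypothetical name and not part of this package, and the account name and key shown are fake.

```python
def account_url_from_connection_string(conn_str: str) -> str:
    """Derive a Blob Storage account URL from a storage connection string."""
    # Split each "Key=Value" pair on the first "=" only, since values
    # such as account keys may themselves contain "=".
    parts = dict(
        segment.split("=", 1) for segment in conn_str.split(";") if "=" in segment
    )
    # An explicit blob endpoint, when present, takes precedence.
    if "BlobEndpoint" in parts:
        return parts["BlobEndpoint"].rstrip("/")
    protocol = parts.get("DefaultEndpointsProtocol", "https")
    suffix = parts.get("EndpointSuffix", "core.windows.net")
    return f"{protocol}://{parts['AccountName']}.blob.{suffix}"


# Fake connection string, for illustration only.
conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=ZmFrZQ==;EndpointSuffix=core.windows.net"
)
account_url = account_url_from_connection_string(conn_str)
print(account_url)
```

The helper deliberately discards the account key: after migration, authentication comes from the environment's credentials or from an explicit credential argument instead.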