llama-index readers microsoft_sharepoint integration
Microsoft SharePoint Reader
pip install llama-index-readers-microsoft-sharepoint
The loader loads files from a folder in a SharePoint site, or loads SharePoint Site Pages.
It can also traverse sub-folders recursively.
Prerequisites
App Authentication using Microsoft Entra ID (formerly Azure AD)
- Create an App Registration in Microsoft Entra ID (see Microsoft's app registration documentation).
- API Permissions for the created app:
  - Microsoft Graph → Application Permissions → Sites.Read.All (Grant Admin Consent): allows access to all sites in the tenant
  - OR Microsoft Graph → Application Permissions → Sites.Selected (Grant Admin Consent): allows access only to the specific sites you select and grant permissions for
  - Microsoft Graph → Application Permissions → Files.Read.All (Grant Admin Consent)
  - Microsoft Graph → Application Permissions → BrowserSiteLists.Read.All (Grant Admin Consent)
Note: If you use Sites.Selected, you must grant your app access to the specific SharePoint site(s) via the SharePoint admin center. See "Grant access to a specific site" for details.
For more information, see the Microsoft Graph API documentation.
Usage
To use this loader, you need the client_id, client_secret, and tenant_id of the app registered in the Microsoft Azure Portal.
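Hard-coding the three credentials is easy to get wrong, so one option is to read them from environment variables. The sketch below is an illustration only: the helper `credentials_from_env` and the variable names `SHAREPOINT_CLIENT_ID`, `SHAREPOINT_CLIENT_SECRET`, and `SHAREPOINT_TENANT_ID` are assumptions made here, not part of this package.

```python
import os

# Hypothetical environment variable names; use whatever your deployment defines.
ENV_VARS = {
    "client_id": "SHAREPOINT_CLIENT_ID",
    "client_secret": "SHAREPOINT_CLIENT_SECRET",
    "tenant_id": "SHAREPOINT_TENANT_ID",
}


def credentials_from_env(environ=None):
    """Return the three credentials as keyword arguments for the reader."""
    environ = os.environ if environ is None else environ
    # Collect every missing or empty variable so the error names them all at once.
    missing = [var for var in ENV_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {arg: environ[var] for arg, var in ENV_VARS.items()}
```

The resulting dictionary can then be unpacked into the constructor, for example `SharePointReader(**credentials_from_env())`.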
Loading Files from SharePoint Drive
This loader loads the files present in a specific folder in SharePoint.
If the files are in the Test folder under the root directory of the SharePoint site, pass Test as sharepoint_folder_path.
from llama_index.readers.microsoft_sharepoint import SharePointReader

loader = SharePointReader(
    client_id="<Client ID of the app>",
    client_secret="<Client Secret of the app>",
    tenant_id="<Tenant ID of the Microsoft Azure Directory>",
)

documents = loader.load_data(
    sharepoint_site_name="<Sharepoint Site Name>",
    sharepoint_folder_path="<Folder Path>",
    recursive=True,
)
Using Sites.Selected Permission
If you have only been granted access to a specific site (using Sites.Selected), you can use the site host name and relative URL instead of the site name:
from llama_index.readers.microsoft_sharepoint import SharePointReader

loader = SharePointReader(
    client_id="<Client ID of the app>",
    client_secret="<Client Secret of the app>",
    tenant_id="<Tenant ID of the Microsoft Azure Directory>",
    sharepoint_host_name="contoso.sharepoint.com",
    sharepoint_relative_url="sites/YourSiteName",
)

documents = loader.load_data(
    sharepoint_folder_path="<Folder Path>",
    recursive=True,
)
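If you only have the full site URL, the host name and relative URL can be derived from it with the standard library. This is a minimal sketch: `split_site_url` is a hypothetical helper, not part of this package, and it strips the leading slash to match the `sites/YourSiteName` form used in the example above.

```python
from urllib.parse import urlparse


def split_site_url(site_url: str) -> tuple[str, str]:
    """Split a full site URL into (host name, relative URL without outer slashes)."""
    parsed = urlparse(site_url)
    if not parsed.netloc:
        raise ValueError(f"Not an absolute URL: {site_url!r}")
    return parsed.netloc, parsed.path.strip("/")


host_name, relative_url = split_site_url(
    "https://contoso.sharepoint.com/sites/YourSiteName"
)
```

The two values can then be passed as `sharepoint_host_name` and `sharepoint_relative_url`.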
Loading SharePoint Site Pages
You can also load SharePoint Site Pages as documents by setting sharepoint_type to PAGE:
from llama_index.readers.microsoft_sharepoint import (
    SharePointReader,
    SharePointType,
)

loader = SharePointReader(
    client_id="<Client ID of the app>",
    client_secret="<Client Secret of the app>",
    tenant_id="<Tenant ID of the Microsoft Azure Directory>",
    sharepoint_site_name="<Sharepoint Site Name>",
    sharepoint_host_name="<your-tenant>.sharepoint.com",
    sharepoint_relative_url="/sites/<YourSite>",
    sharepoint_type=SharePointType.PAGE,
)

# Load all pages
documents = loader.load_data()

# Or load a specific page by ID
loader.sharepoint_file_id = "<page_id>"
documents = loader.load_data()
Filtering Pages with Callbacks
You can filter which pages to process using the process_document_callback:
from llama_index.readers.microsoft_sharepoint import (
    SharePointReader,
    SharePointType,
)


def page_filter(page_name: str) -> bool:
    # Only process pages that don't start with "Draft"
    return not page_name.startswith("Draft")


loader = SharePointReader(
    client_id="<Client ID>",
    client_secret="<Client Secret>",
    tenant_id="<Tenant ID>",
    sharepoint_site_name="<Site Name>",
    sharepoint_type=SharePointType.PAGE,
    process_document_callback=page_filter,
)
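When several naming rules apply, the callback can be built by a small factory. The sketch below assumes, as in the example above, that the callback receives the page name as a string and returns True for pages that should be processed. `make_prefix_filter` is a hypothetical helper, and unlike the example above it matches prefixes case-insensitively.

```python
from typing import Callable, Iterable


def make_prefix_filter(excluded_prefixes: Iterable[str]) -> Callable[[str], bool]:
    """Build a callback that rejects page names starting with any given prefix."""
    # Lowercase once so matching is case-insensitive (a choice made for this sketch).
    prefixes = tuple(prefix.lower() for prefix in excluded_prefixes)

    def page_filter(page_name: str) -> bool:
        # str.startswith accepts a tuple and checks every prefix in it.
        return not page_name.lower().startswith(prefixes)

    return page_filter


skip_drafts_and_archive = make_prefix_filter(["Draft", "Archive"])
```

The returned function can be passed as `process_document_callback` in place of `page_filter`.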
Error Handling
Control error behavior with fail_on_error:
from llama_index.readers.microsoft_sharepoint import SharePointReader

loader = SharePointReader(
    client_id="<Client ID>",
    client_secret="<Client Secret>",
    tenant_id="<Tenant ID>",
    fail_on_error=False,  # Log errors and continue instead of raising
)
Instrumentation Events
The SharePoint reader emits events during page processing for monitoring:
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.readers.microsoft_sharepoint import (
    TotalPagesToProcessEvent,
    PageDataFetchCompletedEvent,
    PageFailedEvent,
)


class SharePointEventHandler(BaseEventHandler):
    def handle(self, event):
        if isinstance(event, TotalPagesToProcessEvent):
            print(f"Processing {event.total_pages} pages...")
        elif isinstance(event, PageDataFetchCompletedEvent):
            print(f"Completed: {event.page_id}")
        elif isinstance(event, PageFailedEvent):
            print(f"Failed: {event.page_id} - {event.error}")


dispatcher = get_dispatcher("llama_index.readers.microsoft_sharepoint.base")
dispatcher.add_event_handler(SharePointEventHandler())
Available events:
- TotalPagesToProcessEvent: Total number of pages to process
- PageDataFetchStartedEvent: Page processing started
- PageDataFetchCompletedEvent: Page successfully processed
- PageSkippedEvent: Page skipped (via callback)
- PageFailedEvent: Page processing failed
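For a run summary rather than per-event output, the events can be tallied by class name. This is a sketch under stated assumptions: `PageEventTally` is a hypothetical helper that does not import llama-index, and it assumes `PageFailedEvent` carries a `page_id` attribute as in the handler example above.

```python
from collections import Counter


class PageEventTally:
    """Count events by class name and remember which pages failed."""

    def __init__(self) -> None:
        self.counts = Counter()
        self.failed_page_ids = []

    def record(self, event) -> None:
        name = type(event).__name__
        self.counts[name] += 1
        if name == "PageFailedEvent":
            # page_id is assumed to exist on failure events; fall back to None.
            self.failed_page_ids.append(getattr(event, "page_id", None))

    def summary(self) -> str:
        return (
            f"{self.counts['PageDataFetchCompletedEvent']} completed, "
            f"{self.counts['PageSkippedEvent']} skipped, "
            f"{self.counts['PageFailedEvent']} failed"
        )
```

To use it, call `tally.record(event)` from the `handle` method of a `BaseEventHandler` subclass and print `tally.summary()` once loading finishes.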