PySpark custom data source for Microsoft Graph APIs, with support for path and query parameters and PySpark read examples.

PySpark Microsoft Graph Source

A PySpark DataSource that reads data from the Microsoft Graph API directly into Spark DataFrames, giving access to resources such as SharePoint list items.


Features

  • Entra ID Authentication: Securely authenticate with Microsoft Graph using DefaultAzureCredential, supporting local development and production seamlessly.

  • Automatic Pagination Handling: Fetches all paginated data from Microsoft Graph without manual intervention.

  • Dynamic Schema Inference: Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.

  • Simple Configuration with .option(): Easily configure resources and query parameters directly in your Spark read options, making it flexible and intuitive.

  • Zero External Ingestion Services: No additional services like Azure Data Factory or Logic Apps are needed; data is ingested directly into Spark from Microsoft Graph.

  • Extensible Resource Providers: Add custom resource providers to support more Microsoft Graph endpoints as needed.

  • Pluggable Architecture: Dynamically load resource providers without modifying core logic.

  • Optimized for PySpark: Designed to work natively with Spark's DataFrame API for big data processing.

  • Secure by Design: Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoding sensitive data.
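
To make the pagination feature more concrete: Microsoft Graph returns each page as a JSON object whose records sit under "value", and includes an "@odata.nextLink" URL while more pages remain. The sketch below shows the general pattern of following that link until it disappears. It is not the package's actual implementation; the iter_items helper, the fetch callable, and the FAKE_PAGES data are hypothetical and stand in for real authenticated HTTP calls.

```python
from typing import Callable, Iterator, Optional

# Hypothetical stand-in for the service: maps a URL to the JSON body of one page.
FAKE_PAGES = {
    "page-1": {"value": [{"id": "1"}, {"id": "2"}], "@odata.nextLink": "page-2"},
    "page-2": {"value": [{"id": "3"}]},  # last page: no @odata.nextLink
}


def iter_items(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every record, following @odata.nextLink until it is absent."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        # Records of the current page live under the "value" key.
        yield from page.get("value", [])
        # A missing nextLink ends the loop.
        url = page.get("@odata.nextLink")


all_items = list(iter_items("page-1", FAKE_PAGES.__getitem__))
print([item["id"] for item in all_items])
```

In a real reader, fetch would issue an authenticated GET request and return the decoded JSON body.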


Installation

pip install pyspark-msgraph-source

⚡ Quickstart

1. Authentication

This package uses DefaultAzureCredential.
Ensure you're authenticated:

az login

Or set environment variables:

export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
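
If you rely on the environment-variable route, a quick pre-flight check can catch a missing variable before Spark starts. The helper below is a hypothetical convenience, not part of this package. Note that an incomplete set of variables is not necessarily an error, because DefaultAzureCredential can also fall back to other mechanisms such as an az login session.

```python
import os
from typing import List, Mapping

# The three variables shown above for service principal authentication.
SERVICE_PRINCIPAL_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of the variables that are unset or empty in `env`."""
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


missing = missing_service_principal_vars()
if missing:
    print("Not set:", ", ".join(missing), "(az login may still work)")
else:
    print("All service principal variables are set.")
```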

2. Example Usage

from pyspark.sql import SparkSession
from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = SparkSession.builder \
    .appName("MSGraphExample") \
    .getOrCreate()

# Register the custom data source so that format("msgraph") resolves.
# spark.dataSource is part of the Python Data Source API (PySpark 4.0+).
spark.dataSource.register(MSGraphDataSource)

# Read SharePoint list items; the schema is inferred from the data
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .load()

df.show()

# With an explicit schema
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .schema("id string, Title string") \
    .load()

df.show()
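
When .schema() is omitted, the source infers a schema by sampling data. As a rough illustration of what sampling-based inference can look like, the sketch below derives a Spark DDL string from a few sample records. The infer_ddl function and its type mapping are hypothetical and simplified (flat records, scalar values only); the package's own inference logic may differ.

```python
from typing import Dict, Iterable, Optional

# Simplified mapping from Python scalar types to Spark SQL type names.
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_ddl(sample: Iterable[dict]) -> str:
    """Build a DDL schema string from the keys and value types seen in `sample`."""
    columns: Dict[str, Optional[str]] = {}  # None means only nulls seen so far
    for record in sample:
        for key, value in record.items():
            columns.setdefault(key, None)
            if value is None:
                continue
            seen = TYPE_NAMES.get(type(value), "string")
            if columns[key] is None:
                columns[key] = seen
            elif columns[key] != seen:
                # Conflicting types across records: fall back to string.
                columns[key] = "string"
    return ", ".join(f"{name} {dtype or 'string'}" for name, dtype in columns.items())


sample_records = [
    {"id": "1", "Title": "First"},
    {"id": "2", "Title": None, "Done": True},
]
ddl = infer_ddl(sample_records)
print(ddl)
```

A string in this form can be passed to .schema(), as in the second read above. Column names containing spaces or other special characters would need backtick quoting, which this sketch does not handle.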

Supported Resources

Resource   | Description
list_items | SharePoint List Items
(more coming soon...)
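
For orientation, SharePoint list items are served by the Graph endpoint /sites/{site-id}/lists/{list-id}/items. The sketch below shows one plausible way the read options from the Quickstart could map onto that URL, with site-id and list-id as path parameters and the remaining options as OData query parameters. The build_list_items_url function, and the choice to prefix query options with "$", are assumptions made for illustration and not the package's confirmed behavior.

```python
from urllib.parse import quote, urlencode

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
PATH_OPTIONS = ("site-id", "list-id")  # options used as path parameters


def build_list_items_url(options: dict) -> str:
    """Map reader options to a list-items URL (hypothetical helper)."""
    missing = [name for name in PATH_OPTIONS if not options.get(name)]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))
    # Site IDs can contain commas, so keep them unescaped in the path.
    site = quote(str(options["site-id"]), safe=",")
    list_id = quote(str(options["list-id"]), safe="")
    # Every other option except "resource" becomes an OData query parameter.
    query = {
        f"${key}": str(value)
        for key, value in options.items()
        if key not in PATH_OPTIONS and key != "resource"
    }
    url = f"{GRAPH_BASE}/sites/{site}/lists/{list_id}/items"
    return url + ("?" + urlencode(query, safe="$") if query else "")


url = build_list_items_url(
    {"resource": "list_items", "site-id": "SITE", "list-id": "LIST", "top": 100, "expand": "fields"}
)
print(url)
```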

Development

Coming soon...


Troubleshooting

Issue                        | Solution
ValueError: resource missing | Add .option("resource", "list_items")
Empty dataframe              | Verify IDs, permissions, and access
Authentication failures      | Check Azure credentials and login status

📄 License

MIT License

