
PySpark custom data source for Microsoft Graph APIs, with support for path and query parameters and PySpark read examples.


PySpark Microsoft Graph Source

A PySpark DataSource for reading data from the Microsoft Graph API directly into Spark DataFrames, giving access to resources such as SharePoint List Items.


Features

  • Entra ID Authentication: Securely authenticate with Microsoft Graph using DefaultAzureCredential, which works for both local development and production.

  • Automatic Pagination Handling: Fetches all paginated data from Microsoft Graph without manual intervention.

  • Dynamic Schema Inference: Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.

  • Simple Configuration with .option(): Configure resources and query parameters directly in your Spark read options.

  • Zero External Ingestion Services: No additional services like Azure Data Factory or Logic Apps are needed; data is ingested into Spark directly from Microsoft Graph.

  • Extensible Resource Providers: Add custom resource providers to support more Microsoft Graph endpoints as needed.

  • Pluggable Architecture: Resource providers are loaded dynamically, without modifying core logic.

  • Optimized for PySpark: Designed to work natively with Spark's DataFrame API for big data processing.

  • Secure by Design: Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoded sensitive data.
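
To illustrate the automatic pagination listed above: Microsoft Graph returns collections in pages, each with a `value` array and, when more data exists, an `@odata.nextLink` URL pointing at the next page. The sketch below shows how such a loop could look. It is not the package's actual implementation; `iter_graph_items`, `fetch_page`, and the `graph.example` URLs are hypothetical, and the fake pages stand in for authenticated HTTP GET calls.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_graph_items(first_url: str, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every item of a paged collection by following @odata.nextLink."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)            # hypothetical: one GET returning parsed JSON
        yield from page.get("value", [])  # items of the current page
        url = page.get("@odata.nextLink") # absent on the last page, which ends the loop


# Fake two-page response used instead of real HTTP calls.
FAKE_PAGES = {
    "https://graph.example/items": {
        "value": [{"id": "1"}, {"id": "2"}],
        "@odata.nextLink": "https://graph.example/items?page=2",
    },
    "https://graph.example/items?page=2": {"value": [{"id": "3"}]},
}

items = list(iter_graph_items("https://graph.example/items", FAKE_PAGES.__getitem__))
print([item["id"] for item in items])  # ['1', '2', '3']
```

Passing the fetch function in as a parameter keeps the loop independent of any HTTP client, which also makes it easy to test offline.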


Installation

pip install pyspark-msgraph-source

⚡ Quickstart

1. Authentication

This package uses DefaultAzureCredential.
Ensure you're authenticated:

az login

Or set environment variables:

export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
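
Before starting a Spark job, it can help to check which of these variables are set. The sketch below is an illustration only: `missing_service_principal_vars` and `get_graph_token` are hypothetical helpers that are not part of this package. The token helper assumes the `azure-identity` package is installed and imports it lazily, so the environment check runs without it.

```python
import os
from typing import List, Mapping

SERVICE_PRINCIPAL_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the service principal variables that are unset or empty."""
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


def get_graph_token() -> str:
    """Request a Microsoft Graph access token (needs azure-identity and network access)."""
    from azure.identity import DefaultAzureCredential  # lazy import, see note above

    credential = DefaultAzureCredential()
    return credential.get_token("https://graph.microsoft.com/.default").token


example_env = {"AZURE_CLIENT_ID": "abc", "AZURE_TENANT_ID": "xyz"}
print(missing_service_principal_vars(example_env))  # ['AZURE_CLIENT_SECRET']
```

If any of the three variables is missing, DefaultAzureCredential moves on to its other credential sources, such as the Azure CLI login from `az login`.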

2. Example Usage

from pyspark.sql import SparkSession

from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = (
    SparkSession.builder
    .appName("MSGraphExample")
    .getOrCreate()
)

# Register the data source before reading with format("msgraph").
spark.dataSource.register(MSGraphDataSource)

# Read SharePoint list items; the schema is inferred by sampling.
df = (
    spark.read.format("msgraph")
    .option("resource", "list_items")
    .option("site-id", "<YOUR_SITE_ID>")
    .option("list-id", "<YOUR_LIST_ID>")
    .option("top", 100)
    .option("expand", "fields")
    .load()
)

df.show()

# With an explicit schema, which replaces inference.
df = (
    spark.read.format("msgraph")
    .option("resource", "list_items")
    .option("site-id", "<YOUR_SITE_ID>")
    .option("list-id", "<YOUR_LIST_ID>")
    .option("top", 100)
    .option("expand", "fields")
    .schema("id string, Title string")
    .load()
)

df.show()
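
When no schema is passed, the source infers one by sampling data. The sketch below shows one way sampling-based inference could produce a DDL string like the one given to `.schema(...)`. It is an assumption-laden illustration, not the package's actual logic: `infer_ddl_schema` is a hypothetical helper, it handles only flat records, and it falls back to `string` for unknown or conflicting types.

```python
from typing import Dict, Iterable, Mapping

# Assumed mapping from Python value types to Spark SQL type names.
_SPARK_TYPES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_ddl_schema(sample: Iterable[Mapping[str, object]]) -> str:
    """Build a Spark DDL schema string from a sample of flat records."""
    columns: Dict[str, str] = {}
    for record in sample:
        for name, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            spark_type = _SPARK_TYPES.get(type(value), "string")
            previous = columns.setdefault(name, spark_type)
            if previous != spark_type:
                columns[name] = "string"  # conflicting types: fall back to string
    return ", ".join(f"{name} {spark_type}" for name, spark_type in columns.items())


sample_rows = [
    {"id": "1", "Title": "First", "Count": 3},
    {"id": "2", "Title": None, "Count": 4, "Done": True},
]
print(infer_ddl_schema(sample_rows))  # id string, Title string, Count bigint, Done boolean
```

A column that is null in every sampled record never appears in the result, which is one reason to pass an explicit schema when the sample may not be representative.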

Supported Resources

Resource      Description
list_items    SharePoint List Items

(more coming soon...)
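
As a rough picture of how the `list_items` read options could translate into a request, the sketch below builds a URL for the Microsoft Graph list items endpoint (`/sites/{site-id}/lists/{list-id}/items`). It rests on an assumption about the package: that `site-id` and `list-id` become path parameters and every other option becomes an OData query parameter prefixed with `$`. `build_list_items_url` is a hypothetical helper, not part of the package API.

```python
from typing import Mapping
from urllib.parse import quote, urlencode

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
PATH_OPTIONS = ("site-id", "list-id")  # assumed to be path parameters


def build_list_items_url(options: Mapping[str, object]) -> str:
    """Build a list items URL from Spark-style read options."""
    missing = [name for name in PATH_OPTIONS if name not in options]
    if missing:
        raise ValueError(f"missing required option(s): {', '.join(missing)}")
    # Composite site IDs contain commas, so commas are left unescaped.
    path = "/sites/{}/lists/{}/items".format(
        quote(str(options["site-id"]), safe=","),
        quote(str(options["list-id"]), safe=","),
    )
    # Assumption: remaining options map to OData query parameters ($top, $expand, ...).
    query = {
        f"${name}": str(value)
        for name, value in options.items()
        if name not in PATH_OPTIONS and name != "resource"
    }
    url = GRAPH_BASE + path
    if query:
        url += "?" + urlencode(query, safe="$")
    return url


print(build_list_items_url({
    "resource": "list_items",
    "site-id": "site123",
    "list-id": "list456",
    "top": 100,
    "expand": "fields",
}))
# https://graph.microsoft.com/v1.0/sites/site123/lists/list456/items?$top=100&$expand=fields
```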

Development

Coming soon...


Troubleshooting

Issue                           Solution
ValueError: resource missing    Add .option("resource", "list_items")
Empty dataframe                 Verify IDs, permissions, and access
Authentication failures         Check Azure credentials and login status
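
Some of these issues can be caught before the read runs. The sketch below is a hypothetical pre-flight check, not something the package provides: `check_read_options` and `SUPPORTED_RESOURCES` are illustrative names. It assumes `list_items` requires `site-id` and `list-id`, as in the Quickstart example.

```python
from typing import List, Mapping

# Assumed required options per resource, based on the Quickstart example.
SUPPORTED_RESOURCES = {"list_items": ("site-id", "list-id")}


def check_read_options(options: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the read options (empty if none)."""
    resource = options.get("resource")
    if not resource:
        return ['add .option("resource", "list_items")']
    if resource not in SUPPORTED_RESOURCES:
        return [f"unsupported resource: {resource!r}"]
    problems = []
    for name in SUPPORTED_RESOURCES[resource]:
        value = options.get(name, "")
        if not value:
            problems.append(f"missing option: {name}")
        elif value.startswith("<") and value.endswith(">"):
            # A template value such as <YOUR_SITE_ID> was never filled in.
            problems.append(f"placeholder left in option: {name}")
    return problems


print(check_read_options({"resource": "list_items", "site-id": "<YOUR_SITE_ID>"}))
# ['placeholder left in option: site-id', 'missing option: list-id']
```

An empty result only means the options look complete; permissions and access still have to be verified against the tenant.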

📄 License

MIT License

