Declarative REST API ingestion for PySpark
Welcome to Polymo
Polymo makes it super easy to ingest APIs with PySpark. It's like slicing cake.
My vision is that API ingestion doesn't need heavy third-party tools or hard-to-maintain custom code. Heck, you don't even need PySpark skills.
How does it work?
Define a config file manually or use the recommended, lightweight builder UI. Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:
from pyspark.sql import SparkSession
from polymo import ApiReader
spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)
df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")  # Only if the API needs one
    .load()
)
df.show()
Structured Streaming works out of the box as well:
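stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .option("stream_batch_size", 100)
    .option("stream_progress_path", "/tmp/polymo-progress.json")
    .load()
)
# start() launches the query; the memory sink exposes it as a table named after queryName
query = stream_df.writeStream.format("memory").outputMode("append").queryName("polymo").start()
query.awaitTermination(10)  # let the stream run for up to 10 seconds
spark.sql("SELECT * FROM polymo").show()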
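To give a feel for what durable progress tracking at a path like `stream_progress_path` can look like, here is a minimal sketch that is not taken from Polymo's source. The function names and the `offset` field are hypothetical; the only idea it shows is writing a small JSON state file atomically so a restart can resume where the last batch stopped.

```python
import json
import os
import tempfile


def load_progress(path: str) -> dict:
    """Return the saved progress, or an empty dict on the first run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_progress(path: str, progress: dict) -> None:
    """Write to a temp file in the same directory, then rename it over the target.

    os.replace swaps the file in one step, so a crash mid-write
    leaves the previous progress file untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(progress, fh)
    os.replace(tmp_path, path)
```

A reader built this way would call `load_progress` before a batch and `save_progress` only after the batch has been handed over successfully.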
Does it perform?
Polymo can read pages in parallel batches, which makes it much faster than row-based solutions such as UDFs.
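The idea behind reading pages in parallel can be sketched in a few lines. This is an illustration under assumptions, not Polymo's actual implementation: it assumes the API exposes a total record count and uses page numbers starting at 1, and the function name `plan_page_partitions` is made up for this example. Each returned range could then be fetched by a separate Spark partition.

```python
import math


def plan_page_partitions(total_records: int, page_size: int, num_partitions: int) -> list[range]:
    """Split page numbers 1..N into contiguous chunks, one chunk per partition."""
    total_pages = math.ceil(total_records / page_size)
    # Never create more partitions than there are pages, and always at least one.
    num_partitions = max(1, min(num_partitions, total_pages))
    base, extra = divmod(total_pages, num_partitions)

    chunks = []
    start = 1
    for i in range(num_partitions):
        # The first `extra` partitions take one additional page each.
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# 1050 records at 100 per page is 11 pages, spread over 4 partitions.
for partition_id, pages in enumerate(plan_page_partitions(1050, 100, 4)):
    print(partition_id, list(pages))
```

Because every partition knows its page range up front, no partition has to wait for another one to finish before it can issue its requests.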
It's still early days, but Polymo already supports a lot of features!
- Various authentication options.
- Many pagination patterns, plus automatic partition-aware reading when totals are exposed.
- Several partitioning strategies for parallel Spark reads.
- Incremental sync support with cursor parameters, JSON state files on local or remote storage, optional memory caching, and overridable state keys.
- Schema controls that auto-infer types or accept Spark SQL schemas, along with record selectors, filtering expressions, and schema-based casting for nested responses.
- Structured Streaming compatibility with spark.readStream, tunable batch sizing, durable progress tracking, and a streaming smoke test mode.
- Error handling through configurable retry counts, status code lists, timeout handling, and exponential backoff settings.
- Jinja templating of query parameters, which gives you a ton of flexibility.
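As an illustration of how retry counts, status code lists, and exponential backoff typically fit together, here is a generic sketch. It does not mirror Polymo's configuration keys or internals; the parameter names (`max_retries`, `initial_delay`, `factor`, `max_delay`, `retry_statuses`) and both functions are assumptions made for this example.

```python
import time


def backoff_delays(max_retries: int, initial_delay: float = 1.0,
                   factor: float = 2.0, max_delay: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: initial_delay * factor**attempt, capped at max_delay."""
    return [min(initial_delay * factor ** attempt, max_delay) for attempt in range(max_retries)]


def call_with_retries(send, max_retries: int = 3,
                      retry_statuses=(429, 500, 502, 503, 504),
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or the retries run out.

    `send` is any zero-argument callable returning a (status_code, body) tuple.
    """
    status, body = send()
    for delay in backoff_delays(max_retries):
        if status not in retry_statuses:
            break
        sleep(delay)
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.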
How to start?
Locally you probably want to install polymo with the UI:
pip install "polymo[builder]"
This also installs the dependencies the UI needs, such as PySpark.
Running Polymo on an existing cluster, for instance on Databricks, doesn't require these dependencies. In that case, install just the bare minimum with
pip install polymo
Launch the builder UI
polymo builder
(Optional) Run the Builder in Docker
docker compose up --build builder
- The service listens on port 8000; open http://localhost:8000 once Uvicorn reports it is running.
Where to Next
Read the docs here
Contributions and early feedback welcome!