Declarative REST API ingestion for PySpark

Polymo

Welcome to Polymo

Polymo makes it super easy to ingest APIs with PySpark. It's like slicing cake.

My vision is that API ingestion shouldn't need heavy third-party tools or hard-to-maintain custom code. Heck, you don't even need PySpark skills.

[Screenshot: Polymo Builder UI - connector preview screen]

How does it work?

Define a config file manually or use the recommended, lightweight builder UI. Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:

from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")  # Only if the API needs one
    .load()
)

df.show()

Structured Streaming works out of the box as well:

stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .option("stream_batch_size", 100)
    .option("stream_progress_path", "/tmp/polymo-progress.json")
    .load()
)

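query = (
    stream_df.writeStream.format("memory")
    .outputMode("append")
    .queryName("polymo")
    .start()  # start the query; the memory sink exposes a temp table named "polymo"
)
query.awaitTermination(10)  # let the stream run for up to 10 seconds
spark.sql("SELECT * FROM polymo").show()  # inspect what has arrived so far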
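The `stream_progress_path` option above points at a JSON file used for durable progress tracking. The exact contents of that file are not described here, so the following is only a generic sketch of the idea in plain Python: persist a cursor after each batch so a restarted reader resumes where it left off. `SOURCE`, `load_cursor`, `save_cursor`, `read_batch`, and the `{"cursor": ...}` layout are all hypothetical and are not Polymo's actual format or API.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical source data, sorted by "updated_at", which plays the role of the cursor field.
SOURCE = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-02"},
    {"id": 3, "updated_at": "2024-01-03"},
]


def load_cursor(path: Path) -> Optional[str]:
    """Return the last saved cursor value, or None on the first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text())["cursor"]


def save_cursor(path: Path, cursor: str) -> None:
    """Persist the cursor so the next batch can resume after it."""
    path.write_text(json.dumps({"cursor": cursor}))


def read_batch(path: Path, batch_size: int) -> list:
    """Read up to `batch_size` records newer than the saved cursor, then advance it."""
    cursor = load_cursor(path)
    fresh = [r for r in SOURCE if cursor is None or r["updated_at"] > cursor]
    batch = fresh[:batch_size]
    if batch:
        save_cursor(path, batch[-1]["updated_at"])
    return batch


progress_path = Path(tempfile.mkdtemp()) / "progress.json"
first = read_batch(progress_path, batch_size=2)   # records 1 and 2
second = read_batch(progress_path, batch_size=2)  # record 3
third = read_batch(progress_path, batch_size=2)   # nothing new
print([r["id"] for r in first], [r["id"] for r in second], third)
```

Because the cursor lives in a file rather than in memory, each call behaves like a fresh process picking up from the previous batch.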

Does it perform? Polymo can read in batches, fetching pages in parallel across Spark partitions, and is therefore typically much faster than row-based solutions like UDFs.
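To make the idea of page-parallel reading concrete, here is a minimal standalone sketch in plain Python. It is not Polymo's implementation: `FAKE_API`, `fetch_page`, `plan_partitions`, and `read_partition` are hypothetical stand-ins, and a thread pool plays the role that Spark tasks would play on a cluster. It assumes the API exposes a total record count, so the page ranges can be planned up front.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil

# Hypothetical stand-in for a paginated REST API: 95 records, 10 per page.
FAKE_API = [{"id": i} for i in range(95)]
PAGE_SIZE = 10


def fetch_page(page: int) -> list:
    """Return one page of records (1-based page number), like a GET ?page=N call."""
    start = (page - 1) * PAGE_SIZE
    return FAKE_API[start:start + PAGE_SIZE]


def plan_partitions(total_records: int, page_size: int, num_partitions: int) -> list:
    """Split page numbers 1..N into contiguous ranges, at most one per partition."""
    total_pages = ceil(total_records / page_size)
    pages_per_partition = ceil(total_pages / num_partitions)
    return [
        range(first, min(first + pages_per_partition, total_pages + 1))
        for first in range(1, total_pages + 1, pages_per_partition)
    ]


def read_partition(pages: range) -> list:
    """Fetch every page assigned to one partition, in order."""
    rows = []
    for page in pages:
        rows.extend(fetch_page(page))
    return rows


partitions = plan_partitions(len(FAKE_API), PAGE_SIZE, num_partitions=4)

# Each partition is read concurrently; on a cluster this would be one Spark task each.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    results = list(pool.map(read_partition, partitions))

records = [row for part in results for row in part]
print(len(partitions), len(records))  # prints: 4 95
```

The contrast with a row-based UDF is that each worker here issues a handful of page requests for a whole slice of the data, instead of one request per row.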

It's still early days, but Polymo already supports a lot of features!

  • Various authentication options.
  • Many pagination patterns, plus automatic partition-aware reading when totals are exposed.
  • Several partitioning strategies for parallel Spark reads.
  • Incremental sync support with cursor parameters, JSON state files on local or remote storage, optional memory caching, and overrideable state keys.
  • Schema controls that auto-infer types or accept Spark SQL schemas, along with record selectors, filtering expressions, and schema-based casting for nested responses.
  • Structured Streaming compatibility with spark.readStream, tunable batch sizing, durable progress tracking, and a streaming smoke test mode.
  • Error handling through configurable retry counts, status code lists, timeout handling, and exponential backoff settings.
  • Jinja templating of query parameters, which gives you a ton of flexibility.
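The error-handling item in the list above (retry counts, status code lists, exponential backoff) can be illustrated with a generic sketch. This is not Polymo's code or its configuration format: `call_with_retries`, `RETRY_STATUSES`, and every parameter name below are hypothetical, and the fake `flaky_request` stands in for a real HTTP call.

```python
import time
from typing import Callable, Tuple

RETRY_STATUSES = {429, 500, 502, 503, 504}  # assumed list of retryable status codes


def call_with_retries(
    request: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    initial_delay: float = 0.5,
    backoff_factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` until it returns a non-retryable status or retries run out."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        status, body = request()
        if status not in RETRY_STATUSES or attempt == max_retries:
            return status, body
        sleep(delay)             # wait before the next attempt
        delay *= backoff_factor  # exponential backoff: 0.5s, 1s, 2s, ...
    raise ValueError("max_retries must be >= 0")  # only reached for negative values


# Fake endpoint: fails twice with 503, then succeeds.
responses = iter([(503, "busy"), (503, "busy"), (200, "ok")])
waits = []  # records the delays instead of actually sleeping


def flaky_request() -> Tuple[int, str]:
    return next(responses)


result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)  # prints: (200, 'ok') [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the sketch fast to run and easy to test; a real reader would simply use the default `time.sleep`.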

How to start?

Locally you probably want to install polymo with the UI:

pip install "polymo[builder]"

This also installs the dependencies the Builder UI needs, such as PySpark.

Running Polymo on an existing cluster, for instance on Databricks, doesn't require these deps. In that case, just install the bare-minimum dependencies with:

pip install polymo

Launch the builder UI

polymo builder

(Optional) Run the Builder in Docker

docker compose up --build builder

Where to Next

Read the docs here

Contributions and early feedback welcome!
