
Declarative REST API ingestion for PySpark


Polymo

Welcome to Polymo

Polymo makes it super easy to ingest APIs with PySpark. It's like slicing cake.

My vision is that API ingestion doesn't need heavy third-party tools or hard-to-maintain custom code. Heck, you don't even need PySpark skills.

Polymo Builder UI - connector preview screen

How does it work?

Define a config file by hand or use the recommended, lightweight Builder UI. Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:

from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")  # Only if the API needs one
    .load()
)

df.show()

Structured Streaming works out of the box as well:

stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .option("stream_batch_size", 100)
    .option("stream_progress_path", "/tmp/polymo-progress.json")
    .load()
)

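query = stream_df.writeStream.format("memory").outputMode("append").queryName("polymo").start()
query.processAllAvailable()  # block until the data available so far has been processed
spark.sql("SELECT * FROM polymo").show()  # the memory sink exposes a table named after the query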

Does it perform? Polymo reads in batches (pages are fetched in parallel), which makes it much faster than row-based solutions like UDFs.
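To make the page-level idea concrete, here is a minimal sketch in plain Python. It is not Polymo's implementation: `fetch_page` is a hypothetical stand-in that simulates a paginated endpoint, and a thread pool stands in for Spark partitions. The point is only that once the total is known, the page numbers are known up front, so whole pages can be requested concurrently instead of issuing one call per row.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "API": 95 records served in pages of 20.
RECORDS = [{"id": i} for i in range(95)]
PAGE_SIZE = 20


def fetch_page(page: int) -> list[dict]:
    """Stand-in for one HTTP request that returns a single page of records."""
    start = (page - 1) * PAGE_SIZE
    return RECORDS[start:start + PAGE_SIZE]


def read_all_pages(total_records: int, max_workers: int = 4) -> list[dict]:
    """Fetch every page concurrently and return the rows in page order."""
    page_count = math.ceil(total_records / PAGE_SIZE)
    pages = range(1, page_count + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so rows come back sorted by page.
        results = pool.map(fetch_page, pages)
        return [row for page_rows in results for row in page_rows]


rows = read_all_pages(total_records=len(RECORDS))
```

In Spark the same plan would map pages (or ranges of pages) to partitions, so each task pulls a full batch of rows per request.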

It's still early days, but Polymo already supports a lot of features!

  • Various authentication options.
  • Many pagination patterns, plus automatic partition-aware reading when totals are exposed.
  • Several partitioning strategies for parallel Spark reads.
  • Incremental sync support with cursor parameters, JSON state files on local or remote storage, optional memory caching, and overridable state keys.
  • Schema controls that auto-infer types or accept Spark SQL schemas, along with record selectors, filtering expressions, and schema-based casting for nested responses.
  • Structured Streaming compatibility with spark.readStream, tunable batch sizing, durable progress tracking, and a streaming smoke test mode.
  • Error handling through configurable retry counts, status code lists, timeout handling, and exponential backoff settings.
  • Jinja templating of query parameters for a ton of flexibility.
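Two of the features above, incremental sync with a JSON state file and retries with exponential backoff, can be sketched in plain Python as follows. This is an illustration of the general technique under stated assumptions, not Polymo's code or its option names: `fake_api`, `with_retries`, `load_cursor`, `save_cursor`, `sync`, and the `events` state key are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical data source; the first call fails to exercise the retry path.
EVENTS = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-02"},
    {"id": 3, "updated_at": "2024-01-03"},
]
calls = {"count": 0}


def fake_api(since: str):
    """Return (status, records); 503 on the first call, then records newer than `since`."""
    calls["count"] += 1
    if calls["count"] == 1:
        return 503, []
    return 200, [e for e in EVENTS if e["updated_at"] > since]


def with_retries(call, max_retries=3, initial_delay=0.01, backoff_factor=2.0,
                 retry_statuses=(429, 500, 502, 503, 504)):
    """Repeat `call()` while it returns a retryable status, doubling the wait each time."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in retry_statuses:
            return status, body
        if attempt < max_retries:
            time.sleep(delay)
            delay *= backoff_factor  # exponential backoff
    return status, body


def load_cursor(state_path: Path, key: str, default: str) -> str:
    """Read the last saved cursor value for `key`, or `default` if none exists."""
    if state_path.exists():
        return json.loads(state_path.read_text()).get(key, default)
    return default


def save_cursor(state_path: Path, key: str, value: str) -> None:
    """Persist the cursor value for `key` in a JSON state file."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    state[key] = value
    state_path.write_text(json.dumps(state))


def sync(state_path: Path, key: str = "events") -> list:
    """Fetch only records newer than the stored cursor, then advance the cursor."""
    since = load_cursor(state_path, key, default="")
    status, records = with_retries(lambda: fake_api(since))
    if status == 200 and records:
        save_cursor(state_path, key, max(r["updated_at"] for r in records))
    return records


state_file = Path(tempfile.mkdtemp()) / "state.json"
first_run = sync(state_file)   # retries once, then returns all three events
second_run = sync(state_file)  # nothing is newer than the saved cursor
```

Keeping the cursor in a small JSON file is what lets a later run resume where the previous one stopped, and a per-stream key keeps several streams from overwriting each other's state.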

How to start?

Locally, you probably want to install Polymo together with the Builder UI:

pip install "polymo[builder]"

This also installs the dependencies the UI needs, such as pyspark.

Running Polymo on an existing cluster, for instance on Databricks, doesn't require these dependencies. In that case, install just the bare minimum with:

pip install polymo

Launch the builder UI

polymo builder

(Optional) Run the Builder in Docker

docker compose up --build builder

Where to Next

Read the docs here

Contributions and early feedback welcome!

