
Declarative REST API ingestion for PySpark


Polymo

Welcome to Polymo

Polymo makes it super easy to ingest APIs with PySpark. It's a piece of cake.

My vision is that API ingestion doesn't need heavy third-party tools or hard-to-maintain custom code. Heck, you don't even need PySpark skills.

Figure: Polymo Builder UI, connector preview screen

How does it work?

Define a config file manually or use the recommended, lightweight builder UI. Once you are happy with your config, all you need to do is register the Polymo reader and tell Spark where to find the config:

from pyspark.sql import SparkSession
from polymo import ApiReader

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ApiReader)

df = (
    spark.read.format("polymo")
    .option("config_path", "./config.yml")  # YAML you saved from the Builder
    .option("token", "YOUR_TOKEN")  # Only if the API needs one
    .load()
)

df.show()

Structured Streaming works out of the box as well:

stream_df = (
    spark.readStream.format("polymo")
    .option("config_path", "./config.yml")
    .option("stream_batch_size", 100)
    .option("stream_progress_path", "/tmp/polymo-progress.json")
    .load()
)

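query = (
    stream_df.writeStream.format("memory")
    .outputMode("append")
    .queryName("polymo")
    .start()  # start() launches the stream and returns a StreamingQuery
)
query.awaitTermination(10)  # let the stream run for up to 10 seconds
spark.sql("SELECT * FROM polymo").show()  # the memory sink is queryable under the query name
query.stop()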
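
The `stream_progress_path` option above points at a JSON file that records how far the stream has read. As a rough illustration of what durable progress tracking can look like, here is a minimal sketch using only the standard library. The function names and the `{"offset": ...}` layout are assumptions made for this example; Polymo's actual file format and internals may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def save_progress(path: Path, offset: int) -> None:
    """Write the offset to a temp file, then swap it into place in one step."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"offset": offset}, handle)  # hypothetical layout
    os.replace(tmp_name, path)  # overwrite the old file with the finished one


def load_progress(path: Path) -> int:
    """Return the last saved offset, or 0 when no progress file exists yet."""
    if not path.exists():
        return 0
    return json.loads(path.read_text(encoding="utf-8"))["offset"]
```

Writing to a temporary file first means a crash mid-write leaves the previous progress file untouched, so a restarted stream can resume from the last complete offset.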

Does it perform? Polymo reads in batches, fetching pages in parallel, so it is much faster than row-based solutions such as UDFs.
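
To make the idea of reading pages in parallel concrete, here is a small sketch that fetches several pages concurrently and flattens the results. It is an illustration only: `FAKE_API`, `fetch_page`, and `fetch_all` are made-up names, the data is invented, and a thread pool stands in for the Spark partitions that would do this work on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a paginated API: 3 pages of 2 records each (made-up data).
FAKE_API = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 3}, {"id": 4}],
    3: [{"id": 5}, {"id": 6}],
}


def fetch_page(page: int) -> list[dict]:
    """Hypothetical page fetcher; a real one would issue an HTTP request."""
    return FAKE_API[page]


def fetch_all(pages: list[int], workers: int = 3) -> list[dict]:
    """Fetch pages concurrently and flatten them, preserving page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_page, pages)  # map yields results in input order
        return [record for page in results for record in page]


records = fetch_all([1, 2, 3])
```

Each page is one unit of work, so the total wall-clock time approaches that of the slowest page rather than the sum of all pages.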

It's still early days, but Polymo already supports a lot of features!

  • Various authentication options.
  • Many pagination patterns, plus automatic partition-aware reading when the API exposes totals.
  • Several partitioning strategies for parallel Spark reads.
  • Incremental sync support with cursor parameters, JSON state files on local or remote storage, optional memory caching, and overridable state keys.
  • Schema controls that auto-infer types or accept Spark SQL schemas, along with record selectors, filtering expressions, and schema-based casting for nested responses.
  • Structured Streaming compatibility with spark.readStream, tunable batch sizing, durable progress tracking, and a streaming smoke test mode.
  • Error handling through configurable retry counts, status code lists, timeout handling, and exponential backoff settings.
  • Jinja templating of query parameters for extra flexibility.
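
For readers unfamiliar with the error-handling terms in the list above, the following sketch shows the general pattern of retrying on a list of status codes with exponential backoff. It is not Polymo's implementation: `call_with_retries` and all of its parameters are hypothetical names chosen for this example, and the request itself is simulated.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=0.5, factor=2.0,
                      retry_statuses=(429, 500, 502, 503), sleep=time.sleep):
    """Call func() until it returns a non-retryable status or retries run out.

    func is a hypothetical callable returning (status_code, body).
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = func()
        if status not in retry_statuses or attempt == max_retries:
            return status, body
        sleep(delay)     # wait before the next attempt
        delay *= factor  # each wait is `factor` times longer than the last


# Simulated API that fails twice before succeeding; waits are recorded, not slept.
responses = iter([(503, "busy"), (429, "slow down"), (200, "ok")])
waits = []
result = call_with_retries(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; here the recorded waits double from one attempt to the next.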

How to start?

Locally, you probably want to install Polymo with the UI:

pip install "polymo[builder]"

This also installs the UI dependencies, such as PySpark.

Running Polymo on an existing cluster (in Databricks, for instance) doesn't require these dependencies. In that case, just install the bare minimum with

pip install polymo

Launch the builder UI

polymo builder

(Optional) Run the Builder in Docker

docker compose up --build builder

Where to Next

Read the docs here

Contributions and early feedback welcome!
