A static analysis linter for PySpark — catches performance antipatterns before they reach your cluster

pyspark-antipattern

A static analysis linter for PySpark — catch performance antipatterns before they reach your cluster.

Written in Rust, installable as a Python package, and designed to run in CI/CD pipelines. 60+ rules across 8 categories cover driver actions, shuffle explosions, UDFs, loops, and more.


What it catches

Real antipatterns, caught at commit time:

| Code | Rule | Why it matters |
|---|---|---|
| df.collect() | D001 | Pulls all data to driver; OOM risk on large datasets |
| for c in cols: df.withColumn(...) | L003 | Each call adds a projection; plan explodes exponentially |
| array_distinct(collect_list(x)) | ARR001 | Use collect_set(x): one step instead of two |
| df.rdd.collect() | PERF001 | Use .toPandas(): 10x faster with Arrow enabled |
| df.join(other) | S011 | No condition = Cartesian product |
| @udf returning StringType | U001 | Built-in string functions are orders of magnitude faster |
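
As a rough illustration of what catching a pattern statically can look like, here is a toy sketch built on Python's standard `ast` module. It is not how pyspark-antipattern is implemented (the tool itself is written in Rust); the function name `find_collect_calls` and the printed message are hypothetical, and only the D001 rule ID comes from the table above.

```python
import ast


def find_collect_calls(source: str) -> list[tuple[int, int]]:
    """Return (line, column) pairs for every zero-argument `.collect()` call.

    Toy heuristic: any method call named `collect` is reported, whether or
    not the receiver is really a DataFrame (so `df.rdd.collect()` matches too).
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "collect"
            and not node.args
            and not node.keywords
        ):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)


sample = "rows = df.collect()\ncount = df.count()\n"
for line, col in find_collect_calls(sample):
    print(f"D001 candidate at line {line}, column {col}")
```

The source is only parsed, never executed, which is why this kind of check can run at commit time without a Spark session.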

Why this exists

PySpark is easy to misuse. Calling .collect() on a 10 GB DataFrame, calling .withColumn() in a loop, or writing UDFs where built-in functions exist all work fine locally and silently destroy performance at scale. This tool catches these patterns early, at commit time, before they reach your cluster.


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
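
If the linter is driven from a Python script rather than a shell, the exit codes above can be interpreted along these lines. This is a minimal sketch: the helper names `build_check_command`, `describe_exit_code`, and `run_check` are hypothetical, and it assumes the `pyspark-antipattern` executable is on `PATH`. Only the `check` subcommand and the `--config` flag shown earlier are used.

```python
import subprocess


def build_check_command(path, config=None):
    """Build the argv list for `pyspark-antipattern check`."""
    cmd = ["pyspark-antipattern", "check", path]
    if config is not None:
        cmd += ["--config", config]
    return cmd


def describe_exit_code(code):
    """Map the documented exit codes to a short status string."""
    if code == 0:
        return "passed"      # no errors; warnings may still have been printed
    if code == 1:
        return "failed"      # at least one error-level violation
    return "unexpected"      # any other code is not documented above


def run_check(path, config=None):
    """Run the linter and return (status, stdout followed by stderr)."""
    proc = subprocess.run(
        build_check_command(path, config),
        capture_output=True,
        text=True,
    )
    return describe_exit_code(proc.returncode), proc.stdout + proc.stderr
```

Calling `run_check("src/")` would then give a status string that a wrapper script can branch on.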

CLI output

Default output — violations only:

[Screenshot: default behavior]

Each violation line includes a colored severity badge — [HIGH] in red, [MEDIUM] in yellow, [LOW] in green — immediately after the rule ID:

error[D001][HIGH]: Avoid using collect()
  --> pipeline.py:42:10
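
For tooling that post-processes this output, the two-line shape shown above can be parsed with a couple of regular expressions. The sketch below assumes the output looks exactly like the sample (a header line followed by a `-->` location line) and that color escape codes are absent; the `Violation` type and `parse_violations` function are hypothetical names, not part of the tool.

```python
import re
from typing import NamedTuple

# level[RULE][SEVERITY]: message
HEADER = re.compile(r"^(\w+)\[([A-Z]+\d+)\]\[([A-Z]+)\]: (.*)$")
#   --> path:line:column
LOCATION = re.compile(r"^\s*--> (.+):(\d+):(\d+)$")


class Violation(NamedTuple):
    level: str
    rule: str
    severity: str
    message: str
    path: str
    line: int
    column: int


def parse_violations(text):
    """Pair each header line with the location line that follows it."""
    violations = []
    pending = None
    for raw in text.splitlines():
        header = HEADER.match(raw)
        if header:
            pending = header.groups()
            continue
        location = LOCATION.match(raw)
        if location and pending:
            path, line, col = location.groups()
            violations.append(Violation(*pending, path, int(line), int(col)))
            pending = None
    return violations
```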

Filter by your cluster's PySpark version to suppress rules for newer APIs:

pyspark-antipattern check src/ --pyspark-version=3.3  # suppress rules requiring 3.4+

Filter by severity directly from the CLI:

pyspark-antipattern check src/ --severity=high    # only HIGH violations
pyspark-antipattern check src/ --severity=medium  # MEDIUM and HIGH
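
The two filters can be read as simple threshold comparisons. The following sketch only illustrates the semantics described above (`medium` keeps MEDIUM and HIGH; a cluster on 3.3 suppresses rules that require 3.4+). The function names are hypothetical, and how the tool actually stores each rule's minimum version is an assumption.

```python
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def meets_severity(violation_severity, threshold=None):
    """True if the violation is at or above the requested severity."""
    if threshold is None:          # no filter: everything is reported
        return True
    return (
        SEVERITY_ORDER[violation_severity.lower()]
        >= SEVERITY_ORDER[threshold.lower()]
    )


def parse_version(text):
    """Turn '3.4' into (3, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def rule_applies(required_version, cluster_version=None):
    """True if a rule needing `required_version` is usable on the cluster."""
    if cluster_version is None:    # no filter: all rules stay active
        return True
    return parse_version(required_version) <= parse_version(cluster_version)
```

Comparing integer tuples rather than strings avoids ordering "3.10" before "3.9".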

With show_information = true — inline explanation for each rule:

[Screenshot: show information]

With show_best_practice = true — best-practice guidance for each rule:

[Screenshot: show best practice]


Rules

Full documentation is available at https://skanderboudawara.github.io/pyspark-antipattern/.

Rules are organized by category in the docs/rules/ folder. Each rule has its own markdown file with a full explanation, best-practice guidance, and a severity badge indicating its performance impact.

| Category | Folder | Focus |
|---|---|---|
| ARR (Array) | docs/rules/arr/ | Array function antipatterns |
| D (Driver) | docs/rules/driver/ | Actions that pull data to the driver node |
| F (Format) | docs/rules/format/ | Code style and DataFrame API misuse |
| L (Looping) | docs/rules/looping/ | DataFrame operations inside loops |
| P (Pandas) | docs/rules/pandas/ | Pandas interop pitfalls |
| PERF (Performance) | docs/rules/performance/ | Runtime performance antipatterns |
| S (Shuffle) | docs/rules/shuffle/ | Joins, partitioning, and data movement |
| U (UDF) | docs/rules/udf/ | User-defined functions and their alternatives |

Each rule carries a severity reflecting its performance impact:

| Severity | Meaning |
|---|---|
| 🔴 HIGH | Major performance impact: OOM risk, full scans, shuffle explosion |
| 🟡 MEDIUM | Moderate performance impact: avoidable overhead at scale |
| 🟢 LOW | Minor impact: style, API correctness, small inefficiencies |

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Show only these rules — everything else is silenced (default: all active)
# select = ["D001", "S"]

# Cluster PySpark version — silences rules requiring a newer version (default: all)
# pyspark_version = "3.3"     # suppress rules that require PySpark 3.4+

# Downgrade these rules from error to warning (exit code stays 0)
warn = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore = ["S004"]                # silence one rule
# ignore = ["F"]                 # silence all F rules
# ignore = ["S", "L", "D001"]    # silence all S and L rules, plus D001

# Only report violations at or above this performance-impact level (default: all)
# severity = "medium"            # show only MEDIUM and HIGH violations
# severity = "high"              # show only HIGH violations

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# PERF003: fire when more than N shuffle ops occur without a checkpoint (default: 9)
max_shuffle_operations = 9

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) > threshold;
#                 while-loops always assume 99 iterations (default: 10)
loop_threshold = 10

# Directories to skip during recursive scanning (default: common build/venv dirs)
# exclude_dirs = ["my_generated_code", "vendor"]
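
One way to picture how `select`, `ignore`, and `warn` combine is the sketch below. It is an illustration under stated assumptions, not the tool's logic: the precedence (select first, then ignore, then warn) and the acceptance of group prefixes in `warn` are guesses, and all function names are hypothetical.

```python
import re


def rule_group(rule_id):
    """Return the alphabetic prefix of a rule ID, e.g. 'S' for 'S004'."""
    return re.match(r"[A-Z]+", rule_id).group(0)


def matches(rule_id, entries):
    """True if rule_id is listed exactly or via its group prefix."""
    return rule_id in entries or rule_group(rule_id) in entries


def effective_level(rule_id, select=None, warn=(), ignore=()):
    """Return 'error', 'warning', or None (silenced) for a rule."""
    if select is not None and not matches(rule_id, select):
        return None                # not selected: silenced
    if matches(rule_id, ignore):
        return None                # explicitly ignored
    if matches(rule_id, warn):
        return "warning"           # downgraded, does not affect exit code
    return "error"                 # default: every rule is a hard error


def exit_code(levels):
    """1 if any reported violation is error-level, else 0."""
    return 1 if "error" in levels else 0
```

With the sample config above, `effective_level("F008", warn=["F008", "F011"], ignore=["S004"])` gives `"warning"`, so a file that only triggers F008 would still exit with code 0.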

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
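
The comment format above is regular enough to be extracted with a single pattern. The sketch below is a hypothetical helper (`suppressed_rules`) that mirrors the two examples; the tool's exact matching rules, such as tolerance for extra whitespace, are assumptions. It works on raw text, so a `#` inside a string literal would fool it.

```python
import re

# "# noqa: pap:" followed by one or more comma-separated rule IDs
NOQA = re.compile(r"#\s*noqa:\s*pap:\s*([A-Z]+\d+(?:\s*,\s*[A-Z]+\d+)*)")


def suppressed_rules(line):
    """Return the set of rule IDs suppressed on this source line."""
    match = NOQA.search(line)
    if not match:
        return set()
    return {rule.strip() for rule in match.group(1).split(",")}
```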

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawara (skander.education@proton.me)
