Skip to main content

PySpark antipattern linter for CI/CD pipelines

Project description

PyPI - Version Release PyPI - Python Version GitHub Issues or Pull Requests Documentation

pyspark-antipattern

A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found

CLI output

Default output — violations only:

Default behavior

With show_information = true — inline explanation for each rule:

Show information

With show_best_practice = true — best-practice guidance for each rule:

Show best practice


Rules

Full documentation is available at https://skanderboudawara.github.io/pyspark-antipattern/.

Rules are organized by category in the docs/rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

Category Folder Focus
ARR — Array docs/rules/arr/ Array function antipatterns
D — Driver docs/rules/driver/ Actions that pull data to the driver node
F — Format docs/rules/format/ Code style and DataFrame API misuse
L — Looping docs/rules/looping/ DataFrame operations inside loops
P — Pandas docs/rules/pandas/ Pandas interop pitfalls
PERF — Performance docs/rules/performance/ Runtime performance antipatterns
S — Shuffle docs/rules/shuffle/ Joins, partitioning, and data movement
U — UDF docs/rules/udf/ User-defined functions and their alternatives

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Show only these rules — everything else is silenced (default: all active)
# select = ["D001", "S"]

# Downgrade these rules from error to warning (exit code stays 0)
warn = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore = ["S004"]                # silence one rule
# ignore = ["F"]                 # silence all F rules
# ignore = ["S", "L", "D001"]    # silence all S and L rules

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# PERF003: fire when more than N shuffle ops occur without a checkpoint (default: 9)
max_shuffle_operations = 9

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) > threshold;
#                 while-loops always assume 99 iterations (default: 10)
loop_threshold = 10

# Directories to skip during recursive scanning (default: common build/venv dirs)
# exclude_dirs = ["my_generated_code", "vendor"]

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawaraskander.education@proton.me

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pyspark_antipattern-0.2.3-py3-none-win_amd64.whl (1.5 MB view details)

Uploaded Python 3Windows x86-64

pyspark_antipattern-0.2.3-py3-none-manylinux_2_28_aarch64.whl (1.5 MB view details)

Uploaded Python 3manylinux: glibc 2.28+ ARM64

pyspark_antipattern-0.2.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB view details)

Uploaded Python 3manylinux: glibc 2.17+ x86-64

pyspark_antipattern-0.2.3-py3-none-macosx_11_0_arm64.whl (1.4 MB view details)

Uploaded Python 3macOS 11.0+ ARM64

pyspark_antipattern-0.2.3-py3-none-macosx_10_12_x86_64.whl (1.5 MB view details)

Uploaded Python 3macOS 10.12+ x86-64

File details

Details for the file pyspark_antipattern-0.2.3-py3-none-win_amd64.whl.

File metadata

  • Download URL: pyspark_antipattern-0.2.3-py3-none-win_amd64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for pyspark_antipattern-0.2.3-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 b46a0a8281ca47fa1a32126b97da06acb1e181eb11ba7d7b124d06de8c3e0ead
MD5 a4aa412a05c4f87adb937ff75a13e084
BLAKE2b-256 e15a4cf192f75325d5e60ba5c436f7622f2b2e18e570a9bf81977982dc7fcad9

See more details on using hashes here.

File details

Details for the file pyspark_antipattern-0.2.3-py3-none-manylinux_2_28_aarch64.whl.

File metadata

  • Download URL: pyspark_antipattern-0.2.3-py3-none-manylinux_2_28_aarch64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: Python 3, manylinux: glibc 2.28+ ARM64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for pyspark_antipattern-0.2.3-py3-none-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 29806889686bec4ad52bcf18bbadf72384559725fd56211aedb67a96c7bdd9ad
MD5 36641806340fd6ade3fbaa0becd09570
BLAKE2b-256 c2ae1f37fc6da0ebab284327e276b1dac214e81a06af0077d06f2f19a28703d2

See more details on using hashes here.

File details

Details for the file pyspark_antipattern-0.2.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

  • Download URL: pyspark_antipattern-0.2.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: Python 3, manylinux: glibc 2.17+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for pyspark_antipattern-0.2.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 019c207899da1e0e5a275c1bd2b200fc4bfa6ec233520527d2de09b61a396dc5
MD5 2c3eb74e37ed521fa6cfc3d764c80907
BLAKE2b-256 6459d1f4417615e7ddda282bcbaf54ffa7d4b5a56ec35bb78c1d50beeadf487b

See more details on using hashes here.

File details

Details for the file pyspark_antipattern-0.2.3-py3-none-macosx_11_0_arm64.whl.

File metadata

  • Download URL: pyspark_antipattern-0.2.3-py3-none-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: Python 3, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for pyspark_antipattern-0.2.3-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ee5a097c63ee97f93d8122e6acdf517193ee9205aa281901d4a8a8a7a1c680d2
MD5 aa9dfd925edd9e177eab22a7a610034f
BLAKE2b-256 d81082cc279d1ecc7ebab1c4250e54b2926ea26b831aa28ac87ec066621722bf

See more details on using hashes here.

File details

Details for the file pyspark_antipattern-0.2.3-py3-none-macosx_10_12_x86_64.whl.

File metadata

  • Download URL: pyspark_antipattern-0.2.3-py3-none-macosx_10_12_x86_64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: Python 3, macOS 10.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for pyspark_antipattern-0.2.3-py3-none-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 dcf275107f89873423de6d37290c5647c730b7c099283591a6fc4c69f979c8de
MD5 9abbcd34caa24986fbc8affe4c8187c3
BLAKE2b-256 cd4285164f5cfbc405ab1df974e19c5d5c86a3d9ec265ba32e46fd7ae3a9248e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page