PySpark antipattern linter for CI/CD pipelines

pyspark-antipattern

A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.
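
To make one of these patterns concrete, here is a minimal sketch contrasting .withColumn() in a loop with a single projection. The function names and the _x2 column suffix are illustrative assumptions and are not part of this tool. Both functions accept any PySpark DataFrame and rely only on DataFrame.withColumn, DataFrame.selectExpr, and column indexing.

```python
def double_columns_in_loop(df, columns):
    """Antipattern: one .withColumn() call per column.

    Every call adds another projection to the logical plan, so the plan
    grows with the number of columns being added.
    """
    for name in columns:
        df = df.withColumn(f"{name}_x2", df[name] * 2)
    return df


def double_columns_in_one_select(df, columns):
    """Alternative: build all expressions first, then project once."""
    expressions = [f"`{name}` * 2 AS `{name}_x2`" for name in columns]
    # "*" keeps the existing columns; the new ones are appended in one step.
    return df.selectExpr("*", *expressions)
```

Both versions should produce the same columns. The second keeps the plan to a single projection regardless of how many columns are added.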


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
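
If you call the linter from a Python build script rather than directly from a shell, the exit codes above can be handled as in the following sketch. The helper names describe_exit_code and run_linter are hypothetical; only the check subcommand and the two exit codes come from this README.

```python
import subprocess


def describe_exit_code(code: int) -> str:
    """Translate the documented exit codes into a short message."""
    if code == 0:
        return "no errors (warnings may still have been reported)"
    if code == 1:
        return "one or more error-level violations found"
    # Any other value is not documented here, so report it verbatim.
    return f"undocumented exit code {code}"


def run_linter(target: str) -> int:
    """Run `pyspark-antipattern check <target>` and return its exit code."""
    completed = subprocess.run(["pyspark-antipattern", "check", target])
    return completed.returncode
```

A build script could then call run_linter("src/"), print describe_exit_code(...) for the result, and exit with the same code so the pipeline status matches the linter's.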

Rules

Rules are organized by category in the rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

Category     Folder          Focus
D - Driver   rules/driver/   Actions that pull data to the driver node
F - Format   rules/format/   Code style and DataFrame API misuse
L - Looping  rules/looping/  DataFrame operations inside loops
P - Pandas   rules/pandas/   Pandas interop pitfalls
S - Shuffle  rules/shuffle/  Joins, partitioning, and data movement
U - UDF      rules/udf/      User-defined functions and their alternatives

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Rules listed here cause exit code 1 (default: all rules are failing)
# failing_rules = []

# Downgrade these rules from error to warning (exit code stays 0)
warning_rules = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore_rules = ["S004"]        # silence one rule
# ignore_rules = ["F"]         # silence all F rules
# ignore_rules = ["S", "L"]    # silence all S and L rules

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) > threshold;
#                 while-loops always assume 99 iterations (default: 10)
loop_threshold = 10
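
As a rough mental model of how the rule lists above interact, the sketch below resolves a rule ID to a severity and derives the exit code. It is an illustration, not the tool's implementation (which is written in Rust). Two things are assumptions: that ignore_rules takes precedence over warning_rules, and that warning_rules matches exact IDs only. The failing_rules option is left out because its interaction with the other lists is not described here.

```python
def is_ignored(rule_id: str, ignore_rules: list) -> bool:
    """ignore_rules accepts exact rule IDs or single-letter group prefixes."""
    for selector in ignore_rules:
        if selector == rule_id:
            return True
        if len(selector) == 1 and rule_id.startswith(selector):
            return True
    return False


def severity(rule_id: str, config: dict) -> str:
    """Return "ignore", "warning", or "error" for one rule.

    Assumed precedence: ignore_rules first, then warning_rules,
    otherwise the default of error.
    """
    if is_ignored(rule_id, config.get("ignore_rules", [])):
        return "ignore"
    if rule_id in config.get("warning_rules", []):
        return "warning"
    return "error"


def exit_code(fired_rules: list, config: dict) -> int:
    """1 if any fired rule resolves to error, otherwise 0."""
    return 1 if any(severity(r, config) == "error" for r in fired_rules) else 0
```

With the example configuration above, F008 resolves to a warning, S004 is silenced, and any other rule such as D001 remains an error.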

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
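
The comment format can be read mechanically, which helps if you want to audit suppressions in a codebase. The following is a minimal sketch of such a scan, not the linter's own parser. It assumes rule IDs are one uppercase letter followed by digits, as in the examples above, and the name suppressed_rules is hypothetical.

```python
import re

# Matches "# noqa: pap: D001" or "# noqa: pap: S010, S002".
# Note: this naive pattern would also match the text inside a string literal.
NOQA_PATTERN = re.compile(
    r"#\s*noqa:\s*pap:\s*(?P<rules>[A-Z]\d+(?:\s*,\s*[A-Z]\d+)*)"
)


def suppressed_rules(line: str) -> set:
    """Return the set of rule IDs suppressed on a single source line."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return set()
    return {rule.strip() for rule in match.group("rules").split(",")}
```

Running this over each line of a file gives a quick inventory of which rules are being suppressed and where.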

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawara (skander.education@proton.me)

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • pyspark_antipattern-0.1.3-py3-none-win_amd64.whl (1.3 MB): Python 3, Windows x86-64
  • pyspark_antipattern-0.1.3-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB): Python 3, manylinux (glibc 2.17+), x86-64
  • pyspark_antipattern-0.1.3-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): Python 3, manylinux (glibc 2.17+), ARM64
  • pyspark_antipattern-0.1.3-py3-none-macosx_11_0_arm64.whl (1.3 MB): Python 3, macOS 11.0+, ARM64
  • pyspark_antipattern-0.1.3-py3-none-macosx_10_12_x86_64.whl (1.4 MB): Python 3, macOS 10.12+, x86-64
