PySpark antipattern linter for CI/CD pipelines

pyspark-antipattern

A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
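
For wrapper scripts that need to react to the result, the two exit codes above can be interpreted from Python. This is a minimal sketch, not part of the package: `describe_exit_code` and `run_check` are hypothetical helper names, and it assumes the `pyspark-antipattern` executable is on the PATH. It relies only on the `check` subcommand and the exit codes documented here.

```python
import subprocess


def describe_exit_code(code: int) -> str:
    """Translate the documented exit codes into a short message."""
    if code == 0:
        return "clean: no error-level violations (warnings may be present)"
    if code == 1:
        return "failed: one or more error-level violations found"
    # Any other value is not documented above, so report it without guessing.
    return f"unexpected exit code {code}"


def run_check(path: str) -> int:
    """Run `pyspark-antipattern check <path>` and return its exit code."""
    completed = subprocess.run(["pyspark-antipattern", "check", path])
    return completed.returncode
```

A CI script could call `run_check("src/")`, print `describe_exit_code(...)` for the log, and pass the code on with `sys.exit`.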

CLI output

Default output (violations only):

[Screenshot: default behavior]

With show_information = true (inline explanation for each rule):

[Screenshot: show information]

With show_best_practice = true (best-practice guidance for each rule):

[Screenshot: show best practice]


Rules

Rules are organized by category in the rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

Prefix  Category     Folder              Focus
ARR     Array        rules/arr/          Array function antipatterns
D       Driver       rules/driver/       Actions that pull data to the driver node
F       Format       rules/format/       Code style and DataFrame API misuse
L       Looping      rules/looping/      DataFrame operations inside loops
P       Pandas       rules/pandas/       Pandas interop pitfalls
PERF    Performance  rules/performance/  Runtime performance antipatterns
S       Shuffle      rules/shuffle/      Joins, partitioning, and data movement
U       UDF          rules/udf/          User-defined functions and their alternatives
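
To find the documentation folder for a rule that fired, the category prefix of its ID can be looked up in the table above. The sketch below is illustrative only: `rule_folder` is a hypothetical helper, and it assumes every rule ID is an uppercase letter prefix followed by digits, as in the IDs that appear in this README (D001, F008, S004).

```python
import re

# Prefix -> folder, copied from the category table above.
RULE_FOLDERS = {
    "ARR": "rules/arr/",
    "D": "rules/driver/",
    "F": "rules/format/",
    "L": "rules/looping/",
    "P": "rules/pandas/",
    "PERF": "rules/performance/",
    "S": "rules/shuffle/",
    "U": "rules/udf/",
}


def rule_folder(rule_id: str) -> str:
    """Return the documentation folder for a rule ID such as "D001"."""
    # Assumption: an ID is an uppercase letter prefix followed by digits.
    match = re.fullmatch(r"([A-Z]+)(\d+)", rule_id)
    if match is None:
        raise ValueError(f"not a rule ID: {rule_id!r}")
    prefix = match.group(1)
    if prefix not in RULE_FOLDERS:
        raise ValueError(f"unknown category prefix: {prefix!r}")
    return RULE_FOLDERS[prefix]
```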

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Rules listed here cause exit code 1 (default: all rules are failing)
# failing_rules = []

# Downgrade these rules from error to warning (exit code stays 0)
warning_rules = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore_rules = ["S004"]                # silence one rule
# ignore_rules = ["F"]                 # silence all F rules
# ignore_rules = ["S", "L", "D001"]    # silence all S and L rules, plus D001

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) has N > loop_threshold (default: 10);
#                 while-loops are always treated as 99 iterations
loop_threshold = 10
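
How the three rule lists interact can be sketched in a few lines of Python. This is an illustration of the behavior described in the comments above, not the linter's actual (Rust) implementation. `matches_ignore`, `resolve_severity`, and `exit_code` are hypothetical names, and the sketch makes three assumptions: `failing_rules` is left unset (so every rule fails by default), `ignore_rules` takes precedence over `warning_rules`, and a group entry such as "S" must equal the whole letter prefix of the rule ID.

```python
import re


def matches_ignore(rule_id: str, entry: str) -> bool:
    """True if one ignore_rules entry silences rule_id."""
    if entry == rule_id:
        return True
    # Assumption: a group entry matches the full letter prefix of the ID,
    # so "S" covers S004 but "P" would not cover a PERF rule.
    prefix = re.match(r"[A-Z]+", rule_id)
    return prefix is not None and prefix.group(0) == entry


def resolve_severity(rule_id: str, config: dict) -> str:
    """Return "ignored", "warning", or "error" for a rule that fired."""
    if any(matches_ignore(rule_id, e) for e in config.get("ignore_rules", [])):
        return "ignored"
    if rule_id in config.get("warning_rules", []):
        return "warning"
    # Default documented above: every rule is failing.
    return "error"


def exit_code(fired_rules: list, config: dict) -> int:
    """1 if any fired rule resolves to an error, otherwise 0."""
    severities = (resolve_severity(r, config) for r in fired_rules)
    return 1 if "error" in severities else 0
```

With `warning_rules = ["F008", "F011"]` and `ignore_rules = ["S004"]`, a run that fires only F008 and S004 would exit with 0 under these assumptions, while a run that also fires D001 would exit with 1.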

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress a rule on that line. Separate multiple rule IDs with commas:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
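
The comment format can be read back with a small regular expression, which is handy for auditing how many suppressions a codebase has accumulated. This is a hedged sketch rather than the linter's own parser: `NOQA_PATTERN` and `suppressed_rules` are hypothetical names, and the pattern assumes the marker is written exactly as in the two examples above. It does not tokenize Python, so a matching `#` inside a string literal would also be picked up.

```python
import re

# Assumption: the marker is "# noqa: pap:" followed by comma-separated rule IDs.
NOQA_PATTERN = re.compile(r"#\s*noqa:\s*pap:\s*([A-Z0-9,\s]+)")


def suppressed_rules(line: str) -> set[str]:
    """Return the rule IDs suppressed by a `# noqa: pap:` comment on this line."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return set()
    # Split on commas and drop surrounding whitespace and empty fragments.
    return {part.strip() for part in match.group(1).split(",") if part.strip()}
```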

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawara (skander.education@proton.me)

Built distributions

Release 0.1.4 ships wheels only; no source distribution is available for this release.

  • pyspark_antipattern-0.1.4-py3-none-win_amd64.whl (1.3 MB): Python 3, Windows x86-64
  • pyspark_antipattern-0.1.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB): Python 3, manylinux (glibc 2.17+) x86-64
  • pyspark_antipattern-0.1.4-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): Python 3, manylinux (glibc 2.17+) ARM64
  • pyspark_antipattern-0.1.4-py3-none-macosx_11_0_arm64.whl (1.3 MB): Python 3, macOS 11.0+ ARM64
  • pyspark_antipattern-0.1.4-py3-none-macosx_10_12_x86_64.whl (1.4 MB): Python 3, macOS 10.12+ x86-64
