PySpark antipattern linter for CI/CD pipelines

pyspark-antipattern

A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.
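
To make the `.withColumn()`-in-a-loop case concrete, here is a minimal sketch of the pattern next to one common alternative. The function names `add_columns_in_loop` and `add_columns_at_once` are hypothetical and chosen only for illustration; the code assumes `df` is a PySpark DataFrame and `exprs` maps new column names to Column expressions. It does not claim to show which rule ID reports this pattern.

```python
from typing import Any, Mapping


def add_columns_in_loop(df: Any, exprs: Mapping[str, Any]) -> Any:
    """The looping pattern: one withColumn() call per new column."""
    for name, expr in exprs.items():
        # Each call adds another projection to the logical plan.
        df = df.withColumn(name, expr)
    return df


def add_columns_at_once(df: Any, exprs: Mapping[str, Any]) -> Any:
    """One alternative: build every new column in a single select()."""
    new_columns = [expr.alias(name) for name, expr in exprs.items()]
    # "*" keeps the existing columns; the new ones are appended after them.
    return df.select("*", *new_columns)
```

The two functions are not exact drop-in replacements for each other: `withColumn()` overwrites an existing column with the same name, while `select("*", ...)` would produce a duplicate name, so the second form fits best when all names in `exprs` are new.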


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
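
If you call the linter from a Python build script rather than directly from the shell, the exit codes above can be handled along these lines. This is a sketch only: `run_linter` and `describe_exit_code` are hypothetical helper names, and the fallback branch for other exit codes is an assumption, since only 0 and 1 are documented here.

```python
import subprocess


def run_linter(path: str) -> int:
    """Run `pyspark-antipattern check <path>` and return its exit code."""
    completed = subprocess.run(["pyspark-antipattern", "check", path])
    return completed.returncode


def describe_exit_code(code: int) -> str:
    """Map the documented exit codes to a short message."""
    if code == 0:
        return "no error-level violations (warnings may have been reported)"
    if code == 1:
        return "one or more error-level violations found"
    # Assumption: any other value is undocumented, e.g. a crash or usage error.
    return f"unexpected exit code {code}"
```

A build script could then call `describe_exit_code(run_linter("src/"))` and pass the numeric code on to `sys.exit()` so the surrounding pipeline still fails on errors.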

Rules

Rules are organized by category in the rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

| Category | Folder | Focus |
| --- | --- | --- |
| D — Driver | rules/driver/ | Actions that pull data to the driver node |
| F — Format | rules/format/ | Code style and DataFrame API misuse |
| L — Looping | rules/looping/ | DataFrame operations inside loops |
| P — Pandas | rules/pandas/ | Pandas interop pitfalls |
| S — Shuffle | rules/shuffle/ | Joins, partitioning, and data movement |
| U — UDF | rules/udf/ | User-defined functions and their alternatives |

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Rules listed here cause exit code 1 (default: all rules are failing)
# failing_rules = []

# Downgrade these rules from error to warning (exit code stays 0)
warning_rules = ["F008", "F011"]

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops whose range(N) exceeds this threshold (default: 10);
#                 while-loops are always treated as 99 iterations
loop_threshold = 10

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to a line to suppress that rule on that line. To suppress several rules at once, separate the rule IDs with commas:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
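
The comment format can be read mechanically, which is handy if you want to audit how many suppressions a codebase has accumulated. The sketch below is an illustration of the format only, not the linter's actual parser: `NOQA_PATTERN` and `suppressed_rules` are hypothetical names, and the pattern assumes rule IDs are one uppercase letter followed by three digits, as in the examples above.

```python
import re

# Assumed shape of the documented comment: "# noqa: pap: ID[, ID...]".
NOQA_PATTERN = re.compile(
    r"#\s*noqa:\s*pap:\s*([A-Z]\d{3}(?:\s*,\s*[A-Z]\d{3})*)"
)


def suppressed_rules(line: str) -> set[str]:
    """Return the rule IDs named in a `# noqa: pap:` comment on this line."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return set()
    return {rule.strip() for rule in match.group(1).split(",")}
```

Because this works on raw text, it would also match the same characters inside a string literal; a real implementation would more likely inspect comment tokens.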

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


License

MIT — see LICENSE


Author

Skander Boudawara (skander.education@proton.me)

