PySpark antipattern linter for CI/CD pipelines

pyspark-antipattern

A fast, opinionated PySpark linter that checks your code against a set of antipattern rules. It is written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.
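To make one of these patterns concrete, here is a minimal sketch of `.withColumn()` called in a loop, next to a single-projection alternative. The function names and the `_x2` suffix are made up for illustration, and the sketch assumes a PySpark DataFrame whose column names are plain identifiers holding numeric values. Whether the linter flags this exact shape depends on its rules and thresholds.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; nothing below needs it at runtime
    from pyspark.sql import DataFrame


def double_columns_in_loop(df: DataFrame, cols: Sequence[str]) -> DataFrame:
    """Antipattern: each withColumn() call adds another projection to the plan."""
    for c in cols:
        df = df.withColumn(f"{c}_x2", df[c] * 2)
    return df


def double_columns_in_one_select(df: DataFrame, cols: Sequence[str]) -> DataFrame:
    """Alternative: express all derived columns in a single projection."""
    return df.selectExpr("*", *[f"{c} * 2 AS {c}_x2" for c in cols])
```

Both functions produce the same columns. The second keeps the query plan flat regardless of how many columns are derived, which is why the loop version tends to look fine on a small local run and degrade as the column list grows.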


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
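If you drive the linter from a Python script rather than a shell step, the exit codes above can be handled as in the following sketch. The helper names `describe_exit_code` and `run_linter` are hypothetical; only the `pyspark-antipattern check <path>` command and the two exit codes come from this README.

```python
import subprocess


def describe_exit_code(code: int) -> str:
    """Map the documented exit codes to a short message."""
    if code == 0:
        return "passed: no error-level violations (warnings may have been reported)"
    if code == 1:
        return "failed: one or more error-level violations found"
    # Any other value is not documented above, so report it as-is.
    return f"unexpected exit code {code}"


def run_linter(path: str) -> int:
    """Run `pyspark-antipattern check <path>` and return its exit code."""
    completed = subprocess.run(["pyspark-antipattern", "check", path])
    return completed.returncode
```

A build script could call `run_linter("src/")`, print `describe_exit_code(...)` for the result, and pass the same code to `sys.exit` so the surrounding pipeline still fails on errors.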

Rules

Rules are organized by category in the rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

| Category | Folder | Focus |
|---|---|---|
| D - Driver | rules/driver/ | Actions that pull data to the driver node |
| F - Format | rules/format/ | Code style and DataFrame API misuse |
| L - Looping | rules/looping/ | DataFrame operations inside loops |
| P - Pandas | rules/pandas/ | Pandas interop pitfalls |
| S - Shuffle | rules/shuffle/ | Joins, partitioning, and data movement |
| U - UDF | rules/udf/ | User-defined functions and their alternatives |

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Rules listed here cause exit code 1 (default: all rules are failing)
# failing_rules = []

# Downgrade these rules from error to warning (exit code stays 0)
warning_rules = ["F008", "F011"]

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops over range(N) when N exceeds this threshold (default: 10);
#                 while-loops are always assumed to run 99 iterations
loop_threshold = 10

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
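The comment format can be illustrated with a small regular expression. This is not the linter's own parser, only a sketch of the syntax shown above; it assumes rule IDs are one uppercase letter followed by three digits, as in the examples, and the name `suppressed_rules` is hypothetical.

```python
import re

# Assumption: rule IDs look like D001 or S010 (one uppercase letter, three digits).
NOQA_PATTERN = re.compile(r"#\s*noqa:\s*pap:\s*([A-Z]\d{3}(?:\s*,\s*[A-Z]\d{3})*)")


def suppressed_rules(line: str) -> list[str]:
    """Return the rule IDs named in a `# noqa: pap: ...` comment, if any."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return []
    return [rule.strip() for rule in match.group(1).split(",")]
```

Applied to the two lines above, this returns `["D001"]` and `["S010", "S002"]` respectively, and an empty list for a line without a suppression comment.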

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawara <skander.education@proton.me>

