PySpark antipattern linter for CI/CD pipelines

pyspark-antipattern

A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.

This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.


Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.
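
As an illustration of one pattern named above, here is a minimal sketch contrasting .withColumn() in a loop with a single .select(). The function names are hypothetical and not part of this tool. The code only uses DataFrame methods (withColumn, select, column indexing), so it needs no imports, and df is assumed to be a PySpark DataFrame with numeric columns.

```python
def double_columns_in_loop(df, cols):
    """Antipattern: each withColumn call adds another projection to the query plan."""
    for c in cols:
        df = df.withColumn(f"{c}_x2", df[c] * 2)
    return df


def double_columns_in_select(df, cols):
    """Alternative: build every expression first, then apply them in one select."""
    doubled = [(df[c] * 2).alias(f"{c}_x2") for c in cols]
    return df.select("*", *doubled)
```

Both functions return a DataFrame with the same columns. The second builds the result in one projection, which keeps the plan small when cols is long.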


Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

  • 0 — no errors (warnings are allowed)
  • 1 — one or more error-level violations found
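
If you call the linter from a Python script rather than directly from a shell, the exit codes above can be handled as in the following sketch. The helper names run_check and describe_exit_code are hypothetical. Only the command line and the meanings of codes 0 and 1 come from this README, and any other code is simply reported as unexpected.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "no error-level violations (warnings may still have been reported)",
    1: "one or more error-level violations found",
}


def describe_exit_code(returncode: int) -> str:
    """Translate an exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(returncode, f"unexpected exit code {returncode}")


def run_check(path: str) -> int:
    """Run `pyspark-antipattern check <path>` and return its exit code."""
    completed = subprocess.run(["pyspark-antipattern", "check", path], check=False)
    return completed.returncode
```

A wrapper script could call run_check("src/"), print describe_exit_code(code), and pass the code on with sys.exit(code) so the CI job still fails on errors.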

CLI output

Default output (violations only):

[Screenshot: default behavior]

With show_information = true (inline explanation for each rule):

[Screenshot: show information]

With show_best_practice = true (best-practice guidance for each rule):

[Screenshot: show best practice]


Rules

Rules are organized by category in the docs/rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.

| Category | Folder | Focus |
|---|---|---|
| ARR - Array | docs/rules/arr/ | Array function antipatterns |
| D - Driver | docs/rules/driver/ | Actions that pull data to the driver node |
| F - Format | docs/rules/format/ | Code style and DataFrame API misuse |
| L - Looping | docs/rules/looping/ | DataFrame operations inside loops |
| P - Pandas | docs/rules/pandas/ | Pandas interop pitfalls |
| PERF - Performance | docs/rules/performance/ | Runtime performance antipatterns |
| S - Shuffle | docs/rules/shuffle/ | Joins, partitioning, and data movement |
| U - UDF | docs/rules/udf/ | User-defined functions and their alternatives |

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Rules listed here cause exit code 1 (default: all rules are failing)
# failing_rules = []

# Downgrade these rules from error to warning (exit code stays 0)
warning_rules = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore_rules = ["S004"]                # silence one rule
# ignore_rules = ["F"]                 # silence all F rules
# ignore_rules = ["S", "L", "D001"]    # silence all S and L rules, plus D001

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) exceeds this (default: 10);
#                 while-loops are always assumed to run 99 iterations
loop_threshold = 10

# Directories to skip during recursive scanning (default: common build/venv dirs)
# exclude_dirs = ["my_generated_code", "vendor"]
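
The interaction between ignore_rules, warning_rules, and the default "everything is an error" behavior can be pictured with the sketch below. It is an illustration under stated assumptions, not the tool's actual resolution logic: it assumes ignore_rules takes precedence over warning_rules, that only ignore_rules accepts group prefixes, and that a prefix must equal the whole alphabetic part of a rule ID. failing_rules is not modeled. All function names are hypothetical.

```python
def rule_group(rule_id: str) -> str:
    """Return the alphabetic prefix of a rule ID, e.g. 'S' for 'S004'."""
    return rule_id.rstrip("0123456789")


def is_ignored(rule_id: str, ignore_rules: list) -> bool:
    """An entry silences its exact rule ID or, as a bare prefix, a whole group."""
    return any(e == rule_id or e == rule_group(rule_id) for e in ignore_rules)


def severity(rule_id: str, config: dict) -> str:
    """Classify a fired rule as 'ignored', 'warning', or 'error' (the default)."""
    if is_ignored(rule_id, config.get("ignore_rules", [])):
        return "ignored"
    if rule_id in config.get("warning_rules", []):
        return "warning"
    return "error"


def exit_code(fired_rules: list, config: dict) -> int:
    """Return 1 if any fired rule is error-level, otherwise 0."""
    return 1 if any(severity(r, config) == "error" for r in fired_rules) else 0
```

For example, with warning_rules = ["F008"] and ignore_rules = ["S"], a run in which only F008 and S004 fire would yield exit code 0 under these assumptions, while a run that also hits D001 would yield 1.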

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress a rule on that line; separate multiple rule IDs with commas:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002
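
The comment format shown above can be parsed with a short regular expression. The following is a sketch for readers who want to audit suppressions in their own code base. The pattern and the function name suppressed_rules are assumptions derived from the two examples, not the tool's own parser.

```python
import re

# Matches "# noqa: pap: D001" or "# noqa: pap: S010, S002" anywhere on a line.
NOQA_PATTERN = re.compile(
    r"#\s*noqa:\s*pap:\s*(?P<rules>[A-Z]+\d+(?:\s*,\s*[A-Z]+\d+)*)"
)


def suppressed_rules(line: str) -> set:
    """Return the set of rule IDs suppressed by a noqa comment on this line."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return set()
    return {rule.strip() for rule in match.group("rules").split(",")}
```

Running this over a repository gives a quick inventory of which rules are suppressed where, which helps keep suppressions deliberate and documented.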

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.


Author

Skander Boudawara (skander.education@proton.me)

Download files

Version 0.1.5 is published as built distributions (wheels) for Python 3. No source distribution is available for this release.

  • pyspark_antipattern-0.1.5-py3-none-win_amd64.whl (1.4 MB): Windows x86-64
  • pyspark_antipattern-0.1.5-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB): manylinux, glibc 2.17+, x86-64
  • pyspark_antipattern-0.1.5-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB): manylinux, glibc 2.17+, ARM64
  • pyspark_antipattern-0.1.5-py3-none-macosx_11_0_arm64.whl (1.3 MB): macOS 11.0+, ARM64
  • pyspark_antipattern-0.1.5-py3-none-macosx_10_12_x86_64.whl (1.4 MB): macOS 10.12+, x86-64
