PySpark antipattern linter for CI/CD pipelines
pyspark-antipattern
A fast, opinionated PySpark linter that challenges your code against antipattern rules — written in Rust, installable as a Python package, and designed to run in CI/CD pipelines.
This linter is intentionally strict. It will flag patterns that are technically valid Python but known to cause performance, scalability, or maintainability problems in PySpark. Every violation is a conversation starter, not necessarily a hard blocker — it is up to you to decide whether to fix it, downgrade it to a warning, or suppress it for a specific line. The goal is to make the trade-offs visible before they become production incidents.
Why this exists
PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.
Installation
pip install pyspark-antipattern
Usage
Check a single file:
pyspark-antipattern check pipeline.py
Check an entire directory recursively:
pyspark-antipattern check src/
Use a custom config location:
pyspark-antipattern check src/ --config path/to/pyproject.toml
Exit codes
- 0: no errors (warnings are allowed)
- 1: one or more error-level violations found
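If you drive the linter from a Python build script rather than directly from the shell, the two exit codes above can be interpreted as in the sketch below. It uses only the `pyspark-antipattern check <path>` command shown in this README. The helper names `describe_exit_code` and `run_linter` are hypothetical and are not part of the package.

```python
import subprocess

# Exit-code meanings as documented in this README.
EXIT_MEANINGS = {
    0: "no errors (warnings are allowed)",
    1: "one or more error-level violations found",
}


def describe_exit_code(code: int) -> str:
    """Map a linter exit code to a human-readable message."""
    # Codes other than 0 and 1 are not documented here, so they are
    # reported as unexpected instead of being guessed at.
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_linter(path: str) -> int:
    """Run `pyspark-antipattern check <path>` and return its exit code.

    Assumes the `pyspark-antipattern` executable is on PATH.
    """
    completed = subprocess.run(["pyspark-antipattern", "check", path])
    return completed.returncode
```

A build script could then call `code = run_linter("src/")`, print `describe_exit_code(code)`, and pass `code` to `sys.exit` so the surrounding pipeline sees the same result as a direct CLI call.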
CLI output
- Default output: violations only.
- With show_information = true: an inline explanation for each rule.
- With show_best_practice = true: best-practice guidance for each rule.
Rules
Full documentation is available at https://skanderboudawara.github.io/pyspark-antipattern/.
Rules are organized by category in the docs/rules/ folder. Each rule has its own markdown file with a full explanation and best-practice guidance.
| Category | Folder | Focus |
|---|---|---|
| ARR (Array) | docs/rules/arr/ | Array function antipatterns |
| D (Driver) | docs/rules/driver/ | Actions that pull data to the driver node |
| F (Format) | docs/rules/format/ | Code style and DataFrame API misuse |
| L (Looping) | docs/rules/looping/ | DataFrame operations inside loops |
| P (Pandas) | docs/rules/pandas/ | Pandas interop pitfalls |
| PERF (Performance) | docs/rules/performance/ | Runtime performance antipatterns |
| S (Shuffle) | docs/rules/shuffle/ | Joins, partitioning, and data movement |
| U (UDF) | docs/rules/udf/ | User-defined functions and their alternatives |
Configuration
Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:
[tool.pyspark-antipattern]
# Show only these rules — everything else is silenced (default: all active)
# select = ["D001", "S"]
# Downgrade these rules from error to warning (exit code stays 0)
warn = ["F008", "F011"]
# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore = ["S004"] # silence one rule
# ignore = ["F"] # silence all F rules
# ignore = ["S", "L", "D001"] # silence all S and L rules, plus D001
# Show inline explanation for each rule that fired (default: false)
show_information = false
# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false
# PERF003: fire when more than N shuffle ops occur without a checkpoint (default: 9)
max_shuffle_operations = 9
# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5
# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3
# L001/L002/L003: flag for-loops where range(N) > threshold;
# while-loops always assume 99 iterations (default: 10)
loop_threshold = 10
# Directories to skip during recursive scanning (default: common build/venv dirs)
# exclude_dirs = ["my_generated_code", "vendor"]
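To make the interaction between select, warn, and ignore concrete, here is a small Python model of how a rule ID might be classified. This is a sketch under stated assumptions, not the tool's actual logic (the linter itself is written in Rust). It assumes that a group entry matches a rule when it equals the rule's leading letters, that ignore takes precedence over warn, and that an unset select leaves every rule active. The function names are hypothetical.

```python
import re
from collections.abc import Collection
from typing import Optional


def rule_group(rule_id: str) -> str:
    """Return the leading letters of a rule ID, e.g. 'S004' -> 'S'."""
    match = re.match(r"[A-Z]+", rule_id)
    return match.group(0) if match else ""


def matches(rule_id: str, entries: Collection[str]) -> bool:
    """True if the rule is listed exactly or its group prefix is listed."""
    return rule_id in entries or rule_group(rule_id) in entries


def severity(
    rule_id: str,
    select: Optional[Collection[str]] = None,
    warn: Collection[str] = (),
    ignore: Collection[str] = (),
) -> str:
    """Classify a rule as 'ignored', 'warning', or 'error'.

    ASSUMPTION: the precedence below (select, then ignore, then warn)
    is illustrative only; the README does not document it.
    """
    if select is not None and not matches(rule_id, select):
        return "ignored"  # not selected, so silenced
    if matches(rule_id, ignore):
        return "ignored"
    if matches(rule_id, warn):
        return "warning"  # reported, but the exit code stays 0
    return "error"  # the default: every rule is a hard error
```

With the sample configuration above, `severity("F008", warn=["F008", "F011"], ignore=["S004"])` gives `"warning"`, while any rule not mentioned falls through to `"error"`.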
Suppressing a specific line
Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:
result = df.collect() # noqa: pap: D001
bad_join = df.crossJoin(other) # noqa: pap: S010, S002
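When auditing a codebase it can help to list which rules have been suppressed and where. The sketch below extracts rule IDs from a suppression comment in the format shown above. It is a simple regex-based illustration, not the linter's own parser: it assumes rule IDs are uppercase letters followed by digits, and it does not distinguish real comments from text inside string literals. The name `suppressed_rules` is hypothetical.

```python
import re

# Matches "# noqa: pap: D001" or "# noqa: pap: S010, S002" anywhere in a line.
# Assumes rule IDs look like uppercase letters followed by digits.
NOQA_PATTERN = re.compile(
    r"#\s*noqa:\s*pap:\s*([A-Z]+\d+(?:\s*,\s*[A-Z]+\d+)*)"
)


def suppressed_rules(line: str) -> set[str]:
    """Return the set of rule IDs suppressed on a single source line."""
    match = NOQA_PATTERN.search(line)
    if match is None:
        return set()
    # Split the comma-separated list and drop surrounding whitespace.
    return {rule.strip() for rule in match.group(1).split(",")}
```

Running this over each line of a file yields a per-line record of suppressions, which can be reviewed alongside the documented reason for each one.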
CI/CD integration
GitHub Actions
- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/
The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.
Pre-commit hook
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]
A word on strictness
This linter will challenge code that your team may have written deliberately and knowingly. That is by design.
Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.
The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.
Author
Skander Boudawara — skander.education@proton.me