cylint
A PySpark linter that catches the anti-patterns costing you real money.
Static analysis for PySpark code. No Spark runtime needed. Zero dependencies. Runs anywhere Python runs.
Install
```bash
pip install cylint
```
Usage
```bash
# Lint files or directories
cy lint src/pipelines/

# JSON output for CI
cy lint --format json src/

# Only warnings and critical findings
cy lint --min-severity warning .
```
Example output:
```text
pipeline.py:47:8: CY003 [critical] .withColumn() inside a loop creates O(n²) plan complexity.
Use .select([...]) with all column expressions instead.

pipeline.py:82:4: CY001 [warning] .collect() called without filtering.
Consider .limit(N).collect(), .take(N), or using .show() for inspection.

pipeline.py:103:4: CY005 [warning] .cache() with single downstream use.
Cache is only beneficial when the same DataFrame is used in multiple actions.

Found 3 issues (1 critical, 2 warnings) in 1 file.
```
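For reference, a file with patterns like the following triggers those three findings (a hypothetical snippet; the line numbers in the sample output above are illustrative, not taken from this code):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/events/")

# CY003: .withColumn() inside a loop grows the Catalyst plan quadratically
for name in ["a", "b", "c"]:
    df = df.withColumn(name + "_scaled", F.col(name) * 2)

# CY001: .collect() without .filter() or .limit() pulls every row to the driver
rows = df.collect()

# CY005: .cache() with only one downstream action wastes executor memory
df = df.cache()
df.write.parquet("s3://bucket/out/")
```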
Rules
| Rule | Severity | What it catches |
|---|---|---|
| CY001 | warning | .collect() without .filter() or .limit() — the #1 OOM cause |
| CY002 | warning | UDF where a builtin exists (e.g. udf(lambda x: x.lower()) → F.lower()) |
| CY003 | critical | .withColumn() in a loop — creates O(n²) Catalyst plans |
| CY004 | info | SELECT * in spark.sql() strings — prevents column pruning |
| CY005 | warning | .cache() / .persist() with ≤1 downstream use — wastes memory |
| CY006 | warning | .toPandas() on unfiltered DataFrame — collects everything to driver |
| CY007 | critical | .crossJoin() or .join() without condition — cartesian product |
| CY008 | info | .repartition() before .write() — unnecessary shuffle |
| CY009 | critical | UDF in .filter()/.where() — blocks predicate pushdown |
| CY010 | warning | .join() without explicit how= — ambiguous join type |
| CY011 | warning | .withColumnRenamed()/.drop() in a loop — O(n²) plan nodes |
| CY012 | warning | .show()/.display()/.printSchema() left in production code |
| CY013 | warning | .coalesce(1) before .write() — single-executor bottleneck |
| CY014 | critical | Multiple actions without .cache() — recomputes full lineage each time |
| CY015 | critical | Non-equi .join() condition — implicit cartesian product |
| CY016 | info | Invalid escape sequence in string literal — use raw strings for regex |
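As a concrete before/after for CY002, here is a sketch (hypothetical data and column name) of replacing a Python UDF with the equivalent builtin:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ada",), ("Grace",)], ["name"])

# CY002 anti-pattern: a Python UDF ships every row out to a Python worker
lower = udf(lambda s: s.lower(), StringType())
bad = df.withColumn("name", lower(F.col("name")))

# Fix: the builtin runs inside the JVM, where Catalyst can optimize it
good = df.withColumn("name", F.lower(F.col("name")))
```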
List all rules:

```bash
cy rules
```
How it works
cylint uses Python's ast module to parse your source files and track DataFrame variables through assignment chains. It knows that anything coming from spark.read.*, spark.sql(), or spark.table() is a DataFrame, and follows method chains from there.
No type stubs. No Spark installation. No imports resolved. Just fast, heuristic analysis that catches the patterns that matter.
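A minimal sketch of that idea, assuming nothing about cylint's actual internals, using only the standard-library ast module:

```python
import ast

SOURCE = '''
df = spark.read.parquet("s3://bucket/events/")
rows = df.collect()
'''

def is_dataframe_source(node: ast.AST) -> bool:
    """True for spark.sql(...), spark.table(...), or spark.read.<fmt>(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    # spark.sql(...) / spark.table(...)
    if (isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name)
            and func.value.id == "spark" and func.attr in ("sql", "table")):
        return True
    # spark.read.parquet(...), spark.read.csv(...), ...
    return (isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Attribute)
            and func.value.attr == "read"
            and isinstance(func.value.value, ast.Name)
            and func.value.value.id == "spark")

dataframes: set[str] = set()
for node in ast.walk(ast.parse(SOURCE)):
    # Track names assigned from a DataFrame-producing call.
    if isinstance(node, ast.Assign) and is_dataframe_source(node.value):
        dataframes.update(t.id for t in node.targets if isinstance(t, ast.Name))
    # Flag .collect() on a tracked name. (A real check would also look for
    # an intervening .filter()/.limit() before reporting, as CY001 does.)
    elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
          and node.func.attr == "collect"
          and isinstance(node.func.value, ast.Name)
          and node.func.value.id in dataframes):
        print(f"line {node.lineno}: .collect() on DataFrame '{node.func.value.id}'")
```

Running this prints a CY001-style finding for line 3 of the embedded source.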
Configuration
Out of the box, every rule runs at its default severity with no exclusions. No config file needed.
If a rule doesn't apply to your codebase, or you want to skip certain directories, drop a .cylint.yml in your project root or add a [tool.cylint] section to your existing pyproject.toml. The linter picks it up automatically.
.cylint.yml
```yaml
# Only fail on warnings and above (ignore info-level findings)
min-severity: warning

rules:
  CY004: off      # we use SELECT * intentionally in dynamic queries
  CY008: warning  # promote repartition-before-write to warning

exclude:
  - tests/
  - vendor/
  - notebooks/scratch/
```
pyproject.toml
```toml
[tool.cylint]
min-severity = "warning"
exclude = ["tests/", "notebooks/scratch/"]

[tool.cylint.rules]
CY004 = "off"
CY008 = "warning"
```
CI Integration
GitHub Actions
```yaml
name: PySpark Lint
on: pull_request

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install cylint
      - run: cy lint --format github src/
```
The --format github flag outputs findings as workflow annotations — they appear inline on the PR diff.
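With that flag, each finding comes out in GitHub's workflow-command syntax, presumably along these lines (the exact message text here is illustrative):

```text
::warning file=src/pipelines/pipeline.py,line=82,col=4::CY001 .collect() called without filtering.
```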
pre-commit
```yaml
repos:
  - repo: https://github.com/clusteryield/cylint
    rev: v0.1.1  # pin to a release tag
    hooks:
      - id: spark-lint
        args: [--min-severity, warning]
```
Exit codes
| Code | Meaning |
|---|---|
| 0 | No findings |
| 1 | Warnings or info findings |
| 2 | Critical findings |
Why these rules?
Every rule targets a pattern that either causes OOM crashes, triggers unnecessary shuffles, or prevents Spark's Catalyst optimizer from doing its job. These aren't style opinions — they're the patterns you find in postmortems after a 3am page about a failed pipeline or a $40K surprise on your Databricks bill.
If you've read a "PySpark anti-patterns to avoid" blog post, you've seen these patterns described. This tool catches them automatically, before the code hits production.
License
Apache 2.0
File details
Details for the file cylint-0.1.1.tar.gz.
File metadata
- Download URL: cylint-0.1.1.tar.gz
- Upload date:
- Size: 33.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4cc8a4534d5e4c138d8b7b2856048da1ab934b44672b790e8833a9cc401ea73a |
| MD5 | 85e95f34727fa784753d17be074fb19a |
| BLAKE2b-256 | 6c15a309518d1eaf344e73f61674c9ae21b6d91dd6bfd089a37144b5c5282e51 |
File details
Details for the file cylint-0.1.1-py3-none-any.whl.
File metadata
- Download URL: cylint-0.1.1-py3-none-any.whl
- Upload date:
- Size: 37.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 266510d795c28b4cb43045c3c829b9f02d8f1a5ec5541c7643ba425721caf879 |
| MD5 | caf00605a6891cee98516dff19a16c2e |
| BLAKE2b-256 | b345a8c3f86d2aba479fbb0117af2e193cbe88e44ee32734831c5fe05636be39 |