# cylint

PySpark anti-pattern linter: catch the code that costs you money.

Static analysis for PySpark code. No Spark runtime needed. Zero dependencies. Runs anywhere Python runs.
## Install

```bash
pip install cylint
```
## Usage

```bash
# Lint files or directories
cy lint src/pipelines/

# JSON output for CI
cy lint --format json src/

# Only warnings and critical
cy lint --min-severity warning .
```
Example output:

```
pipeline.py:47:8: CY003 [critical] .withColumn() inside a loop creates O(n²) plan complexity.
    Use .select([...]) with all column expressions instead.

pipeline.py:82:4: CY001 [warning] .collect() called without filtering.
    Consider .limit(N).collect(), .take(N), or using .show() for inspection.

pipeline.py:103:4: CY005 [warning] .cache() with single downstream use.
    Cache is only beneficial when the same DataFrame is used in multiple actions.

Found 3 issues (1 critical, 2 warnings) in 1 file.
```
## Rules
| Rule | Severity | What it catches |
|---|---|---|
| CY001 | warning | .collect() without .filter() or .limit() — the #1 OOM cause |
| CY002 | warning | UDF where a builtin exists (e.g. udf(lambda x: x.lower()) → F.lower()) |
| CY003 | critical | .withColumn() in a loop — creates O(n²) Catalyst plans |
| CY004 | info | SELECT * in spark.sql() strings — prevents column pruning |
| CY005 | warning | .cache() / .persist() with ≤1 downstream use — wastes memory |
| CY006 | warning | .toPandas() on unfiltered DataFrame — collects everything to driver |
| CY007 | critical | .crossJoin() or .join() without condition — cartesian product |
| CY008 | info | .repartition() before .write() — unnecessary shuffle |
| CY009 | critical | UDF in .filter()/.where() — blocks predicate pushdown |
| CY010 | warning | .join() without explicit how= — ambiguous join type |
| CY011 | warning | .withColumnRenamed()/.drop() in a loop — O(n²) plan nodes |
| CY012 | warning | .show()/.display()/.printSchema() left in production code |
| CY013 | warning | .coalesce(1) before .write() — single-executor bottleneck |
| CY014 | critical | Multiple actions without .cache() — recomputes full lineage each time |
| CY015 | critical | Non-equi .join() condition — implicit cartesian product |
| CY016 | info | Invalid escape sequence in string literal — use raw strings for regex |
| CY017 | warning | Window.orderBy() without .partitionBy() — full-table sort into one partition |
| CY018 | warning | spark.read.csv()/.json() without explicit schema — double file scan |
| CY020 | warning | .count() == 0 for emptiness check — full scan wasted |
| CY025 | warning | .cache()/.persist() without .unpersist() — memory leak |
| CY031 | warning | for row in df.collect() — driver-side row iteration defeats Spark |
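To see why CY003 is rated critical, here is a minimal, self-contained illustration of the plan growth it flags. It uses a tiny stand-in class (not real PySpark, so no Spark installation is needed): each chained `.withColumn()` wraps the logical plan in one more projection node, and since Catalyst re-analyzes the whole plan on every call, n calls cost O(n²) work overall, while a single `.select()` produces one projection regardless of column count.

```python
# Illustration of the plan growth behind CY003, using a stand-in class
# instead of a real PySpark DataFrame (no Spark required).
class FakeDataFrame:
    def __init__(self, depth=0):
        self.plan_depth = depth  # number of nested projection nodes

    def withColumn(self, name, expr):
        # each call wraps the existing plan in one more projection
        return FakeDataFrame(self.plan_depth + 1)

    def select(self, exprs):
        # one projection node, however many columns it computes
        return FakeDataFrame(self.plan_depth + 1)

df = FakeDataFrame()

# Anti-pattern: one plan node per column
looped = df
for col in [f"c{i}" for i in range(100)]:
    looped = looped.withColumn(col, None)

# Preferred: a single projection for all 100 columns
selected = df.select([f"c{i}" for i in range(100)])

print(looped.plan_depth, selected.plan_depth)  # 100 vs 1
```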
List all rules:

```bash
cy rules
```
## How it works
cylint uses Python's ast module to parse your source files and track DataFrame variables through assignment chains. It knows that anything coming from spark.read.*, spark.sql(), or spark.table() is a DataFrame, and follows method chains from there.
No type stubs. No Spark installation. No imports resolved. Just fast, heuristic analysis that catches the patterns that matter.
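A hedged sketch of the kind of check this implies (not cylint's actual implementation): walk the parsed AST and flag `.withColumn()` calls that sit inside a loop body, the pattern behind rule CY003.

```python
# Sketch of an ast-based check in the spirit of CY003: flag any
# .withColumn() call nested inside a for/while loop.
import ast

def find_withcolumn_in_loops(source: str) -> list[int]:
    """Return line numbers of .withColumn() calls nested in a loop."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            # re-walk the loop body looking for method calls named withColumn
            for inner in ast.walk(node):
                if (
                    isinstance(inner, ast.Call)
                    and isinstance(inner.func, ast.Attribute)
                    and inner.func.attr == "withColumn"
                ):
                    findings.append(inner.lineno)
    return sorted(set(findings))

code = """
for c in cols:
    df = df.withColumn(c, F.lit(0))
df2 = df.withColumn("ok", F.lit(1))
"""
print(find_withcolumn_in_loops(code))  # only the call inside the loop is flagged
```

The real linter additionally tracks which variables are DataFrames; this sketch skips that and matches on the method name alone.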
## Configuration
Out of the box, every rule runs at its default severity with no exclusions. No config file needed.
If a rule doesn't apply to your codebase, or you want to skip certain directories, drop a .cylint.yml in your project root or add a [tool.cylint] section to your existing pyproject.toml. The linter picks it up automatically.
### .cylint.yml

```yaml
# Only fail on warnings and above (ignore info-level findings)
min-severity: warning

rules:
  CY004: off       # we use SELECT * intentionally in dynamic queries
  CY008: warning   # promote repartition-before-write to warning

exclude:
  - tests/
  - vendor/
  - notebooks/scratch/
```
### pyproject.toml

```toml
[tool.cylint]
min-severity = "warning"
exclude = ["tests/", "notebooks/scratch/"]

[tool.cylint.rules]
CY004 = "off"
CY008 = "warning"
```
## Inline Suppression

Suppress individual findings with `# cy:ignore` comments:

```python
df.collect()  # cy:ignore CY001

# Suppress multiple rules
df.show()  # cy:ignore CY001,CY012

# Suppress all rules on a line
df.collect()  # cy:ignore
```
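The matching semantics can be sketched in a few lines (cylint's actual parser may differ): a bare marker suppresses every rule on that line, while a comma-separated list suppresses only the named rules.

```python
# Hedged sketch of "# cy:ignore" comment matching; the real
# implementation may differ in edge cases.
import re

IGNORE_RE = re.compile(r"#\s*cy:ignore(?:\s+(?P<rules>[A-Z0-9, ]+))?\s*$")

def is_suppressed(line: str, rule: str) -> bool:
    m = IGNORE_RE.search(line)
    if not m:
        return False
    rules = m.group("rules")
    if rules is None:
        return True  # bare "# cy:ignore" suppresses all rules
    return rule in {r.strip() for r in rules.split(",")}

print(is_suppressed("df.collect()  # cy:ignore CY001", "CY001"))   # True
print(is_suppressed("df.show()  # cy:ignore CY001,CY012", "CY012"))  # True
print(is_suppressed("df.collect()  # cy:ignore", "CY031"))          # True
print(is_suppressed("df.collect()", "CY001"))                       # False
```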
## CI Integration

### GitHub Actions

```yaml
name: PySpark Lint
on: pull_request

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install cylint
      - run: cy lint --format github src/
```

The `--format github` flag outputs findings as workflow annotations that appear inline on the PR diff.
### pre-commit

```yaml
repos:
  - repo: https://github.com/clusteryield/cylint
    rev: v0.1.5  # pre-commit requires a pinned revision; use the latest release tag
    hooks:
      - id: spark-lint
        args: [--min-severity, warning]
```
## Exit codes
| Code | Meaning |
|---|---|
| 0 | No findings |
| 1 | Warnings or info findings |
| 2 | Critical findings |
## License
Apache 2.0