A static performance linter that detects slow Pandas anti-patterns before they reach production.
pdperf — Pandas Performance Optimizer
A static linter that catches silent Pandas performance killers before they ship to production.
pdperf scans your Python code for common Pandas anti-patterns that work correctly but are often 10–100× slower at scale than necessary. It's local-first, deterministic, and CI-friendly — no code execution required.
📑 Table of Contents
- Why pdperf?
- Quick Start
- CI-Friendly Guarantees
- Rules Reference
- Detailed Rule Examples
- CLI Reference
- How pdperf Works — Technical Deep-Dive
- Integrations
- License
🎯 Why pdperf?
Pandas makes it easy to write code that works but scales poorly:
# This works... but is painfully slow on large datasets
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']
# pdperf catches this and suggests:
# 💡 Use vectorized: (df['price'] * df['quantity']).sum()
These issues often start in notebooks and quietly move into ETL pipelines. pdperf catches them before production.
⚡ Quick Start
Installation
# PyPI (coming soon)
# pip install pdperf
# Install from source
git clone https://github.com/adwantg/pdperf.git
cd pdperf
pip install -e .
# Or with dev dependencies
pip install -e ".[dev]"
Basic Usage
# Scan a file or directory
pdperf scan your_code.py
pdperf scan src/
# List all available rules
pdperf rules
# Get detailed explanation for a rule
pdperf explain PPO003
Example Output
📄 etl/transform.py
⚠️ 45:12 [PPO001] Avoid df.iterrows() or df.itertuples() in loops; prefer vectorized operations.
💡 Use vectorized column operations like df['a'] + df['b'], or np.where(), merge(), map(), groupby().agg().
❌ 67:8 [PPO003] Building DataFrame via append/concat in a loop is O(n²); accumulate in a list first.
💡 Collect DataFrames in a list, then call pd.concat(frames, ignore_index=True) once after the loop.
📄 features/pipeline.py
⚠️ 23:15 [PPO002] Row-wise df.apply(axis=1) is slow; prefer vectorized operations.
💡 Replace with df['x'] + df['y'], np.where(condition, a, b), Series.map(), or merge().
✅ CI-Friendly Guarantees
- No code execution: pdperf parses code using AST only — safe on any codebase
- Deterministic output: stable ordering by path → line → col → rule_id
- Schema-versioned JSON: schema_version field for tooling stability
- Pattern-based detection: doesn't require import resolution or import pandas as pd
Exit Codes
| Code | Meaning |
|---|---|
| 0 | No findings (or --fail-on none) |
| 1 | Findings at/above --fail-on threshold |
| 2 | Tool error (invalid args, parse error with --fail-on-parse-error) |
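The exit-code table can be read as a small decision function. The sketch below is only an illustration of that mapping: the function name, the severity ranking, and the assumption that findings below the threshold yield 0 are hypothetical, not taken from pdperf's source.

```python
# Hypothetical sketch of the exit-code table; not pdperf's actual implementation.
SEVERITY_RANK = {"warn": 1, "error": 2}


def exit_code(severities, fail_on="error", tool_error=False):
    """Map finding severities and a --fail-on threshold to a process exit code."""
    if tool_error:
        return 2  # invalid args, or a parse error with --fail-on-parse-error
    if fail_on == "none":
        return 0  # findings never fail the run
    threshold = SEVERITY_RANK[fail_on]
    hit = any(SEVERITY_RANK[s] >= threshold for s in severities)
    return 1 if hit else 0


# A warn-only run passes under the default threshold but fails under --fail-on warn.
print(exit_code(["warn"], fail_on="error"))
print(exit_code(["warn"], fail_on="warn"))
```

Keeping tool errors on a separate code lets a CI job distinguish "the code has problems" from "the linter could not run".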
JSON Output Schema
{
"schema_version": "1.0",
"tool": "pdperf",
"tool_version": "0.1.0",
"total_findings": 3,
"findings": [
{
"rule_id": "PPO001",
"path": "src/etl.py",
"line": 45,
"col": 12,
"severity": "warn",
"message": "Avoid df.iterrows()...",
"suggested_fix": "Use vectorized..."
}
]
}
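A downstream script might consume this report roughly as follows. This is a minimal sketch that assumes only the field names shown in the schema above; the sample values and the major-version check are illustrative choices, not documented pdperf behavior.

```python
import json
from collections import Counter

# Sample report shaped like the schema above (values are illustrative).
report_text = """
{
  "schema_version": "1.0",
  "tool": "pdperf",
  "tool_version": "0.1.0",
  "total_findings": 2,
  "findings": [
    {"rule_id": "PPO003", "path": "src/etl.py", "line": 67, "col": 8,
     "severity": "error", "message": "...", "suggested_fix": "..."},
    {"rule_id": "PPO001", "path": "src/etl.py", "line": 45, "col": 12,
     "severity": "warn", "message": "...", "suggested_fix": "..."}
  ]
}
"""

report = json.loads(report_text)

# Check the schema version before trusting the field names (assumed policy).
major = report["schema_version"].split(".")[0]
if major != "1":
    raise SystemExit(f"unsupported schema_version: {report['schema_version']}")

# Summarize findings for a CI log or dashboard.
by_severity = Counter(f["severity"] for f in report["findings"])
by_rule = Counter(f["rule_id"] for f in report["findings"])
print(dict(by_severity))
print(dict(by_rule))
```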
📋 Rules Reference
pdperf includes 8 rules targeting the most impactful Pandas performance anti-patterns:
| Rule | Name | Severity | Patchable | Confidence |
|---|---|---|---|---|
| PPO001 | iterrows/itertuples loop | ⚠️ WARN | — | High |
| PPO002 | apply(axis=1) row-wise | ⚠️ WARN | — | High |
| PPO003 | concat/append in loop | ❌ ERROR | — | High |
| PPO004 | chained indexing | ❌ ERROR | 🔧 | High |
| PPO005 | index churn in loop | ⚠️ WARN | — | High |
| PPO006 | .values → .to_numpy() | ⚠️ WARN | 🔧 | High |
| PPO007 | groupby().apply() | ⚠️ WARN | — | Medium |
| PPO008 | string ops in loop | ⚠️ WARN | — | Medium |
Legend:
- 🔧 = Auto-fixable with --patch
- — = Not auto-fixable
- High confidence: Structural AST pattern match (precise)
- Medium confidence: Heuristic-based detection (see rule details for boundaries)
Note: pdperf is import-agnostic by design. In rare cases, non-pandas objects with similar attribute or method names (e.g., .values) may be flagged. Use --ignore or --select to control rules.
📖 Detailed Rule Examples
PPO001: iterrows/itertuples Loop
What it catches:
# ❌ SLOW: Python loop with iterrows
for idx, row in df.iterrows():
    result.append(row['a'] * row['b'])
# ❌ SLOW: itertuples is faster but still not ideal
for row in df.itertuples():
    result.append(row.a * row.b)
Why it's slow:
- Each row iteration invokes the Python interpreter
- iterrows() creates a Series object per row (expensive!)
- No vectorization benefits from NumPy's C backend
The fix:
# ✅ FAST: Vectorized operation
result = df['a'] * df['b']
# ✅ FAST: Use numpy for complex operations
result = np.where(df['a'] > 0, df['a'] * df['b'], 0)
PPO002: apply(axis=1) Row-wise Operations
What it catches:
# ❌ SLOW: Row-wise apply with lambda
df['total'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)
# ❌ SLOW: Row-wise apply with custom function
df['category'] = df.apply(categorize_row, axis=1)
Why it's slow:
- axis=1 processes one row at a time
- Python function call overhead for each row
The fix:
# ✅ FAST: Direct vectorized arithmetic
df['total'] = df['price'] * df['qty']
# ✅ FAST: Use np.where for conditionals
df['category'] = np.where(df['value'] > 100, 'high', 'low')
# ✅ FAST: Use np.select for multiple conditions
conditions = [df['value'] > 100, df['value'] > 50]
choices = ['high', 'medium']
df['category'] = np.select(conditions, choices, default='low')
# ✅ FAST: Use map for lookups
df['category'] = df['key'].map(category_mapping)
PPO003: concat/append in Loop (O(n²) Pattern)
What it catches:
# ❌ EXTREMELY SLOW: O(n²) complexity!
df = pd.DataFrame()
for file in files:
    chunk = pd.read_csv(file)
    df = pd.concat([df, chunk])  # Copies entire df each time!
# ❌ DEPRECATED AND SLOW: df.append (removed in pandas 2.0)
for item in items:
    df = df.append({'col': item}, ignore_index=True)
Why it's catastrophic: Each concat copies all existing data. With n equal-sized chunks, the work adds up to 1 + 2 + 3 + ... + n = n(n+1)/2 chunk copies, which is O(n²).
⚠️ Note: DataFrame.append() was deprecated in pandas 1.4.0 and removed in 2.0. See pandas docs.
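To make the quadratic growth concrete, the following sketch counts how many rows get copied under each strategy. It is plain arithmetic, not a benchmark: the helper names are made up for this illustration, and it assumes each concat copies every row of its result.

```python
def rows_copied_incremental(chunk_sizes):
    """Rows copied when concatenating onto a growing frame every iteration."""
    total_copied = 0
    current_rows = 0
    for size in chunk_sizes:
        current_rows += size          # size of the frame produced by this concat
        total_copied += current_rows  # assumed: concat copies every row of its result
    return total_copied


def rows_copied_once(chunk_sizes):
    """Rows copied by a single concat over the collected list."""
    return sum(chunk_sizes)


chunks = [1_000] * 100  # 100 files with 1,000 rows each
print(rows_copied_incremental(chunks))  # 1000 * (1 + 2 + ... + 100)
print(rows_copied_once(chunks))
```

Under these assumptions the loop version copies about 50 times more rows for 100 chunks, and the gap widens linearly as the number of chunks grows.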
The fix:
# ✅ FAST: Collect in list, concat once (O(n))
frames = []
for file in files:
    chunk = pd.read_csv(file)
    frames.append(chunk)
df = pd.concat(frames, ignore_index=True)
# ✅ EVEN FASTER: List comprehension
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
PPO004: Chained Indexing Assignment
What it catches:
# ❌ DANGEROUS: May silently fail!
df[df['a'] > 0]['b'] = 10
# ❌ DANGEROUS: Same pattern with variable
mask = df['a'] > 0
df[mask]['b'] = 10
Why it's dangerous:
- df[mask] might return a copy (unpredictable!)
- ['b'] = 10 assigns to the copy, not the original
- Your data update is silently lost
Pandas warns with SettingWithCopyWarning, but warnings are often ignored. See Real Python's explanation.
The fix:
# ✅ CORRECT: Use .loc for safe assignment
df.loc[df['a'] > 0, 'b'] = 10
# ✅ CORRECT: With named mask
mask = df['a'] > 0
df.loc[mask, 'b'] = 10
PPO005: Index Churn in Loop
What it catches:
# ❌ WASTEFUL: Rebuilds index every iteration
for key in keys:
    df = df.reset_index()
    df = df.set_index('col')
    # ... process ...
Why it matters:
- reset_index() and set_index() create new DataFrame copies
- Index operations inside loops multiply the overhead
The fix:
# ✅ BETTER: Set index once, outside loop
df = df.set_index('col')
for key in keys:
    pass  # ... process without index changes ...
PPO006: .values → .to_numpy()
What it catches:
# ❌ DISCOURAGED: Inconsistent return type
arr = df.values
arr = df['col'].values
Why it matters:
- .values sometimes returns a NumPy array, sometimes an ExtensionArray
- Behavior depends on DataFrame dtypes
- .to_numpy() is explicit and always returns a NumPy array
📝 Note: Ruff rule PD011 (from pandas-vet) also flags this pattern.
The fix:
# ✅ RECOMMENDED: Explicit conversion
arr = df.to_numpy()
arr = df['col'].to_numpy()
# With explicit dtype
arr = df.to_numpy(dtype='float64', copy=False)
PPO007: Unoptimized groupby().apply()
What it catches:
# ❌ SLOW: Custom function invoked per group
result = df.groupby('category').apply(lambda g: g['value'].sum())
Why it's slow:
- apply() invokes Python for each group
- Loses vectorization benefits
The fix:
# ✅ FAST: Built-in aggregation
result = df.groupby('category')['value'].sum()
# ✅ FAST: Multiple aggregations with agg()
result = df.groupby('category').agg({
    'value': ['sum', 'mean'],
    'quantity': 'count'
})
# ✅ FAST: Named aggregations (pandas 0.25+)
result = df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean')
)
Detection boundary: PPO007 flags any groupby(...).apply(...) call. This is a heuristic; some apply() uses are unavoidable. Use --ignore PPO007 if you have legitimate use cases.
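The boundary described above could be approximated with a few lines of AST matching. The sketch below is an assumption about how such a check might look, not pdperf's actual code: find_groupby_apply is a hypothetical helper that flags any call to .apply whose receiver is itself a .groupby(...) call.

```python
import ast


def find_groupby_apply(source):
    """Return (line, col) for every `<expr>.groupby(...).apply(...)` call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr != "apply":
            continue
        receiver = node.func.value  # the expression that .apply is called on
        if (
            isinstance(receiver, ast.Call)
            and isinstance(receiver.func, ast.Attribute)
            and receiver.func.attr == "groupby"
        ):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)


sample = (
    "a = df.groupby('k').apply(f)\n"     # flagged
    "b = df.groupby('k')['v'].sum()\n"   # built-in aggregation: not flagged
    "c = df.apply(f, axis=1)\n"          # no groupby receiver: not flagged
)
print(find_groupby_apply(sample))
```

Because the match is purely structural, it cannot tell whether the function passed to apply() has a vectorized equivalent, which is why the rule carries Medium confidence.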
PPO008: String Operations in Loop
What it catches:
# ❌ SLOW: String processing in loop
for idx, row in df.iterrows():
    df.at[idx, 'name'] = row['name'].lower()
Why it's slow:
- Python string methods called one at a time
- Combined with iterrows overhead
The fix:
# ✅ FAST: Vectorized string operations
df['name'] = df['name'].str.lower()
df['clean'] = df['text'].str.strip().str.replace('  ', ' ', regex=False)  # collapse double spaces
Detection boundary: PPO008 only flags string methods (.lower(), .strip(), etc.) called on subscript expressions (e.g., row['col']) inside loops. It does not flag .str accessor usage.
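A check with this boundary might be sketched as below. The visitor class, the method set, and the loop-depth counter are all hypothetical stand-ins chosen for illustration; pdperf's real rule may match a different set of methods.

```python
import ast

# Illustrative subset of string methods; the real rule's list is not documented here.
STRING_METHODS = {"lower", "upper", "strip", "replace", "split"}


class StringInLoopVisitor(ast.NodeVisitor):
    """Flag `x[...].method()` calls for string methods, but only inside loops."""

    def __init__(self):
        self.loop_depth = 0
        self.hits = []

    def _visit_loop(self, node):
        self.loop_depth += 1
        self.generic_visit(node)
        self.loop_depth -= 1

    visit_For = visit_While = _visit_loop

    def visit_Call(self, node):
        func = node.func
        if (
            self.loop_depth > 0
            and isinstance(func, ast.Attribute)
            and func.attr in STRING_METHODS
            and isinstance(func.value, ast.Subscript)  # e.g. row['name']
        ):
            self.hits.append(node.lineno)
        self.generic_visit(node)


sample = (
    "for idx, row in df.iterrows():\n"
    "    df.at[idx, 'name'] = row['name'].lower()\n"  # flagged
    "df['name'] = df['name'].str.lower()\n"           # .str accessor: not flagged
    "x = row['name'].strip()\n"                       # outside a loop: not flagged
)
visitor = StringInLoopVisitor()
visitor.visit(ast.parse(sample))
print(visitor.hits)
```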
🛠️ CLI Reference
Commands
pdperf scan <path> # Scan files for anti-patterns
pdperf rules # List all rules
pdperf explain <RULE_ID> # Explain a specific rule in detail
Scan Options
| Option | Description | Default |
|---|---|---|
| --format | Output format: text, json, sarif | text |
| --out | Write output to file | stdout |
| --select | Only check these rules (comma-separated) | all |
| --ignore | Skip these rules (comma-separated) | none |
| --severity-threshold | Minimum severity: warn, error | warn |
| --fail-on | Exit 1 threshold: warn, error, none | error |
| --fail-on-parse-error | Exit 2 if any files have syntax errors | false |
| --patch | Generate unified diff for auto-fixable rules | — |
Examples
# Quick check of a single file
pdperf scan etl/transform.py
# Full project scan with JSON output for CI
pdperf scan src/ --format json --out reports/pdperf.json --fail-on error
# Generate SARIF for GitHub Security integration
pdperf scan . --format sarif --out results.sarif
# Focus on critical issues only
pdperf scan . --severity-threshold error --select PPO003,PPO004
# Generate auto-fix patch
pdperf scan . --patch out/fixes.diff
⚙️ Configuration (Planned)
pdperf will support configuration via pyproject.toml:
[tool.pdperf]
select = ["PPO001", "PPO002", "PPO003", "PPO004", "PPO005"]
ignore = ["PPO006"]
severity_threshold = "warn"
fail_on = "error"
format = "json"
🔬 How pdperf Works — Technical Deep-Dive
This section explains the internals of pdperf for curious developers. Whether you're a beginner or an expert, you'll understand exactly how we detect performance anti-patterns.
The Big Picture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Your Code │ ──▶ │ AST Parser │ ──▶ │ Visitors │ ──▶ │ Findings │
│ (.py) │ │ (Python) │ │ (Rules) │ │ (Report) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
In simple terms: pdperf reads your Python code, converts it into a tree structure, walks through that tree looking for patterns that indicate slow code, and reports what it finds.
Step 1: Abstract Syntax Tree (AST) Parsing
What is an AST?
When Python reads your code, it doesn't see text — it sees a tree of instructions. This tree is called an Abstract Syntax Tree (AST).
Example code:
for idx, row in df.iterrows():
    total += row['value']
What Python sees (simplified AST):
For
├── target: Tuple(idx, row)
├── iter: Call
│ └── func: Attribute
│ ├── value: Name(df)
│ └── attr: "iterrows"
└── body: [AugAssign...]
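The simplified tree above can be reproduced with the standard library. This short sketch parses the same loop and inspects the nodes that matter for detection; the exact text printed by ast.dump varies between Python versions, so treat it as exploratory output.

```python
import ast

source = (
    "for idx, row in df.iterrows():\n"
    "    total += row['value']\n"
)
tree = ast.parse(source)
loop = tree.body[0]

print(type(loop).__name__)          # the statement node for the loop
print(type(loop.target).__name__)   # the `idx, row` target
print(type(loop.iter).__name__)     # what the loop iterates over
print(loop.iter.func.attr)          # method name on that call
print(ast.dump(loop, indent=2))     # full tree; format differs across versions
```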
Why AST?
| Approach | Pros | Cons |
|---|---|---|
| Regex on text | Simple | Breaks on formatting, comments, strings |
| Running code | Accurate | Dangerous, slow, needs dependencies |
| AST parsing ✅ | Safe, accurate, fast | Requires understanding tree structure |
pdperf uses Python's built-in ast module — the same parser Python itself uses. This means:
- ✅ 100% safe — we never execute your code
- ✅ Handles all Python syntax — even complex expressions
- ✅ Zero false positives from comments/strings — AST ignores them
import ast
# This is what pdperf does internally:
with open("your_file.py", encoding="utf-8") as fh:
    source_code = fh.read()
tree = ast.parse(source_code) # Convert text → tree
Step 2: Tree Traversal with the Visitor Pattern
What is the Visitor Pattern?
Instead of manually searching the tree, we use a Visitor — an object that automatically walks through every node in the tree and lets us react to specific node types.
Think of it like a security scanner at an airport:
- The scanner (visitor) checks every bag (node)
- It only alerts on specific items (patterns we care about)
- It doesn't modify anything — just observes
How pdperf implements this:
class PandasPerfVisitor(ast.NodeVisitor):
    def visit_For(self, node):
        # Called for every 'for' loop in the code
        # Check if iterating over iterrows/itertuples
        ...
    def visit_Call(self, node):
        # Called for every function call
        # Check for concat(), apply(axis=1), etc.
        ...
Why this is elegant:
- Python automatically walks the entire tree
- We only write code for patterns we care about
- Adding new rules = adding new visit_X methods
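For readers who have not used ast.NodeVisitor before, here is a small runnable visitor. MethodCallCollector is a hypothetical class written for this illustration; it records every method-style call, which is the raw material most of the rules above work from.

```python
import ast


class MethodCallCollector(ast.NodeVisitor):
    """Record (line, method_name) for every `obj.method(...)` call."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Attribute):
            self.calls.append((node.lineno, node.func.attr))
        # Without this, calls nested in arguments or receivers would be skipped.
        self.generic_visit(node)


source = (
    "frames = [pd.read_csv(p) for p in paths]\n"
    "df = pd.concat(frames).reset_index(drop=True)\n"
)
collector = MethodCallCollector()
collector.visit(ast.parse(source))
print(sorted(collector.calls))
```

One detail worth noting: once a visit_X method is defined, the visitor only descends into that node's children if the method calls generic_visit itself.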
Step 3: Context Tracking (Loop Detection)
Many anti-patterns are only problematic inside loops. For example:
- pd.concat() outside a loop → ✅ Fine
- pd.concat() inside a loop → ❌ O(n²) performance
How we track loop context:
class PandasPerfVisitor(ast.NodeVisitor):
    def __init__(self):
        self._loop_stack = []  # Track nested loops
    def visit_For(self, node):
        self._loop_stack.append(node)  # Enter loop
        self.generic_visit(node)       # Check children
        self._loop_stack.pop()         # Exit loop
    def _in_loop(self):
        return len(self._loop_stack) > 0
This enables rules like:
- PPO003: concat in loop (only flagged when _in_loop() == True)
- PPO009: groupby in loop
- PPO010: sort_values in loop
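The loop-stack idea can be exercised end to end with a self-contained variant. This sketch is an assumption-laden extension of the snippet above: LoopAwareVisitor is a hypothetical name, and treating while and async for loops the same way as for loops is a choice made here, not something the source confirms.

```python
import ast


class LoopAwareVisitor(ast.NodeVisitor):
    """Flag `.concat(...)` calls together with their loop nesting depth."""

    def __init__(self):
        self._loop_stack = []
        self.findings = []

    def _visit_loop(self, node):
        self._loop_stack.append(node)      # enter loop
        try:
            self.generic_visit(node)       # visit the loop body
        finally:
            self._loop_stack.pop()         # always leave the loop

    # Assumption: all loop statements count as loop context.
    visit_For = visit_While = visit_AsyncFor = _visit_loop

    def visit_Call(self, node):
        if (
            self._loop_stack
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "concat"
        ):
            self.findings.append((node.lineno, len(self._loop_stack)))
        self.generic_visit(node)


source = (
    "df = pd.concat(frames)\n"                 # outside any loop: fine
    "for f in files:\n"
    "    df = pd.concat([df, load(f)])\n"      # depth 1
    "    while more():\n"
    "        df = pd.concat([df, nxt()])\n"    # depth 2
)
visitor = LoopAwareVisitor()
visitor.visit(ast.parse(source))
print(visitor.findings)
```

A stack rather than a boolean flag keeps nested loops correct: leaving the inner loop must not clear the context of the outer one.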
Step 4: Pattern Matching
Each rule looks for a specific AST pattern. Here's how the most important ones work:
PPO001: iterrows/itertuples Detection
Pattern: A For loop where the iterator is a call to .iterrows() or .itertuples()
def visit_For(self, node):
    if isinstance(node.iter, ast.Call):
        if isinstance(node.iter.func, ast.Attribute):
            if node.iter.func.attr in ("iterrows", "itertuples"):
                self._add_finding("PPO001", node)
    self.generic_visit(node)  # keep walking into the loop body
Visual breakdown:
for idx, row in df.iterrows():
│ └─ Attribute(attr="iterrows")
└── For.iter = Call(func=Attribute...)
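The same pattern can be tried out as a standalone script. IterrowsVisitor below is a hypothetical, stripped-down stand-in for the real visitor; it stores plain tuples instead of calling _add_finding.

```python
import ast


class IterrowsVisitor(ast.NodeVisitor):
    """Flag `for ... in <expr>.iterrows()` and `.itertuples()` loops."""

    def __init__(self):
        self.findings = []

    def visit_For(self, node):
        it = node.iter
        if (
            isinstance(it, ast.Call)
            and isinstance(it.func, ast.Attribute)
            and it.func.attr in ("iterrows", "itertuples")
        ):
            self.findings.append((node.lineno, node.col_offset, it.func.attr))
        self.generic_visit(node)  # also inspect nested loops


source = (
    "for idx, row in df.iterrows():\n"
    "    for t in other.itertuples(index=False):\n"
    "        pass\n"
    "for x in range(3):\n"
    "    pass\n"
    "rows = list(df.iterrows())\n"  # not a for loop, so this sketch ignores it
)
visitor = IterrowsVisitor()
visitor.visit(ast.parse(source))
print(visitor.findings)
```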
PPO003: concat in Loop Detection
Pattern: A call to .concat() or pd.concat() while inside a loop
def visit_Call(self, node):
    if self._in_loop():  # Only flag inside loops
        if isinstance(node.func, ast.Attribute):
            if node.func.attr == "concat":
                self._add_finding("PPO003", node)
    self.generic_visit(node)  # keep walking into nested calls
PPO004: Chained Indexing Detection
Pattern: Assignment where the target is df[x][y] = value
This is tricky because we need to detect nested subscripts on the left side of an assignment:
df[mask]["col"] = value
│ │ │
│ │ └── Subscript (inner)
│ └──────── Subscript (outer)
└─────────── This is the assignment target
def visit_Assign(self, node):
    for target in node.targets:
        if isinstance(target, ast.Subscript):
            if isinstance(target.value, ast.Subscript):
                # Nested subscript = chained indexing!
                self._add_finding("PPO004", target)
    self.generic_visit(node)  # keep walking into the assigned value
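A runnable version of this check shows both what it catches and where an import-agnostic match overreaches. ChainedAssignVisitor is a hypothetical name for this sketch, and the nested-list case is included on purpose to show a likely false positive under these assumptions.

```python
import ast


class ChainedAssignVisitor(ast.NodeVisitor):
    """Flag assignments whose target looks like `x[a][b] = value`."""

    def __init__(self):
        self.findings = []

    def visit_Assign(self, node):
        for target in node.targets:
            if isinstance(target, ast.Subscript) and isinstance(target.value, ast.Subscript):
                self.findings.append((target.lineno, ast.unparse(target)))
        self.generic_visit(node)


source = (
    "df[df['a'] > 0]['b'] = 10\n"      # chained assignment: flagged
    "df.loc[df['a'] > 0, 'b'] = 10\n"  # single .loc subscript: not flagged
    "value = df[mask]['b']\n"          # chained read, not an assignment target
    "grid[0][1] = 5\n"                 # nested list: flagged without type information
)
visitor = ChainedAssignVisitor()
visitor.visit(ast.parse(source))
flagged_lines = [line for line, _ in visitor.findings]
print(flagged_lines)
```

Without type inference, a purely structural match like this one has no way to distinguish a DataFrame from a list of lists.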
Step 5: Confidence Scoring
Not all detections are equally reliable. pdperf includes a confidence score with each finding:
| Level | Meaning | Example |
|---|---|---|
| High | Structural match, very reliable | iterrows() in for loop |
| Medium | Heuristic, some false positives possible | groupby().apply() |
| Low | Suggestion only | (future rules) |
@dataclass
class Finding:
    rule_id: str
    confidence: Confidence  # HIGH, MEDIUM, LOW
    confidence_reason: str  # Human-readable explanation
Why this matters:
- CI can filter: --min-confidence high
- Users understand reliability of each finding
- Reduces "alert fatigue" from uncertain warnings
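Filtering by confidence could work along these lines. The sketch assumes an ordered enum so levels can be compared; the Confidence values and the filter_by_confidence helper are hypothetical and may differ from pdperf's own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Assumed ordering so that levels can be compared with >=.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Finding:
    rule_id: str
    confidence: Confidence
    confidence_reason: str


def filter_by_confidence(findings, minimum):
    """Keep only findings at or above the requested confidence level."""
    return [f for f in findings if f.confidence >= minimum]


findings = [
    Finding("PPO001", Confidence.HIGH, "structural AST match"),
    Finding("PPO007", Confidence.MEDIUM, "heuristic: any groupby().apply()"),
]
strict = filter_by_confidence(findings, Confidence.HIGH)
print([f.rule_id for f in strict])
```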
Step 6: Deterministic Output
For CI/CD reliability, pdperf guarantees deterministic output:
# Findings are always sorted by:
findings.sort(key=lambda f: (f.path, f.line, f.col, f.rule_id))
This means:
- Same code → same JSON output
- No flaky CI builds
- Diffs are meaningful
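The effect of the sort key is easy to verify: feed the same findings in two different orders and compare the rendered output. The Finding fields and the render helper below are simplified assumptions for this demonstration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    col: int
    rule_id: str


def render(findings):
    """Serialize findings in a stable order, independent of discovery order."""
    ordered = sorted(findings, key=lambda f: (f.path, f.line, f.col, f.rule_id))
    return json.dumps([asdict(f) for f in ordered], indent=2, sort_keys=True)


a = Finding("src/etl.py", 45, 12, "PPO001")
b = Finding("src/etl.py", 67, 8, "PPO003")
c = Finding("features/pipeline.py", 23, 15, "PPO002")

# Two different discovery orders produce byte-identical reports.
first = render([a, b, c])
second = render([c, b, a])
print(first == second)
```

Including rule_id as the final key matters when two rules fire at the same position; without it the tie would fall back to discovery order.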
Architecture Summary
┌─────────────────────────────────────────────────────────────┐
│ pdperf │
├─────────────────────────────────────────────────────────────┤
│ cli.py │ Entry point, argument parsing, output │
│ analyzer.py │ AST parsing, visitor, finding creation │
│ rules.py │ Rule definitions, severity, messages │
│ config.py │ pyproject.toml loading, profiles │
│ reporting.py │ JSON, text, SARIF output formatting │
└─────────────────────────────────────────────────────────────┘
| File | Responsibility | Key Classes/Functions |
|---|---|---|
| analyzer.py | Core detection engine | PandasPerfVisitor, Finding, analyze_path |
| rules.py | Rule registry | Rule, Severity, Confidence, RULES dict |
| config.py | Configuration | Config, load_config, PROFILES |
| cli.py | User interface | build_parser, cmd_scan, cmd_explain |
| reporting.py | Output formatting | format_text, write_json, write_sarif |
Algorithms & Complexity
| Operation | Algorithm | Complexity |
|---|---|---|
| AST parsing | Python's built-in parser | O(n) where n = file size |
| Tree traversal | Depth-first visitor | O(nodes) — visits each node once |
| Pattern matching | Direct attribute checks | O(1) per node |
| Finding sorting | Timsort | O(k log k) where k = findings |
Total complexity: O(n + k log k) for a single file. Because findings (k) are normally far fewer than AST nodes, this is effectively linear in code size.
Benchmark: pdperf scans ~10,000 lines/second on typical hardware.
Why This Approach Works
| Design Choice | Benefit |
|---|---|
| AST, not regex | Handles all valid Python syntax correctly |
| Visitor pattern | Clean separation, easy to add rules |
| Loop stack | Context-aware detection (loop vs. not-loop) |
| No type inference | Fast, no dependencies, works on any code |
| Confidence levels | Users trust findings at appropriate level |
| Deterministic output | Reliable CI integration |
Limitations (Honest Assessment)
| Limitation | Why It Exists | Mitigation |
|---|---|---|
| No type inference | Would require running code | Use --ignore for false positives |
| Import-agnostic | Can flag non-pandas .values | Filter with --select |
| Syntax errors skip file | Can't parse invalid Python | Use --fail-on-parse-error |
| No cross-file analysis | Keeps tool simple and fast | May miss imported patterns |
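The import-agnostic limitation is easiest to see with a concrete case. The sketch below is a guess at how a .values check might be written, not pdperf's actual PPO006 logic: find_values_attribute is a hypothetical helper that flags .values attribute reads which are not immediately called.

```python
import ast


def find_values_attribute(source):
    """Return line numbers of `.values` attribute reads that are not method calls."""
    tree = ast.parse(source)
    # Attribute nodes used as the callee of a call, e.g. the `d.values` in `d.values()`.
    called = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    return sorted(
        n.lineno
        for n in ast.walk(tree)
        if isinstance(n, ast.Attribute) and n.attr == "values" and id(n) not in called
    )


sample = (
    "arr = df.values\n"           # pandas usage: the intended target
    "items = config.values()\n"   # dict method call: skipped
    "v = namespace.values\n"      # non-pandas attribute: flagged anyway
)
print(find_values_attribute(sample))
```

The third line is the kind of false positive the table refers to, and --ignore or --select is the documented way to suppress it.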
Extending pdperf
Want to add a new rule? Here's the template:
# 1. Define in rules.py
PPO011 = register_rule(Rule(
    rule_id="PPO011",
    name="your-rule-name",
    severity=Severity.WARN,
    message="...",
    suggested_fix="...",
    confidence=Confidence.HIGH,
))
# 2. Detect in analyzer.py
def visit_Call(self, node):
    if self._should_check("PPO011"):
        if your_detection_logic(node):
            self._add_finding("PPO011", node)
🔌 Integrations
CI: Fail PRs on Errors
pdperf scan . --format json --out pdperf.json --fail-on error
Pre-commit Hook
Add to .pre-commit-config.yaml:
repos:
  - repo: local
    hooks:
      - id: pdperf
        name: pdperf (pandas performance linter)
        entry: pdperf scan --fail-on error
        language: python
        types: [python]
GitHub Actions
name: Lint
on: [push, pull_request]
jobs:
  pdperf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e .
      - run: pdperf scan src/ --format sarif --out results.sarif --fail-on error
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: results.sarif
✅ Verification
Run Tests
# Install dev dependencies
pip install -e ".[dev]"
pip install pytest
# Run all tests (33 tests)
python -m pytest tests/ -v
Verify Installation
# Check version
pdperf --version
# → pdperf 0.1.0
# List rules (should show 8 rules)
pdperf rules
# Test on example files
pdperf scan examples/
📁 Project Structure
pandas-perf-optimizer/
├── src/pandas_perf_opt/
│ ├── __init__.py # Package version
│ ├── analyzer.py # AST-based detection engine
│ ├── cli.py # Command-line interface
│ ├── reporting.py # JSON/text/SARIF output
│ └── rules.py # Rule definitions & explanations
├── tests/
│ ├── test_rules.py # 33 golden tests
│ └── test_smoke.py # Version test
├── examples/
│ ├── slow_iterrows.py # PPO001 example
│ ├── slow_apply_axis1.py # PPO002 example
│ └── slow_concat_in_loop.py # PPO003 example
├── pyproject.toml # Package configuration
├── Makefile # Dev commands
└── README.md # This file
🔧 Supported Versions
| Dependency | Supported |
|---|---|
| Python | 3.10+ |
| Pandas | 1.5+, 2.x (detection is version-agnostic) |
📚 References
- Pandas Performance Guide: Official pandas performance tips
- SettingWithCopyWarning Explained: Real Python guide
- DataFrame.to_numpy(): Why .to_numpy() over .values
- DataFrame.append() Deprecation: Pandas 1.4+ deprecation notice
- Ruff PD011: Ruff's .values rule (similar to PPO006)