Skip to main content

High-performance parallel filesystem walker

Project description

python-pwalk

PyPI Downloads License Python Version Build Status

A high-performance toolkit for filesystem analysis and reporting — optimized for petabyte-scale filesystems and HPC environments.

Generate comprehensive filesystem metadata reports 5-10x faster than traditional tools, with true multi-threading, intelligent buffering, and automatic zstd compression achieving 23x size reduction. Perfect for system administrators, data scientists, and HPC users working with millions or billions of files.

Why python-pwalk?

Traditional filesystem traversal tools struggle with modern storage systems containing billions of files. python-pwalk solves this with a battle-tested approach combining Python's ease of use with C's raw performance.

Key Features

  • 🚀 Extreme Performance: 8,000-30,000 files/second — traverse 50 million files in ~41 minutes
  • 🔄 True Parallelism: Multi-threaded C implementation using up to 32 threads
  • 🗜️ 23x Compression: Automatic zstd compression reduces 100 GB CSV to 4 GB
  • 📦 Zero Dependencies: No PyArrow, no numpy — just Python + C
  • 🔌 Drop-in Replacement: 100% compatible with os.walk() API
  • 💾 Memory Efficient: Thread-local buffers with automatic spillover for billions of files
  • 🎯 HPC Ready: SLURM-aware, .snapshot filtering, cross-filesystem detection
  • 🦆 DuckDB Native: Output .csv.zst files readable directly by DuckDB
  • 🛡️ Production Tested: Based on John Dey's pwalk used in HPC environments worldwide

Perfect For

  • System Administrators: Audit multi-petabyte filesystems in minutes
  • Data Scientists: Analyze file distributions across massive datasets
  • HPC Users: Track storage usage in supercomputing environments
  • Storage Teams: Generate reports for NetApp, Lustre, GPFS filesystems
  • Compliance: Create auditable records of filesystem contents

Installation

pip install pwalk

That's it! Pre-compiled binary wheels with zstd compression are available for:

  • Linux: x86_64 (manylinux2014)
  • CPython: 3.10, 3.11, 3.12, 3.13, 3.14
  • PyPy: 3.10, 3.11 (fast JIT-compiled Python alternative)
  • Free-threading wheels: Python 3.13t, 3.14t (experimental no-GIL builds)

No system dependencies needed — wheels include everything pre-compiled!

To install for free-threading Python or PyPy:

python3.13t -m pip install pwalk          # Free-threading CPython
PYTHON_GIL=0 python3.13t your_script.py   # Run with GIL disabled
pypy3 -m pip install pwalk                # PyPy (JIT-compiled)

Quick Start

30-Second Example

from pwalk import walk, report

# 1. Drop-in replacement for os.walk() — 100% compatible API
for dirpath, dirnames, filenames in walk('/data'):
    print(f"{dirpath}: {len(filenames)} files")

# 2. Generate compressed filesystem report (5-10x faster with multi-threading!)
output, errors = report('/data', compress='zstd')
# Creates scan.csv.zst - 23x smaller than plain CSV!

# 3. Analyze with DuckDB
import duckdb
df = duckdb.connect().execute(f"SELECT * FROM '{output}'").fetchdf()
print(df.head())

Note on Performance: The walk() function uses os.walk() under the hood (single-threaded) for maximum compatibility across Python versions. For 5-10x faster performance, use report() which leverages our multi-threaded C implementation. In Python 3.13+ with free-threading (no-GIL mode), walk() will automatically use parallel traversal for massive speedups!

Basic Usage

from pwalk import walk

# 100% compatible with os.walk() API
for dirpath, dirnames, filenames in walk('/data'):
    print(f"Directory: {dirpath}")
    print(f"  Subdirectories: {len(dirnames)}")
    print(f"  Files: {len(filenames)}")

Full API Compatibility

from pwalk import walk

# All os.walk() parameters supported
for dirpath, dirnames, filenames in walk(
    '/data',
    topdown=True,           # Process parents before children
    onerror=handle_error,   # Error callback
    followlinks=False       # Don't follow symlinks
):
    # Modify dirnames in-place to prune traversal
    dirnames[:] = [d for d in dirnames if not d.startswith('.')]

Advanced: Thread Control

from pwalk import walk

# Explicit thread count (default: cpu_count() or SLURM_CPUS_ON_NODE)
for dirpath, dirnames, filenames in walk('/data', max_threads=16):
    process_directory(dirpath, filenames)

# Traverse snapshots (disabled by default)
for dirpath, dirnames, filenames in walk('/data', ignore_snapshots=False):
    ...

Filesystem Metadata Reports

CSV Output with Zstd Compression (Default)

from pwalk import report

# Generate compressed CSV (8-10x smaller, DuckDB compatible)
output, errors = report(
    '/data',
    output='scan.csv',
    compress='zstd'  # or 'gzip', 'auto', 'none'
)

print(f"Report saved to: {output}")
print(f"Inaccessible directories: {len(errors)}")

CSV Format (100% compatible with John Dey's pwalk):

inode,parent-inode,directory-depth,"filename","fileExtension",UID,GID,st_size,st_dev,st_blocks,st_nlink,"st_mode",st_atime,st_mtime,st_ctime,pw_fcount,pw_dirsum

Compression Options:

  • compress='auto': Use zstd if available, else uncompressed (default)
  • compress='zstd': Force zstd compression (8-10x, fast)
  • compress='gzip': Use gzip compression (6-7x, slower but universal)
  • compress='none': No compression

DuckDB Analysis Workflow

# 1. Generate compressed CSV report
from pwalk import report
output, errors = report('/data', output='scan.csv', compress='zstd')

# 2. DuckDB reads .csv.zst natively!
import duckdb
con = duckdb.connect('fs_analysis.db')
con.execute("CREATE TABLE fs AS SELECT * FROM 'scan.csv.zst'")

# 3. Answer questions like "Who used the last 10TB?"
result = con.execute("""
    SELECT
        uid,
        count(*) as file_count,
        sum(st_size) / (1024*1024*1024*1024) as size_tb
    FROM fs
    WHERE st_ctime > unixepoch(now() - INTERVAL 7 DAY)
    GROUP BY uid
    ORDER BY size_tb DESC
    LIMIT 10
""").fetchdf()
print(result)

# 4. Optional: Convert to Parquet for long-term storage
con.execute("""
    COPY (SELECT * FROM 'scan.csv.zst')
    TO 'scan.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")

Filesystem Repair (Root Only)

from pwalk import repair

# Dry-run to preview changes
repair(
    '/shared',
    dry_run=True,
    change_gids=[1234, 5678],      # Treat these GIDs like private groups
    force_group_writable=True,      # Ensure group read+write+execute
    exclude=['/shared/archive']     # Skip specific paths
)

# Apply changes
repair('/shared', dry_run=False, force_group_writable=True)

Real-World Use Cases

1. Answer Critical Storage Questions Fast

from pwalk import report
import duckdb

# Generate comprehensive filesystem metadata
report('/data', compress='zstd')  # Done in minutes, not hours!

# Who used the last 10 TB this week?
con = duckdb.connect()
result = con.execute("""
    SELECT UID, COUNT(*) as files, SUM(st_size)/1e12 as TB
    FROM 'scan.csv.zst'
    WHERE st_ctime > unixepoch(now() - INTERVAL 7 DAY)
    GROUP BY UID ORDER BY TB DESC LIMIT 10
""").fetchdf()

2. Find Storage Hogs and Opportunities

# Find directories with >1M files (performance issues!)
huge_dirs = con.execute("""
    SELECT "filename", pw_fcount, pw_dirsum/1e9 as GB
    FROM 'scan.csv.zst'
    WHERE pw_fcount > 1000000
    ORDER BY pw_fcount DESC
""").fetchdf()

# Find ancient files (cleanup candidates)
old_files = con.execute("""
    SELECT "filename", st_size, datetime(st_mtime, 'unixepoch')
    FROM 'scan.csv.zst'
    WHERE st_mtime < unixepoch(now() - INTERVAL 2 YEAR)
    ORDER BY st_size DESC LIMIT 100
""").fetchdf()

3. Monitor Growth Over Time

# Weekly snapshots
import schedule

def weekly_snapshot():
    timestamp = time.strftime('%Y%m%d')
    report('/data', output=f'snapshot_{timestamp}.csv.zst')

schedule.every().sunday.at("02:00").do(weekly_snapshot)

Performance

Multi-Threading Support Matrix

Python Version walk() report()
CPython 3.10
CPython 3.11
CPython 3.12
CPython 3.13
CPython 3.13t (No GIL)
CPython 3.14
CPython 3.14t (No GIL)
PyPy 3.10
PyPy 3.11

Legend: ✅ = Multi-threaded (5-10x faster) | ❌ = Single-threaded

Key Takeaway: Use report() for maximum performance on all Python versions!

Current Performance (Python 3.10-3.14)

walk() function: Uses os.walk() internally (single-threaded) for 100% compatibility

  • Same speed as os.walk() — perfect drop-in replacement
  • No threading overhead, works everywhere

report() function: Multi-threaded C implementation (5-10x faster!)

  • Speed: 8,000-30,000 stat operations per second
  • Example: 50 million files in ~41 minutes at 20K stats/sec
  • Parallelism: Up to 32 threads (releases GIL, works on all Python implementations)
  • Scaling: Performance depends on storage system, host CPU, and file layout
  • Compression: Zstd reduces CSV size by 23x with minimal overhead
  • PyPy Compatible: Multi-threading works on PyPy through C extension (cpyext)

Future Performance (Python 3.13+ Free-Threading)

What's Changing? Python 3.13 introduced optional "free-threading" mode (also called "no-GIL mode").

The Global Interpreter Lock (GIL) Explained: For decades, Python had a "global lock" that prevented multiple threads from running Python code simultaneously. This meant that even with multiple CPU cores, only one thread could execute Python code at a time. Python 3.13+ can optionally remove this lock, allowing true parallel execution.

What This Means for pwalk:

  • CPython 3.13+ with free-threading: walk() will automatically use parallel traversal for 5-10x speedup
  • CPython 3.10-3.12: walk() uses os.walk() (single-threaded)
  • PyPy (all versions): walk() uses os.walk() (single-threaded, but JIT-compiled Python is faster)
  • report() is always fast: Multi-threaded C code releases GIL - works on all implementations!

How to Get Free-Threading Python (Python 3.13+):

Free-threading Python builds are now available! Here's how to get them:

Option 1: Official Python.org Installers (Easiest)

# Download from python.org
# Look for "Free-threaded" builds (separate downloads)
# https://www.python.org/downloads/

# On Linux/macOS, the free-threaded interpreter has a 't' suffix
python3.13t --version  # Should show "Python 3.13.x (free-threaded)"

Option 2: Build from Source (Linux/macOS)

# Clone Python source
git clone https://github.com/python/cpython.git
cd cpython
git checkout v3.13.0  # or latest 3.13.x tag

# Configure with free-threading
./configure --disable-gil
make -j$(nproc)
sudo make install

# Verify
python3.13 --version
python3.13 -c "import sys; print(f'GIL disabled: {not sys._is_gil_enabled()}')"

Option 3: Docker/Conda (Recommended for Testing)

# Using official Python Docker image
docker run -it python:3.13-slim python3 -c "import sys; print(sys._is_gil_enabled())"

# Conda (check for free-threading builds)
conda install python=3.13

Option 4: pyenv (Developers)

# Install pyenv if not already installed
curl https://pyenv.run | bash

# Install free-threaded Python 3.13
pyenv install 3.13.0t  # 't' suffix for free-threaded build
pyenv local 3.13.0t

# Verify
python --version
python -c "import sys; print(f'GIL: {sys._is_gil_enabled()}')"

Using Free-Threading with pwalk:

# Install pwalk
python3.13t -m pip install pwalk

# IMPORTANT: Set PYTHON_GIL=0 to keep GIL disabled
export PYTHON_GIL=0

# Run your script
python3.13t your_script.py

# Or inline:
PYTHON_GIL=0 python3.13t your_script.py

# Verify it's working
PYTHON_GIL=0 python3.13t -c "import sys; print(f'Free-threading: {not sys._is_gil_enabled()}')"

Important: By default, Python 3.13t will enable the GIL when loading C extensions that haven't declared GIL-free compatibility. Use PYTHON_GIL=0 or -Xgil=0 to keep it disabled.

Note: As of 2025, free-threading is still experimental. Some packages may not be compatible yet. For production use today, stick with report() which is always multi-threaded!

Technical Architecture

  • Single Optimized C Extension: _pwalk_core — 320 lines of highly optimized C
  • Thread-Local Buffers: 512KB per thread, zero lock contention during traversal
  • Multithreaded Traversal: Up to 32 parallel threads using John Dey's proven algorithm
  • Streaming Compression: Zstd level 1 for 23x compression at 200-400 MB/s
  • SLURM Integration: Auto-detects SLURM_CPUS_ON_NODE for HPC environments
  • Zero Dependencies: No external Python packages — ships ready to run

Why No Dependencies Matters

Unlike other tools that require PyArrow (~50 MB), numpy, pandas, etc., pwalk installs in seconds with zero dependencies. This means:

  • ✅ Instant installation on air-gapped HPC systems
  • ✅ No version conflicts with existing packages
  • ✅ Minimal attack surface for security-conscious environments
  • ✅ Works everywhere Python 3.10+ works

Contributing

Contributions welcome! Based on the rock-solid filesystem-reporting-tools by John Dey.

License

GPL v2 — Same as John Dey's original pwalk implementation.

Links


Built with ❤️ for system administrators and HPC users worldwide. Based on John Dey's battle-tested pwalk implementation.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pwalk-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.4 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

pwalk-0.1.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (38.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ i686manylinux: glibc 2.5+ i686

pwalk-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.8 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

pwalk-0.1.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (39.3 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ i686manylinux: glibc 2.5+ i686

pwalk-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.0 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

pwalk-0.1.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (38.5 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ i686manylinux: glibc 2.5+ i686

File details

Details for the file pwalk-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d307707fc8538de923223dd86ed75177f030bd40ca42ed1d153b421266c21f4e
MD5 ac3c5c938ac2f23e48c260b9746154c1
BLAKE2b-256 6ce8f9fa12bec7e23eb306015c3ec3ce15f593c17205473e93df0c9abc6ad128

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pwalk-0.1.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 0fed75385c428bdc32beefc59f86804efc4add289b81aa79bef6f82197129981
MD5 74b97b14bea59daa49bde2e5a05881a6
BLAKE2b-256 cf556518d0df89d316a431e0ac48c5d6eda821f93f364679a9a80477bc2e21cb

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pwalk-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 857f64d6d8fef3d400f1d34f81c2007e9ef3f1125429f565edd0b8d34f73c6e0
MD5 07c980c8a27ac25185164eb43d59afe8
BLAKE2b-256 9c83f161cf474f4843b8245fc85c486cfe78e14381daef70e5b2f6de59ebee6f

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pwalk-0.1.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 4ed7868e4343ee9d57758fc3f47b3a016dff286810165907d4e1a9f5f4d250b4
MD5 569f6a6f92afb8daee7e4688748588d0
BLAKE2b-256 b5914f2fda2a61cc39eff47961e4932c5361802ca58a1467f19c42fa242bfc5a

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pwalk-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5382ca7447ea561ec1cf56c1af881d221d0582e9911dc947447586ff855babcb
MD5 e7dc5d67d46a257d48c0c8efdd4e766d
BLAKE2b-256 45c0027fe7a35e551f869507518f30f8bf09b27cebdeeef85368665559f36304

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pwalk-0.1.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for pwalk-0.1.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 f73f672242ba6580fc8e61129506dcf2c390665885c22110949e7800463c5744
MD5 4b08dd0291124e876b55daa5fd456e33
BLAKE2b-256 318a004c47dc7100990094c0d483e557856397526e7eca797a02f50559392a98

See more details on using hashes here.

Provenance

The following attestation bundles were made for pwalk-0.1.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl:

Publisher: publish-pypi.yml on dirkpetersen/python-pwalk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page