Concurrent directory tree scanner for Python 3.12+
dscan
dscan is a concurrent directory scanner for Python 3.12+. It wraps os.scandir in a thread pool with a work-stealing queue, exposing a filtering API that covers most of what you'd otherwise implement by hand on top of os.walk.
Two modes: scan_entries yields raw os.DirEntry objects with minimal overhead; scan yields dataclass models with pre-computed metadata.
Why concurrent scanning?
On a local SSD, directory traversal is fast enough that threading adds more overhead than it saves. scan_entries still matches or edges out os.walk, but the real case for concurrency is network-attached storage.
On SMB shares, NFS mounts, or any high-latency filesystem, each scandir call blocks waiting for a server response. os.walk does this serially, one directory at a time. dscan keeps multiple directories in flight simultaneously, so workers aren't sitting idle while the network responds. On deep trees with many subdirectories, the saved round trips add up: the serial cost grows with the number of directories, while the concurrent cost grows with the number of directories divided by the number of workers.
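To make the idea concrete, here is a minimal sketch of concurrent traversal built on the standard library's ThreadPoolExecutor. It is not dscan's implementation (there is no work-stealing queue and no filtering here), and the names list_dir and concurrent_walk are hypothetical; it only shows how several scandir calls can be kept in flight at once.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def list_dir(path):
    """List one directory, returning (file_paths, subdirectory_paths)."""
    files, subdirs = [], []
    try:
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    files.append(entry.path)
    except OSError:
        # Unreadable directories are skipped in this sketch.
        pass
    return files, subdirs


def concurrent_walk(root, max_workers=8):
    """Collect every non-directory path under root using a thread pool."""
    found = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                files, subdirs = future.result()
                found.extend(files)
                # Every discovered subdirectory becomes its own task, so slow
                # listings overlap instead of queueing behind one another.
                pending.update(pool.submit(list_dir, d) for d in subdirs)
    return found
```

Results arrive in completion order rather than in os.walk's top-down order, which is the usual trade-off for overlapping the blocking calls.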
Benchmarks
Local SSD (~4M entries, MacBook)
| | entries | time |
|---|---|---|
| os.walk (no stat) | 4,046,505 | 33.30s |
| os.walk (+ stat) | 4,039,313 | 85.24s |
| dscan.scan_entries | 4,046,502 | 31.90s |
| dscan.scan (models) | 4,014,758 | 140.15s |
scan_entries is on par with bare os.walk. scan is slower because stat calls happen on the main thread serially — the workers parallelise scandir, not stat. Use scan when you want the structured output; use scan_entries when throughput matters.
Simulated network latency (5ms per directory)
# rough simulation
import time, os
_real = os.scandir
os.scandir = lambda p: (time.sleep(0.005), _real(p))[1]
| | time |
|---|---|
| os.walk | ~linear with directory count |
| dscan.scan_entries | scales with max_workers |
At 5ms latency per directory, a tree with 10,000 directories takes ~50s serially. With 16 workers dscan brings that to ~4s. The deeper and wider the tree, the bigger the difference.
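The arithmetic behind those figures can be sketched as a back-of-the-envelope estimate. The helper below is hypothetical (it is not part of dscan) and assumes every directory listing costs exactly the same latency and that the tree is wide enough to keep all workers busy, so it gives an idealised lower bound rather than a measurement.

```python
import math


def estimated_seconds(directories, latency_s, workers=1):
    """Idealised wall time when each directory listing costs latency_s."""
    # With perfect load balancing, listings complete in
    # ceil(directories / workers) rounds of latency_s each.
    return math.ceil(directories / workers) * latency_s


serial = estimated_seconds(10_000, 0.005)
pooled = estimated_seconds(10_000, 0.005, workers=16)
print(f"serial: {serial:.1f}s, 16 workers: {pooled:.1f}s")
```

The ideal figure for 16 workers comes out a little above 3s, so the ~4s quoted above leaves room for scheduling overhead and uneven subtrees.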
Installation
pip install dscanpy
Requires Python 3.12+. No other dependencies. The distribution is published as dscanpy; the import name is dscan.
Usage
Basic scan
from dscan import scan

for entry in scan("."):
    print(f"{entry.name} - {entry.path}")
Raw entries (lower overhead)
from dscan import scan_entries

for entry in scan_entries("~/Documents", max_depth=2):
    if entry.is_file():
        print(entry.name)
Filtering
Extensions
from dscan import scan

# Only Python and Markdown files
for file in scan(".", extensions={".py", ".md"}):
    print(file.path)

# Skip binary files
for file in scan(".", ignore_extensions={".bin", ".exe"}):
    print(file.path)
Glob patterns
from dscan import scan

# Only test files
for entry in scan(".", match="test_*"):
    print(entry.name)

# Skip hidden files and directories
for entry in scan(".", ignore_pattern=".*"):
    print(entry.name)
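For readers unfamiliar with glob syntax, the snippet below shows which names the two patterns above would select. It uses the standard library's fnmatch module; whether dscan matches names with fnmatch internally is an assumption, not something this README states.

```python
from fnmatch import fnmatchcase

names = ["test_api.py", "api.py", ".gitignore", "README.md"]

# "test_*" keeps names that start with "test_".
matched = [n for n in names if fnmatchcase(n, "test_*")]

# ".*" matches names that start with a dot, so negating it hides dotfiles.
visible = [n for n in names if not fnmatchcase(n, ".*")]

print(matched)
print(visible)
```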
Directory traversal
from dscan import scan

# Immediate children only
for entry in scan(".", max_depth=0):
    print(entry.name)

# Only descend into src/ and lib/
for entry in scan(".", only_dirs=["src", "lib"]):
    print(entry.path)

# Skip specific directories
# .git, .idea, .venv, __pycache__ are skipped by default
for entry in scan(".", ignore_dirs=["node_modules", "dist"]):
    print(entry.path)

# Disable all default ignores
for entry in scan(".", ignore_dirs=[]):
    print(entry.path)
Custom filter
from dscan import scan

def is_large_file(entry):
    # Keep only regular files larger than 1 MB (1,000,000 bytes)
    return entry.is_file() and entry.stat().st_size > 1_000_000

for entry in scan(".", custom_filter=is_large_file):
    print(entry.name)
Tuning workers
from dscan import scan_entries

# default is min(32, cpu_count * 2)
# increase on high-latency mounts
for entry in scan_entries("/mnt/nas", max_workers=32):
    print(entry.path)
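To see what that default works out to on a given machine, the formula from the comment above can be evaluated directly. The function name default_workers is hypothetical, and the fallback to 1 when os.cpu_count() returns None is an assumption about how such a default would typically be guarded.

```python
import os


def default_workers():
    """Evaluate min(32, cpu_count * 2) for the current machine."""
    # os.cpu_count() may return None when the count cannot be determined.
    return min(32, (os.cpu_count() or 1) * 2)


print(default_workers())
```

Because the workers spend most of their time blocked on I/O on a network mount, a value above the CPU count is reasonable there; on a local SSD a higher count is unlikely to help.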
Data Models
scan() yields FileEntry or DirectoryEntry dataclasses.
FileEntry
| field | description |
|---|---|
| name | filename without extension |
| extension | lowercase extension, no leading dot |
| path | full path |
| dir_path | containing directory |
| size | bytes |
| created_at | datetime |
| modified_at | datetime |
DirectoryEntry
| field | description |
|---|---|
| name | directory name |
| path | full path |
| parent_path | parent directory |
| created_at | datetime |
| modified_at | datetime |
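As an illustration of what the FileEntry fields hold, the sketch below builds an equivalent record from a path using only the standard library. The class name FileEntrySketch, the helper file_entry_from_path, the field types, and the choice of timestamp source for created_at are all assumptions; dscan's own definitions may differ.

```python
import os
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileEntrySketch:
    name: str          # filename without extension
    extension: str     # lowercase, no leading dot
    path: str
    dir_path: str
    size: int          # bytes
    created_at: datetime
    modified_at: datetime


def file_entry_from_path(path):
    """Build a FileEntrySketch from a single os.stat call."""
    st = os.stat(path)
    stem, ext = os.path.splitext(os.path.basename(path))
    # st_birthtime is not available on every platform; fall back to st_ctime.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return FileEntrySketch(
        name=stem,
        extension=ext.lstrip(".").lower(),
        path=path,
        dir_path=os.path.dirname(path),
        size=st.st_size,
        created_at=datetime.fromtimestamp(created),
        modified_at=datetime.fromtimestamp(st.st_mtime),
    )
```

Each record costs one stat call, which is consistent with the benchmark note that scan is slower than scan_entries.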
vs the stdlib
| | os.walk | pathlib.rglob | dscan |
|---|---|---|---|
| Concurrent traversal | No | No | Yes |
| Built-in models | No | No | Yes |
| Depth limit | Manual | No | Yes |
| Directory exclusions | Manual | No | Yes |
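For context on the "Manual" cells, this is roughly what a depth limit and directory exclusions look like when written by hand on top of os.walk. The function walk_limited is a hypothetical helper, and it follows the convention used earlier in this README, where max_depth=0 means immediate children only.

```python
import os


def walk_limited(root, max_depth, ignore_dirs=frozenset()):
    """Yield file paths under root, pruning by depth and directory name."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            # Clearing dirnames in place stops os.walk from descending further.
            dirnames[:] = []
        else:
            # In-place filtering prunes excluded directories before they are visited.
            dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

The in-place assignment to dirnames is the important detail: rebinding the name instead would leave os.walk's own list untouched and nothing would be pruned.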
License
MIT