Concurrent directory tree scanner for Python 3.12+


dscan


dscan is a concurrent directory scanner for Python 3.12+. It wraps os.scandir in a thread pool with a work-stealing queue, exposing a filtering API that covers most of what you'd otherwise implement by hand on top of os.walk.

Two modes: scan_entries yields raw os.DirEntry objects with minimal overhead; scan yields dataclass models with pre-computed metadata.


Why concurrent scanning?

On a local SSD, directory traversal is fast enough that threading adds more overhead than it saves. scan_entries still matches or edges out os.walk, but the real case for concurrency is network-attached storage.

On SMB shares, NFS mounts, or any high-latency filesystem, each scandir call blocks waiting for a server response. os.walk does this serially — one directory at a time. dscan keeps multiple directories in-flight simultaneously, so workers aren't sitting idle while the network responds. On deep trees with many subdirectories, this compounds significantly.


Windows + SMB: the strongest use case

On Windows, the underlying FindNextFile API returns full file metadata — including size and timestamps — in the same call as the directory listing. This means DirEntry.stat() is effectively free; no additional syscalls are needed to populate a FileEntry model.

This makes scan() model mode on Windows significantly more efficient than on Linux or macOS, where stat requires a separate syscall per entry. The structured output you get from scan() comes at almost no extra cost over scan_entries.

Combined with the concurrency win on high-latency mounts, Windows users scanning SMB network shares or mapped corporate drives get the best of both worlds: concurrent traversal and rich metadata at near-zero overhead. This is the scenario where dscan provides the clearest, most measurable improvement over os.walk.

Recommended for:

  • Corporate environments with large SMB file servers
  • NAS devices accessed over Windows network shares
  • Any mapped drive with deep directory trees

Tuning for high-latency mounts:

# Increase workers to match network latency
for entry in scan("//fileserver/share", max_workers=32):
    print(entry.path)

Benchmarks

Local SSD (~4M entries, MacBook)

                      entries      time
os.walk (no stat)     4,046,505    33.30s
os.walk (+ stat)      4,039,313    85.24s
dscan.scan_entries    4,046,502    31.90s
dscan.scan (models)   4,014,758    140.15s

scan_entries is on par with bare os.walk. scan is slower because stat calls happen on the main thread serially — the workers parallelise scandir, not stat. Use scan when you want the structured output; use scan_entries when throughput matters.

Note: This benchmark was run on macOS where stat requires a separate syscall per entry. On Windows, scan() performance is substantially better due to FindNextFile bundling metadata. See the Windows + SMB section above.
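If you want file sizes without paying scan()'s serial stat cost for every entry, one workaround (a sketch of my own, not part of dscan's API) is to consume scan_entries output and stat only the entries you actually keep. The file_sizes helper below is hypothetical; it works on any iterable of os.DirEntry-style objects:

```python
import os

def file_sizes(entries):
    # yield (path, size) for regular files only; DirEntry.stat() results
    # come bundled with the listing on Windows, and cost one extra
    # syscall per entry on Linux/macOS
    for entry in entries:
        if entry.is_file():
            yield entry.path, entry.stat().st_size

# usage with dscan's raw mode (API as documented above):
# from dscan import scan_entries
# for path, size in file_sizes(scan_entries("/mnt/nas")):
#     print(path, size)
```

Because the generator is lazy, entries filtered out upstream never get stat'd at all.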

Simulated network latency (5ms per directory)

# rough simulation
import time, os
_real = os.scandir
os.scandir = lambda p: (time.sleep(0.005), _real(p))[1]
                      time
os.walk               ~linear with directory count
dscan.scan_entries    scales with max_workers

At 5ms latency per directory, a tree with 10,000 directories takes ~50s serially. With 16 workers dscan brings that to ~4s. The deeper and wider the tree, the bigger the difference.
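The arithmetic above can be sketched directly (the ~4s figure includes scheduling overhead beyond the ideal division):

```python
# back-of-envelope: serial vs concurrent traversal at 5 ms per directory
dirs = 10_000
latency_s = 0.005
serial_s = dirs * latency_s        # ≈ 50 s, one directory at a time
workers = 16
ideal_s = serial_s / workers       # ≈ 3.1 s ideal; ~4 s in practice
```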


Installation

pip install dscan

Requires Python 3.12+. No other dependencies.


Usage

Basic scan

from dscan import scan

for entry in scan("."):
    print(f"{entry.name} - {entry.path}")

Raw entries (lower overhead)

import os
from dscan import scan_entries

# expand ~ explicitly; os.scandir does not do it for you
for entry in scan_entries(os.path.expanduser("~/Documents"), max_depth=2):
    if entry.is_file():
        print(entry.name)

Filtering

Extensions

# Only Python and Markdown files
for file in scan(".", extensions={".py", ".md"}):
    print(file.path)

# Skip compiled files
for file in scan(".", ignore_extensions={".bin", ".exe"}):
    print(file.path)

Glob patterns

# Only test files
for entry in scan(".", match="test_*"):
    print(entry.name)

# Skip hidden files and directories
for entry in scan(".", ignore_pattern=".*"):
    print(entry.name)

Directory traversal

# Immediate children only
for entry in scan(".", max_depth=0):
    print(entry.name)

# Only descend into src/ and lib/
for entry in scan(".", only_dirs=["src", "lib"]):
    print(entry.path)

# Skip specific directories
# .git, .idea, .venv, __pycache__ are skipped by default
for entry in scan(".", ignore_dirs=["node_modules", "dist"]):
    print(entry.path)

# Disable all default ignores
for entry in scan(".", ignore_dirs=[]):
    print(entry.path)

Custom filter

def is_large_file(entry):
    return entry.is_file() and entry.stat().st_size > 1_000_000

for entry in scan(".", custom_filter=is_large_file):
    print(entry.name)

Tuning workers

# default is min(32, cpu_count * 2)
# increase on high-latency mounts
for entry in scan_entries("/mnt/nas", max_workers=32):
    print(entry.path)
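The stated default can be reproduced with the stdlib (the `or 1` fallback for platforms where cpu_count() returns None is my assumption, not confirmed from dscan's source):

```python
import os

# documented default: min(32, cpu_count * 2)
default_workers = min(32, (os.cpu_count() or 1) * 2)
```

On a high-latency mount, workers spend most of their time blocked on the network, so raising max_workers well past cpu_count is reasonable.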

Data Models

scan() returns FileEntry or DirectoryEntry dataclasses.

FileEntry

field         description
name          filename without extension
extension     lowercase extension, no leading dot
path          full path
dir_path      containing directory
size          size in bytes
created_at    datetime
modified_at   datetime

DirectoryEntry

field         description
name          directory name
path          full path
parent_path   parent directory
created_at    datetime
modified_at   datetime
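As an illustration of consuming these models, here is a hypothetical helper (my own, not part of dscan) that sums sizes per extension using the FileEntry field names from the table above:

```python
from collections import defaultdict

def size_by_extension(files):
    # aggregate FileEntry.size keyed by FileEntry.extension
    # (extension is lowercase with no leading dot, per the field table)
    totals = defaultdict(int)
    for f in files:
        totals[f.extension] += f.size
    return dict(totals)

# usage (assumes scan() yields FileEntry for files, DirectoryEntry for dirs):
# from dscan import scan
# print(size_by_extension(e for e in scan(".") if hasattr(e, "size")))
```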

vs the stdlib

                       os.walk   pathlib.rglob   dscan
Concurrent traversal   No        No              Yes
Built-in models        No        No              Yes
Depth limit            Manual    No              Yes
Directory exclusions   Manual    No              Yes
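For a sense of what the "Manual" cells mean, this is roughly the os.walk bookkeeping that dscan's max_depth and ignore_dirs keywords replace (a sketch; the default ignore set mirrors the one documented above):

```python
import os

def walk_limited(root, max_depth,
                 ignore_dirs=frozenset({".git", ".idea", ".venv", "__pycache__"})):
    # os.walk with hand-rolled depth limiting and directory exclusion
    root = root.rstrip(os.sep)
    base = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        # pruning dirnames in place stops os.walk from descending
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        if dirpath.count(os.sep) - base >= max_depth:
            dirnames[:] = []  # depth limit reached: go no deeper
        yield dirpath, filenames
```

It works, but every caller re-implements the same pruning logic; dscan folds it into keyword arguments.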

Roadmap

  • Move stat into workers — on Linux/macOS over NFS or high-latency mounts, stat is a separate network round-trip per entry, just like scandir. Running stat inside the worker threads would let latency overlap across concurrent workers, significantly improving scan() model performance on those platforms.
  • getattrlistbulk support (macOS) — macOS exposes a syscall that returns full file attributes (including size and timestamps) for all entries in a single directory call, equivalent to what Windows gets from FindNextFile. Implementing this would bring scan() performance on local macOS disk in line with Windows, and close the current gap between scan() and scan_entries() shown in the benchmarks above.

License

MIT
