
High-performance native C recursive file scanner: multi-threaded, terabyte-scale, with CSV/JSON/Tree export, duplicate detection, largest-N report, and regex filtering.

Project description

Anscom

High-performance native C recursive file scanner for Python. v1.5.0

MIT Licensed

Multi-threaded · Terabyte-scale · Zero dependencies · Cross-platform

pip install anscom

What it is

Anscom is a Python C extension that scans directories at raw OS speed. It uses direct kernel syscalls (getdents64 on Linux, FindFirstFileW on Windows, readdir/lstat on macOS), a multi-threaded work queue, and per-thread statistics accumulation. It never loads file contents into memory. It never follows symlinks. It never slows down as the filesystem grows.

The result is always a plain Python dict — five keys minimum, more when you ask for them.

import anscom

result = anscom.scan("/mnt/storage")
# → {'total_files': 2841903, 'scan_errors': 0, 'duration_seconds': 1.87,
#    'categories': {...}, 'extensions': {...}}

2.8 million files. 1.87 seconds. 16 threads. No configuration.


What's New in v1.5.0

v1.5.0 is a major feature release — the largest single update since the initial release. Every existing parameter, behavior, and output format from v1.3.0 is fully preserved.

Feature Parameter Description
File list return return_files=True Returns every scanned file as a list of dicts with path, size, ext, category, mtime
CSV export export_csv="out.csv" Writes per-file data to a UTF-8 CSV — zero dependencies
Largest-N report largest_n=20 Top N files by size via per-thread min-heap — O(log N) per file, no extra pass
Duplicate detection find_duplicates=True Groups files by size then CRC32 of first 4KB — returns grouped path lists
Regex filter regex_filter="pattern" Only counts files whose full path matches the pattern. Uses POSIX regexec on Linux/macOS (no GIL); Python re fallback on Windows

Performance note: All five features are strictly opt-in. A plain anscom.scan(".") with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.

Migration from v1.3.0

No breaking changes. All v1.3.0 code runs unchanged on v1.5.0. The new parameters all default to off.

# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)

# v1.5.0 — opt into new features as needed
result = anscom.scan(
    "/data",
    silent          = True,
    ignore_junk     = True,
    return_files    = True,   # new
    largest_n       = 20,     # new
    find_duplicates = True,   # new
    export_csv      = "inventory.csv",  # new
)

Installation

pip install anscom

Requires Python 3.6+. Works on Linux, macOS, and Windows.

Windows source builds require the "Desktop development with C++" workload from Visual Studio Build Tools.

No runtime dependencies. Every feature in v1.5.0 works with nothing else installed.

Verify

import anscom
r = anscom.scan(".", silent=True)
print(r["total_files"], "files —", round(r["duration_seconds"], 3), "s")

Quick Start

import anscom

# Default scan — prints live counter + full report
anscom.scan(".")

# Silent scan — just get the dict
result = anscom.scan(".", silent=True)

# Scan a specific path with more depth
result = anscom.scan("/home/user/projects", max_depth=20, silent=True)

# Print the category breakdown
for cat, count in result["categories"].items():
    if count > 0:
        print(f"{cat:20s} {count:>10,}")

Full API Reference

anscom.scan(
    path,                    # str      — required
    max_depth    = 6,        # int
    show_tree    = False,    # bool
    workers      = 0,        # int
    min_size     = 0,        # int
    extensions   = None,     # list[str] | None
    callback     = None,     # callable | None
    silent       = False,    # bool
    ignore_junk  = False,    # bool
    export_json  = None,     # str | None
    export_tree  = None,     # str | None
    return_files = False,    # bool       ← new in v1.5.0
    export_csv   = None,     # str | None ← new in v1.5.0
    largest_n    = 0,        # int        ← new in v1.5.0
    find_duplicates = False, # bool       ← new in v1.5.0
    regex_filter = None,     # str | None ← new in v1.5.0
) -> dict

Return Value

The return value is always a dict. Five keys are always present. Three are added on demand.

Key Type Always? Description
total_files int Files that passed all filters and were categorized
scan_errors int Paths that failed to open (permissions, broken links)
duration_seconds float Wall-clock time from first thread spawn to last join
categories dict[str, int] All 9 categories, always present even if zero
extensions dict[str, int] Only non-zero extension counts
files list[dict] return_files=True Per-file records: path, size, ext, category, mtime
largest_files list[dict] largest_n > 0 Top-N files by size: path, size
duplicates list[list[str]] find_duplicates=True Groups of paths sharing identical content (size + CRC32)

The nine category keys inside result["categories"]:

"Code/Source"    "Documents"      "Images"         "Videos"
"Audio"          "Archives"       "Executables"    "System/Config"
"Other/Unknown"

All Parameters in Depth

path

Type: str · Required

The root directory to scan. Accepts relative paths (., ../data), absolute paths (/mnt/storage, C:\Users), or an empty string (treated as .).

anscom.scan(".")
anscom.scan("/mnt/nas")
anscom.scan("C:\\Users\\Aditya\\Documents")
anscom.scan("")  # same as "."

max_depth

Type: int · Default: 6 · Range: [0, 64]

Maximum directory recursion depth. Depth 0 means only the immediate children of path are examined — no subdirectories are entered. Depth 64 is the hard ceiling enforced in C.

# Only the top level — no recursion
anscom.scan("/data", max_depth=0, silent=True)

# Standard project scan
anscom.scan("/project", max_depth=6, silent=True)

# Deep NAS or archive scan
anscom.scan("/mnt/archive", max_depth=30, silent=True)

# Maximum depth — unlimited for practical purposes
anscom.scan("/", max_depth=64, silent=True)

Values below 0 are clamped to 0. Values above 64 are clamped to 64.
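
Because raising max_depth only adds files, a quick sweep shows how much of a tree each depth level covers. An illustrative sketch, with /data as a placeholder path:

import anscom

# Compare coverage at increasing recursion depths.
for depth in (0, 2, 4, 6, 10):
    r = anscom.scan("/data", max_depth=depth, silent=True)
    print(f"max_depth={depth:2d}  files={r['total_files']:>10,}")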


workers

Type: int · Default: 0

Number of worker threads. 0 auto-detects the hardware CPU count via sysconf(_SC_NPROCESSORS_ONLN) on Linux/macOS and GetSystemInfo() on Windows. If auto-detection fails, falls back to 4.

When show_tree=True, workers is forced to 1 regardless of what is passed — multiple threads writing to stdout would produce interleaved output.

# Auto (recommended for most cases)
anscom.scan("/data", workers=0)

# Pin to a specific count
anscom.scan("/data", workers=8)

# Maximum parallelism on a 64-core machine
anscom.scan("/data", workers=64)

At shallow depths the work queue feeds all threads efficiently. At depth >= 3 each thread recurses inline, so thread count has diminishing returns past ~16 for typical filesystems unless the tree is extremely wide.
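
To pick a worker count for a specific volume, a rough timing sweep is usually enough. An illustrative sketch (the path is a placeholder; the OS page cache warms up after the first pass, so run the sweep twice for stable numbers):

import anscom

# Time the same scan at several thread counts.
for n in (1, 4, 8, 16, 32):
    r = anscom.scan("/mnt/storage", workers=n, silent=True)
    print(f"workers={n:2d}  {r['total_files']:>10,} files  {r['duration_seconds']:6.2f}s")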


min_size

Type: int · Default: 0 (no filter)

Skip all files smaller than this many bytes. Files below the threshold are not counted, not categorized, and not included in return_files or export_csv output.

# Only files larger than 1 MB
anscom.scan("/data", min_size=1024 * 1024, silent=True)

# Only files larger than 100 MB
anscom.scan("/mnt/video", min_size=100 * 1024 * 1024, silent=True)

# Only files larger than 1 GB
anscom.scan("/mnt/backup", min_size=1024 ** 3, silent=True)

On Linux, fstatat() is called to retrieve file size only when this filter is active. On Windows, the size is available directly in WIN32_FIND_DATAW at no extra syscall cost.
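
Combining min_size with return_files gives a quick inventory of only the large files. A sketch with a placeholder path and threshold:

import anscom

result = anscom.scan(
    "/mnt/storage",
    min_size     = 100 * 1024 * 1024,   # keep only files >= 100 MB
    return_files = True,
    silent       = True,
)
total_gb = sum(f["size"] for f in result["files"]) / 1024 ** 3
print(f"{result['total_files']:,} files >= 100 MB, {total_gb:.1f} GB total")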


extensions

Type: list[str] | None · Default: None

Extension whitelist. When set, only files whose extension matches one of the listed strings are counted. All other files are silently skipped — they do not appear in counts, categories, files, export_csv, or any other output.

Pass extensions without the leading dot, lowercase.

# Count only Python files
result = anscom.scan("/repo", extensions=["py"], silent=True)

# Count only web code
result = anscom.scan("/project", extensions=["js", "ts", "jsx", "tsx", "css", "html"])

# Count only media
result = anscom.scan("/media", extensions=["mp4", "mkv", "mov", "avi", "mp3", "flac"])

# Count only documents
result = anscom.scan("/docs", extensions=["pdf", "docx", "xlsx", "pptx", "md", "txt"])

Unknown extensions (not in the built-in table) are also excluded when a whitelist is active.


ignore_junk

Type: bool · Default: False

When True, the following directories are skipped entirely — no opendir, no syscall, no recursion. The check is a case-insensitive match on the directory basename, at any depth, under any parent.

Skipped directories:

Category Directories
Version control .git .svn .hg
IDE metadata .idea .vscode
Dependency trees node_modules bower_components site-packages .venv venv env
Build output build dist target __pycache__
Cache / temp temp tmp .cache .pytest_cache .mypy_cache

# Measure dependency bloat
raw   = anscom.scan("/project", ignore_junk=False, silent=True)
clean = anscom.scan("/project", ignore_junk=True,  silent=True)
bloat = raw["total_files"] - clean["total_files"]
print(f"Dependency files: {bloat:,}")

# Fast production audit — skip all junk
result = anscom.scan("/codebase", ignore_junk=True, workers=32, silent=True)

The default is False — Anscom counts everything unless you opt in to exclusions.


silent

Type: bool · Default: False

When False (default), Anscom prints:

  • A live "Scanned files: N ..." counter that updates every 250ms
  • The full summary report and extension breakdown on completion

When True, all of that is suppressed. The returned dict is always identical regardless of this flag.

silent=True does not suppress tree output from show_tree=True — those are separate.

# For scripting — no output, just the data
result = anscom.scan("/data", silent=True)

# For interactive use — full live output
anscom.scan("/data")

show_tree

Type: bool · Default: False

When True, prints a DFS-ordered directory tree to sys.stdout as each entry is discovered. Forces workers=1 to guarantee correct ordering.

  |-- [src]
  |   |   |-- main.py
  |   |   |-- utils.py
  |   |   |-- [tests]
  |   |   |   |   |-- test_main.py
  |-- [docs]
  |   |   |-- readme.md
  |-- config.json
  • Square brackets [name] indicate a directory
  • No brackets indicates a regular file
  • Each depth level adds " | " (6 characters) of indentation

Output is produced one line at a time via PySys_WriteStdout. Any sys.stdout redirect in Python will capture every line. There is no internal buffer — a 50 million file filesystem produces 50+ million lines without accumulating memory.

# Print tree to terminal
anscom.scan(".", show_tree=True, max_depth=4)

# Capture tree in Python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan(".", show_tree=True, max_depth=3, silent=True)
sys.stdout = sys.__stdout__
tree_text = buf.getvalue()

# Save tree to file (see also export_tree)
anscom.scan("/data", show_tree=True, silent=True, export_tree="tree.txt")

callback

Type: callable | None · Default: None

A Python callable invoked approximately every 1 second with the current scanned file count as a single int argument. Fired by the progress thread every 4th tick (250ms × 4 = 1000ms).

The GIL is acquired before each call and released immediately after. Scan worker threads are never blocked by callback invocation.

def on_progress(n):
    print(f"\rScanned: {n:,}", end="", flush=True)

result = anscom.scan("/data", callback=on_progress, silent=True)
print()

# Push to Prometheus
from prometheus_client import Gauge
g = Gauge("files_scanned", "Current file scan count")
anscom.scan("/data", callback=lambda n: g.set(n), silent=True)

export_json

Type: str | None · Default: None

Path to write the full result dict as a formatted JSON file. Uses Python's built-in json module — no external dependencies. Written with 4-space indentation after the scan completes.

The JSON file contains all keys that are in the returned dict, including optional keys (files, largest_files, duplicates) when those features are enabled in the same call.

anscom.scan("/data", export_json="report.json", silent=True)

# With optional features — JSON gets those keys too
anscom.scan(
    "/data",
    export_json     = "report.json",
    return_files    = True,
    largest_n       = 10,
    find_duplicates = True,
    silent          = True
)

Example output:

{
    "total_files": 21008,
    "scan_errors": 0,
    "duration_seconds": 1.5186,
    "categories": {
        "Code/Source": 5955,
        "Documents": 203,
        "Images": 151,
        "Videos": 0,
        "Audio": 730,
        "Archives": 0,
        "Executables": 0,
        "System/Config": 5707,
        "Other/Unknown": 8992
    },
    "extensions": {
        "py": 5955,
        "pyc": 5707,
        "mp3": 730,
        "txt": 160,
        "png": 151
    }
}
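
Because the exported JSON is just the result dict, two reports taken at different times can be diffed with the standard library. An illustrative sketch, with placeholder file names:

import json

# Load two snapshots exported with export_json on different days.
with open("report_old.json") as f:
    old = json.load(f)
with open("report_new.json") as f:
    new = json.load(f)

print("Total files:", old["total_files"], "->", new["total_files"])
for cat, count in new["categories"].items():
    delta = count - old["categories"].get(cat, 0)
    if delta:
        print(f"{cat:20s} {delta:+,}")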

export_tree

Type: str | None · Default: None

Path to write the tree output to a text file. Only active when show_tree=True.

The file is written incrementally — each line is written and flushed as it is produced. For a filesystem with 50 million entries this produces a multi-gigabyte file without accumulating any output in memory. stdout and the file both receive every line simultaneously.

anscom.scan(
    "/mnt/storage",
    show_tree   = True,
    export_tree = "filesystem_tree.txt",
    silent      = True,
    max_depth   = 64
)

export_csv

Type: str | None · Default: None · New in v1.5.0

Path to write a per-file inventory as a UTF-8 CSV. Columns: path, size, ext, category, mtime.

  • path: full absolute path, RFC 4180-quoted (double-quoted, inner quotes doubled)
  • size: file size in bytes as an integer
  • ext: lowercase extension without the dot (empty string for unrecognized extensions)
  • category: one of the 9 category names
  • mtime: Unix timestamp (seconds since epoch) of last modification

anscom.scan("/data", export_csv="inventory.csv", silent=True)

Loading the CSV downstream:

# With pandas
import pandas as pd
df = pd.read_csv("inventory.csv")
print(df.groupby("category")["size"].sum().sort_values(ascending=False))

# Convert to Excel
df.to_excel("report.xlsx", index=False)

# Standard library only
import csv
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["path"], row["size"])

# With openpyxl directly
import csv, openpyxl
wb = openpyxl.Workbook()
ws = wb.active
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save("report.xlsx")
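
The mtime column is a Unix timestamp in seconds; when loading the CSV with pandas it can be converted to a datetime for readable reports (a small sketch, assuming the inventory.csv written above):

import pandas as pd

df = pd.read_csv("inventory.csv")
df["mtime"] = pd.to_datetime(df["mtime"], unit="s")          # seconds since epoch -> datetime
print(df.sort_values("mtime").head(10)[["path", "mtime"]])   # the ten oldest files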

return_files

Type: bool · Default: False · New in v1.5.0

When True, the result dict gains a "files" key containing a Python list of dicts — one entry per scanned file.

Each dict has five fields:

Field Type Description
path str Full absolute path to the file
size int File size in bytes
ext str Lowercase extension (no dot), empty string if unrecognized
category str One of the 9 category names
mtime int Unix timestamp of last modification

result = anscom.scan("/project", return_files=True, silent=True)

# Iterate
for f in result["files"]:
    print(f["path"], f["size"], f["category"])

# Filter in Python
large_code = [
    f for f in result["files"]
    if f["category"] == "Code/Source" and f["size"] > 50_000
]

# Sort by size descending
by_size = sorted(result["files"], key=lambda f: f["size"], reverse=True)
print("Largest file:", by_size[0]["path"])

# Group by extension
from collections import defaultdict
by_ext = defaultdict(list)
for f in result["files"]:
    by_ext[f["ext"]].append(f)

len(result["files"]) == result["total_files"] is always true.
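
Since every record carries mtime, the file list also works for age-based housekeeping. An illustrative sketch flagging files untouched for more than a year (path and cutoff are placeholders):

import time
import anscom

cutoff = time.time() - 365 * 24 * 3600   # one year ago, in Unix seconds

result = anscom.scan("/mnt/archive", return_files=True, silent=True)
stale = [f for f in result["files"] if f["mtime"] < cutoff]
stale_gb = sum(f["size"] for f in stale) / 1024 ** 3
print(f"{len(stale):,} stale files, {stale_gb:.1f} GB")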


largest_n

Type: int · Default: 0 (disabled) · New in v1.5.0

When > 0, finds the top N files by size across the entire scanned filesystem. Uses a per-thread min-heap of capacity N — O(log N) per file, no extra pass, no sorting of the full file list. After all threads join, per-thread heaps are merged and sorted descending.

The result dict gains a "largest_files" key where each entry is a dict with path (str) and size (int).

result = anscom.scan("/mnt/storage", largest_n=20, silent=True)

for f in result["largest_files"]:
    gb = f["size"] / (1024 ** 3)
    print(f"{gb:8.2f} GB  {f['path']}")

The printed report also gains a section:

=== TOP 20 LARGEST FILES ===========================
  1073741824 bytes : /data/backup/archive.tar.gz
   536870912 bytes : /data/media/4k_reel.mkv
...
===================================================

# Find the single largest file
result = anscom.scan("/mnt/nas", largest_n=1, silent=True)
top = result["largest_files"][0]
print(f"Largest: {top['path']} ({top['size']:,} bytes)")

# Top 100 across a petabyte volume
result = anscom.scan("/mnt/petabyte", largest_n=100, workers=64, silent=True)

find_duplicates

Type: bool · Default: False · New in v1.5.0

When True, detects duplicate files using a two-phase algorithm:

  1. Size bucketing — all files sorted by size. Files with a unique size are skipped entirely — zero I/O.
  2. CRC32 fingerprinting — for each same-size group (≥2 files, non-zero size), the first 4096 bytes of each file are read and CRC32 is computed. Files with matching CRC32 are reported as duplicates.

The result dict gains a "duplicates" key: a list of groups, each group being a list of path strings. Every group has at least 2 members.

result = anscom.scan("/media-library", find_duplicates=True, silent=True)

print(f"Duplicate groups: {len(result['duplicates'])}")

for group in result["duplicates"]:
    print(f"\nDuplicate set ({len(group)} files):")
    for path in group:
        print(f"  {path}")

Calculating reclaimable space (combine with return_files=True):

result = anscom.scan(
    "/mnt/archive",
    find_duplicates = True,
    return_files    = True,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}

wasted = sum(
    sum(size_map.get(p, 0) for p in group[1:])   # keep 1, discard rest
    for group in result["duplicates"]
)

print(f"Reclaimable: {wasted / (1024**3):.2f} GB across {len(result['duplicates'])} groups")

The printed report adds:

=== DUPLICATES SUMMARY ============================
Groups found : 142
===================================================
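
The CRC32-of-first-4KB fingerprint is fast but not proof of byte-for-byte identity, so before deleting anything you may want to confirm each group with a full-content hash. A minimal sketch using only the standard library (this re-reads every candidate file in Python, so it is far slower than the scan itself):

import hashlib
import anscom

result = anscom.scan("/media-library", find_duplicates=True, silent=True)

def sha256_of(path, bufsize=1 << 20):
    # Full-content SHA-256, read in 1 MB chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

confirmed = []
for group in result["duplicates"]:
    by_hash = {}
    for path in group:
        try:
            by_hash.setdefault(sha256_of(path), []).append(path)
        except OSError:
            pass   # file vanished or became unreadable since the scan
    confirmed.extend(g for g in by_hash.values() if len(g) > 1)

print(f"{len(confirmed)} groups confirmed identical by full SHA-256")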

regex_filter

Type: str | None · Default: None · New in v1.5.0

A regular expression pattern. When set, only files whose full absolute path matches the pattern are counted, categorized, and included in any file-tracking output (return_files, export_csv, find_duplicates, largest_n).

Platform behavior:

  • Linux / macOS: Compiled with POSIX regcomp(REG_EXTENDED | REG_NOSUB), matched with regexec (no GIL acquisition), runs fully in C inside the worker threads.
  • Windows: Falls back to Python's re module (GIL acquired per file). For large scans on Windows, prefer extensions whitelist filtering which has zero GIL cost.

The pattern is also compiled with Python's re.compile before the scan starts. An invalid pattern raises ValueError immediately.

# Only .py files anywhere under a tests/ directory
result = anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)

# Only files in directories named 'src'
result = anscom.scan("/project", regex_filter=r"/src/", silent=True)

# Only Python test files
result = anscom.scan("/repo", regex_filter=r"test_.*\.py$", silent=True)
print(f"Test files: {result['total_files']}")

# Invalid patterns raise ValueError immediately — no scan is started
try:
    anscom.scan("/data", regex_filter=r"[invalid(")
except ValueError as e:
    print(e)  # Failed to compile regex_filter.

Export Features

All export parameters are independent and combinable. A single scan pass can write to all simultaneously — one traversal, multiple outputs, no re-scanning.

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)
# One scan. Four output files. Full in-memory results.

Parameter Format Dependencies Notes
export_json JSON None (built-in) Full result dict including optional keys
export_csv CSV None (built-in) Per-file: path, size, ext, category, mtime
export_tree Plain text Requires show_tree=True Written line-by-line, safe at any scale

Tree Mode

# Basic tree to terminal
anscom.scan(".", show_tree=True)

# Tree saved to file
anscom.scan("/project", show_tree=True, export_tree="tree.txt", silent=True)

# Deep tree, no terminal output
import sys, io
sys.stdout = io.StringIO()
anscom.scan("/mnt/volume", show_tree=True, export_tree="tree.txt", max_depth=64)
sys.stdout = sys.__stdout__

Output format:

  |-- [src]            ← [brackets] = directory
  |   |   |-- main.py  ← no brackets = regular file
  |   |   |-- [lib]
  |   |   |   |   |-- utils.py
  |-- config.json
  |-- [tests]
  |   |   |-- test_core.py
  • One " | " block per depth level (6 chars each)
  • At depth 64: 384 characters of indentation — all structurally valid
  • DFS order is strict: every file inside a directory appears before that directory's sibling
  • workers is forced to 1 — required for correct ordering
  • No internal buffer — safe at 50+ million entries

Exclusion Filter

ignore_junk=True skips these directory names at any depth, at any nesting level:

.git         .svn         .hg          .idea        .vscode
node_modules bower_components site-packages .venv   venv
env          build         dist          target      __pycache__
temp         tmp           .cache        .pytest_cache .mypy_cache

The check is case-insensitive basename comparison — not a path substring match. A node_modules at /project/frontend/node_modules/ is caught regardless of nesting depth.
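
For reference, a rough Python equivalent of that check. The real test runs in C before the directory is ever opened, so this is purely illustrative:

import os

JUNK = {
    ".git", ".svn", ".hg", ".idea", ".vscode",
    "node_modules", "bower_components", "site-packages", ".venv", "venv",
    "env", "build", "dist", "target", "__pycache__",
    "temp", "tmp", ".cache", ".pytest_cache", ".mypy_cache",
}

def is_junk_dir(path):
    # Case-insensitive match on the basename only, never a substring test.
    return os.path.basename(path.rstrip("/\\")).lower() in JUNK

print(is_junk_dir("/project/frontend/node_modules"))    # True
print(is_junk_dir("/project/my_node_modules_backup"))   # False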


Report Format

Printed to sys.stdout when silent=False (the default).

Anscom Enterprise v1.5.0 (Threads: 16)
Target: /data

Scanned files: 21008 ...

=== SUMMARY REPORT ================================
+-----------------+--------------+----------+
| Category        | Count        | Percent  |
+-----------------+--------------+----------+
| Code/Source     |         5955 |   28.34% |
| System/Config   |         5707 |   27.16% |
| Other/Unknown   |         8992 |   42.81% |
| Documents       |          203 |    0.97% |
| Images          |          151 |    0.72% |
+-----------------+--------------+----------+
| TOTAL FILES     |        21008 |  100.00% |
+-----------------+--------------+----------+

=== DETAILED EXTENSION BREAKDOWN ==================
+-----------------+--------------+
| Extension       | Count        |
+-----------------+--------------+
| .py             |         5955 |
| .pyc            |         5707 |
| .mp3            |          730 |
| .txt            |          160 |
| .png            |          151 |
+-----------------+--------------+

Time     : 1.5186 seconds
Errors   : 0 (permission denied / inaccessible)
===================================================

=== TOP 20 LARGEST FILES ===========================   ← only with largest_n > 0
  1073741824 bytes : /data/backup/full.tar.gz
...

=== DUPLICATES SUMMARY ============================   ← only with find_duplicates=True
Groups found : 142
===================================================

Capture programmatically:

import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan("/data")
sys.stdout = sys.__stdout__
report_text = buf.getvalue()

File Categories and Extensions

170+ extensions across 9 categories. The table is sorted lexicographically and validated at module init — if the sort invariant is violated, import anscom raises RuntimeError.

Category Sample Extensions
Code/Source c cpp cs go h html java js json jsx kt lua php py r rb rs sh sql swift ts vue xml yaml yml
Documents csv doc docx epub md mobi odp ods odt pdf ppt pptx rst rtf txt xls xlsx
Images ai avif bmp gif heic ico jpeg jpg png psd raw svg tiff webp
Videos avi flv mkv mov mp4 mpeg ogv webm wmv
Audio aac flac m4a mid mp3 ogg wav wma
Archives 7z bz2 deb dmg gz iso jar rar tar tgz zip
Executables app bin class dll elf exe msi pyd so
System/Config bak cfg conf db env gitignore ini log pyc reg sys tmp ttf woff
Other/Unknown Any extension not in the above table

Architecture

OS backends

Three separate scanning implementations compiled and selected at build time:

Platform Backend Mechanism
Linux getdents64 Direct syscall(SYS_getdents64, dirfd, buf, 131072) — raw kernel ABI, 128KB read buffer, d_type for zero-stat type detection
Windows FindFirstFileW Wide-char wchar_t paths, UTF-16→UTF-8 conversion, size+mtime from WIN32_FIND_DATAW at no extra syscall cost
macOS / BSD POSIX readdir opendir/readdir with lstat for type resolution

Thread model

main thread
  ├── spawn N worker threads (all waiting on cond var)
  ├── spawn 1 progress thread
  ├── push root path to queue
  ├── wait until queue.count == 0 && active_workers == 0
  └── join all threads → merge stats

worker thread (×N)
  └── loop: queue_pop → process_dir_recursive → queue_task_done

process_dir_recursive
  ├── depth < 3: push subdirs to queue (parallel pickup by idle threads)
  └── depth ≥ 3: recurse inline (avoids queue overhead for deep narrow trees)

Per-thread stats — zero locks during counting

Each thread has its own ScanStats struct with ext_counts[170+] and cat_counts[9]. No lock is acquired during file categorization. The only shared atomic write per file is a single __sync_fetch_and_add for the progress counter. Stats are merged in one serial pass after all threads join.

Slab path allocator

Each thread allocates (max_depth + 2) * PATH_MAX bytes once before scanning. Path strings during traversal are written into slab[depth * PATH_MAX] via snprintf. Zero heap allocation during traversal.

Extension hash table

512-slot open-addressing hash table with FNV-1a hash and linear probing. Built once at module init from the sorted extension table. O(1) average lookup, no heap allocation, never modified after init.

FileArray pre-allocation

When return_files, export_csv, or find_duplicates is enabled, each thread pre-allocates a FileInfo array of 65,536 entries before scanning begins. Growth beyond that doubles via realloc. For typical filesystems: zero reallocations during the scan.

Min-heap for largest_n

Each thread maintains a min-heap of capacity N. Per-file cost: O(log N) comparison, no lock. Thread heaps merged globally after join using the same push logic.
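
In Python terms the logic is roughly what heapq provides: keep a size-N min-heap whose root is the smallest survivor, then merge the per-thread heaps and sort. An illustrative sketch, not the actual C code:

import heapq

def push_top_n(heap, n, size, path):
    # O(log N) per file: the heap root is always the smallest of the current top N.
    if len(heap) < n:
        heapq.heappush(heap, (size, path))
    elif size > heap[0][0]:
        heapq.heapreplace(heap, (size, path))

def merge_heaps(per_thread_heaps, n):
    # After all threads join: merge with the same push logic, then sort descending.
    merged = []
    for heap in per_thread_heaps:
        for size, path in heap:
            push_top_n(merged, n, size, path)
    return sorted(merged, reverse=True)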

Two-phase duplicate detection

  1. qsort all files by size — O(M log M), no I/O
  2. For each same-size group ≥2 members: read first 4KB of each, compute CRC32, sort by CRC32, group consecutive matches

Zero I/O for unique-size files. One bounded read per candidate. CRC32 computed using a fully inlined lookup table — no external library.
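
An illustrative Python equivalent of the same two-phase logic, using zlib.crc32 on the first 4096 bytes (the real implementation does this in C with its own inlined CRC32 table):

import zlib
from collections import defaultdict

def find_duplicate_groups(files):
    """files: iterable of (path, size) pairs; returns lists of probable duplicates."""
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)

    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue                          # unique or empty size bucket: zero I/O
        by_crc = defaultdict(list)
        for path in paths:
            try:
                with open(path, "rb") as f:
                    by_crc[zlib.crc32(f.read(4096))].append(path)
            except OSError:
                continue                      # unreadable candidate: skip it
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups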


Security and Compliance

Property Guarantee
No file contents read Only directory entries and metadata. Exception: find_duplicates=True reads up to 4KB per candidate — bounded, opt-in, read-only
Symlinks never followed Linux: fstatat(AT_SYMLINK_NOFOLLOW). POSIX: lstat. Windows: FILE_ATTRIBUTE_REPARSE_POINT skipped unconditionally
Depth hard-capped at 64 Enforced in C at the top of every process_dir_recursive call — cannot be bypassed by filesystem topology
All path assembly bounded snprintf(slab, PATH_MAX, ...) — always null-terminated, always within PATH_MAX bytes
Errors counted, not silenced Every failed opendir/open/FindFirstFileW increments scan_errors and continues — the final count is exact
Work queue bounded 131,072 fixed slots. Overflow falls back to inline recursion — no unbounded allocation
Hash table immutable after init Built once at module load. No runtime modification
Zero external dependencies No mandatory third-party packages — no supply chain surface

Enterprise Recipes

Storage cost allocation

import anscom

result = anscom.scan("/mnt/nas", workers=16, ignore_junk=True, silent=True)
total = result["total_files"]
cats  = result["categories"]

media = cats["Videos"] + cats["Images"] + cats["Audio"]
code  = cats["Code/Source"]
docs  = cats["Documents"]

print(f"Media   : {media:>10,}  ({media/total*100:5.1f}%)")
print(f"Code    : {code:>10,}  ({code/total*100:5.1f}%)")
print(f"Docs    : {docs:>10,}  ({docs/total*100:5.1f}%)")
print(f"Total   : {total:>10,}  in {result['duration_seconds']:.2f}s")

Pre-migration audit

import anscom

result = anscom.scan(
    "/legacy-server/data",
    max_depth    = 30,
    silent       = True,
    return_files = True,
    export_json  = "audit.json",
    export_csv   = "inventory.csv"
)

print(f"Recorded {result['total_files']:,} files")
print(f"Errors  : {result['scan_errors']}")

CI/CD policy gate

import anscom, sys

result = anscom.scan("./repo", silent=True, ignore_junk=True)

violations = []
if result["categories"]["Executables"] > 0:
    violations.append(f"{result['categories']['Executables']} executable files")
if result["categories"]["Videos"] > 0:
    violations.append(f"{result['categories']['Videos']} video files")

if violations:
    for v in violations:
        print(f"POLICY VIOLATION: {v}")
    sys.exit(1)

print("File composition check passed.")

Storage reclamation

import anscom

result = anscom.scan(
    "/mnt/media-archive",
    find_duplicates = True,
    return_files    = True,
    workers         = 16,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}
wasted   = sum(
    sum(size_map.get(p, 0) for p in group[1:])
    for group in result["duplicates"]
)

print(f"Duplicate groups : {len(result['duplicates'])}")
print(f"Reclaimable      : {wasted / (1024**3):.2f} GB")

groups_by_waste = sorted(
    result["duplicates"],
    key=lambda g: sum(size_map.get(p, 0) for p in g[1:]),
    reverse=True
)
for group in groups_by_waste[:5]:
    waste = sum(size_map.get(p, 0) for p in group[1:])
    print(f"\n  {waste / (1024**2):.1f} MB wasted:")
    for path in group:
        print(f"    {path}")

Top-100 largest files

import anscom

result = anscom.scan("/mnt/storage", largest_n=100, workers=32, silent=True)

total_gb = sum(f["size"] for f in result["largest_files"]) / (1024**3)
print(f"Top 100 total: {total_gb:.1f} GB\n")

for i, f in enumerate(result["largest_files"][:10], 1):
    print(f"{i:3}. {f['size']/1024**3:8.2f} GB  {f['path']}")

Regex scan — test files only

import anscom
from collections import Counter
import os

result = anscom.scan(
    "/codebase",
    regex_filter = r"/tests?/.*\.py$",
    return_files = True,
    silent       = True
)

print(f"Test files: {result['total_files']}")

dirs = Counter(os.path.dirname(f["path"]) for f in result["files"])
for d, count in dirs.most_common(10):
    print(f"  {count:4d}  {d}")

Live Prometheus push

import anscom
from prometheus_client import Gauge, start_http_server

start_http_server(9090)
g_progress = Gauge("anscom_files_scanned",   "Files scanned so far")
g_total    = Gauge("anscom_total_files",      "Total files found")
g_duration = Gauge("anscom_duration_seconds", "Scan duration")

result = anscom.scan(
    "/data-lake",
    callback = lambda n: g_progress.set(n),
    silent   = True,
    workers  = 32
)

g_total.set(result["total_files"])
g_duration.set(result["duration_seconds"])

Full audit — everything at once

import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")
print("Written      : audit.json  inventory.csv  tree.txt")

Changelog

v1.5.0 (current)

  • Added return_files — per-file list in result dict with path, size, ext, category, mtime
  • Added export_csv — per-file inventory as UTF-8 CSV, zero dependencies, RFC 4180-compliant quoting
  • Added largest_n — top-N files by size using per-thread min-heap, O(log N) per file
  • Added find_duplicates — size-bucket + CRC32 duplicate detection, zero I/O for unique-size files
  • Added regex_filter — path pattern filter; POSIX regexec on Linux/macOS (no GIL), Python re fallback on Windows
  • Added FILEARRAY_INIT_CAP (65536) pre-allocation per thread — zero reallocations for typical scans
  • Fixed fstatat usage on Linux — now called only when needed, with two separate guards for type resolution vs. size/mtime collection
  • Fixed sorted_top lifetime — paths are strdup'd independently from global_heap, so there is no lifetime overlap and no double-free
  • Removed export_excel — was crashing on Windows due to an openpyxl Workbook.read_only exception; use export_csv + pandas DataFrame.to_excel() instead
  • Improved documentation — full docstring on anscom.scan, accessible via help(anscom.scan)

v1.3.0

  • Added export_json, export_excel, export_tree
  • Fixed DFS tree output ordering
  • Added file tracking in tree mode

v1.2.0 and earlier

  • Multi-threaded worker pool with condition-variable termination detection
  • getdents64 direct syscall on Linux
  • Per-thread statistics, zero shared state during scan
  • ignore_junk, min_size, extensions, callback, silent

License

MIT License. Free for personal and commercial use.

Download files

Download the file for your platform.

Source Distribution

anscom-1.5.0.tar.gz (56.8 kB)

Uploaded Source

Built Distribution


anscom-1.5.0-cp313-cp313-win_amd64.whl (32.8 kB)

Uploaded CPython 3.13, Windows x86-64

File details

Details for the file anscom-1.5.0.tar.gz.

File metadata

  • Download URL: anscom-1.5.0.tar.gz
  • Upload date:
  • Size: 56.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for anscom-1.5.0.tar.gz
Algorithm Hash digest
SHA256 74af1b2a8939f8209daa9c60a8c3d539b1dd0044a4be0ac1f1ae525c44c96a7c
MD5 82eb3ebbff07cbe40dc235560fd71806
BLAKE2b-256 e0ecc1e4b46b97ee975c88cb88f001c35f4895e6f42390581e86f5cbd36a64d7


File details

Details for the file anscom-1.5.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: anscom-1.5.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 32.8 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for anscom-1.5.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 3fb31007d96e46420bb02b43636b3fbf944cf99bf841f12c618cc794bb246184
MD5 3785a185f73d6da2c50be351f9458b42
BLAKE2b-256 3c8857d8bca5db03765ae5129c2842277f32279193a312276b2e0916771c1412

