
High-performance native C recursive file scanner: multi-threaded, terabyte-scale, with CSV/JSON/Tree export, duplicate detection, largest-N report, and regex filtering.

Project description

Anscom

High-performance native C recursive file scanner for Python. v1.5.0

MIT Licensed

Multi-threaded · Terabyte-scale · Zero dependencies · Cross-platform

pip install anscom

What it is

Anscom is a Python C extension that scans directories at raw OS speed. It uses direct kernel syscalls (getdents64 on Linux, FindFirstFileW on Windows, readdir/lstat on macOS), a multi-threaded work queue, and per-thread statistics accumulation. It never loads file contents into memory. It never follows symlinks. It never slows down as the filesystem grows.

The result is always a plain Python dict — five keys minimum, more when you ask for them.

import anscom

result = anscom.scan("/mnt/storage")
# → {'total_files': 2841903, 'scan_errors': 0, 'duration_seconds': 1.87,
#    'categories': {...}, 'extensions': {...}}

2.8 million files. 1.87 seconds. 16 threads. No configuration.


What's New in v1.5.0

v1.5.0 is a major feature release — the largest single update since the initial release. Every existing parameter, behavior, and output format from v1.3.0 is fully preserved.

Feature Parameter Description
File list return return_files=True Returns every scanned file as a list of dicts with path, size, ext, category, mtime
CSV export export_csv="out.csv" Writes per-file data to a UTF-8 CSV — zero dependencies
Largest-N report largest_n=20 Top N files by size via per-thread min-heap — O(log N) per file, no extra pass
Duplicate detection find_duplicates=True Groups files by size then CRC32 of first 4KB — returns grouped path lists
Regex filter regex_filter="pattern" Only counts files whose full path matches the pattern. Uses POSIX regexec on Linux/macOS (no GIL); Python re fallback on Windows

Performance note: All five features are strictly opt-in. A plain anscom.scan(".") with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.

Migration from v1.3.0

No breaking changes. All v1.3.0 code runs unchanged on v1.5.0. The new parameters all default to off.

# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)

# v1.5.0 — opt into new features as needed
result = anscom.scan(
    "/data",
    silent          = True,
    ignore_junk     = True,
    return_files    = True,   # new
    largest_n       = 20,     # new
    find_duplicates = True,   # new
    export_csv      = "inventory.csv",  # new
)

Installation

pip install anscom

Requires Python 3.6+. Works on Linux, macOS, and Windows.

Windows source builds require the "Desktop development with C++" workload from Visual Studio Build Tools.

No runtime dependencies. Every feature in v1.5.0 works with nothing else installed.

Verify

import anscom
r = anscom.scan(".", silent=True)
print(r["total_files"], "files —", round(r["duration_seconds"], 3), "s")

Quick Start

import anscom

# Default scan — prints live counter + full report
anscom.scan(".")

# Silent scan — just get the dict
result = anscom.scan(".", silent=True)

# Scan a specific path with more depth
result = anscom.scan("/home/user/projects", max_depth=20, silent=True)

# Print the category breakdown
for cat, count in result["categories"].items():
    if count > 0:
        print(f"{cat:20s} {count:>10,}")

Full API Reference

anscom.scan(
    path,                    # str      — required
    max_depth    = 6,        # int
    show_tree    = False,    # bool
    workers      = 0,        # int
    min_size     = 0,        # int
    extensions   = None,     # list[str] | None
    callback     = None,     # callable | None
    silent       = False,    # bool
    ignore_junk  = False,    # bool
    export_json  = None,     # str | None
    export_tree  = None,     # str | None
    return_files = False,    # bool       ← new in v1.5.0
    export_csv   = None,     # str | None ← new in v1.5.0
    largest_n    = 0,        # int        ← new in v1.5.0
    find_duplicates = False, # bool       ← new in v1.5.0
    regex_filter = None,     # str | None ← new in v1.5.0
) -> dict

Return Value

The return value is always a dict. Five keys are always present. Three are added on demand.

Key Type Always? Description
total_files int Files that passed all filters and were categorized
scan_errors int Paths that failed to open (permissions, broken links)
duration_seconds float Wall-clock time from first thread spawn to last join
categories dict[str, int] All 9 categories, always present even if zero
extensions dict[str, int] Only non-zero extension counts
files list[dict] return_files=True Per-file records: path, size, ext, category, mtime
largest_files list[dict] largest_n > 0 Top-N files by size: path, size
duplicates list[list[str]] find_duplicates=True Groups of paths sharing identical content (size + CRC32)

The nine category keys inside result["categories"]:

"Code/Source"    "Documents"      "Images"         "Videos"
"Audio"          "Archives"       "Executables"    "System/Config"
"Other/Unknown"

All Parameters in Depth

path

Type: str · Required

The root directory to scan. Accepts relative paths (., ../data), absolute paths (/mnt/storage, C:\Users), or an empty string (treated as .).

anscom.scan(".")
anscom.scan("/mnt/nas")
anscom.scan("C:\\Users\\Aditya\\Documents")
anscom.scan("")  # same as "."

max_depth

Type: int · Default: 6 · Range: [0, 64]

Maximum directory recursion depth. Depth 0 means only the immediate children of path are examined — no subdirectories are entered. Depth 64 is the hard ceiling enforced in C.

# Only the top level — no recursion
anscom.scan("/data", max_depth=0, silent=True)

# Standard project scan
anscom.scan("/project", max_depth=6, silent=True)

# Deep NAS or archive scan
anscom.scan("/mnt/archive", max_depth=30, silent=True)

# Maximum depth — unlimited for practical purposes
anscom.scan("/", max_depth=64, silent=True)

Values below 0 are clamped to 0. Values above 64 are clamped to 64.
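
Because raising max_depth only adds files, a quick sweep shows how much of a tree each depth level covers. An illustrative sketch, with /data as a placeholder path:

import anscom

# Compare coverage at increasing recursion depths.
for depth in (0, 2, 4, 6, 10):
    r = anscom.scan("/data", max_depth=depth, silent=True)
    print(f"max_depth={depth:2d}  files={r['total_files']:>10,}")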


workers

Type: int · Default: 0

Number of worker threads. 0 auto-detects the hardware CPU count via sysconf(_SC_NPROCESSORS_ONLN) on Linux/macOS and GetSystemInfo() on Windows. If auto-detection fails, falls back to 4.

When show_tree=True, workers is forced to 1 regardless of what is passed — multiple threads writing to stdout would produce interleaved output.

# Auto (recommended for most cases)
anscom.scan("/data", workers=0)

# Pin to a specific count
anscom.scan("/data", workers=8)

# Maximum parallelism on a 64-core machine
anscom.scan("/data", workers=64)

At shallow depths the work queue feeds all threads efficiently. At depth >= 3 each thread recurses inline, so thread count has diminishing returns past ~16 for typical filesystems unless the tree is extremely wide.
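
To pick a worker count for a specific volume, a rough timing sweep is usually enough. An illustrative sketch (the path is a placeholder; the OS page cache warms up after the first pass, so run the sweep twice for stable numbers):

import anscom

# Time the same scan at several thread counts.
for n in (1, 4, 8, 16, 32):
    r = anscom.scan("/mnt/storage", workers=n, silent=True)
    print(f"workers={n:2d}  {r['total_files']:>10,} files  {r['duration_seconds']:6.2f}s")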


min_size

Type: int · Default: 0 (no filter)

Skip all files smaller than this many bytes. Files below the threshold are not counted, not categorized, and not included in return_files or export_csv output.

# Only files larger than 1 MB
anscom.scan("/data", min_size=1024 * 1024, silent=True)

# Only files larger than 100 MB
anscom.scan("/mnt/video", min_size=100 * 1024 * 1024, silent=True)

# Only files larger than 1 GB
anscom.scan("/mnt/backup", min_size=1024 ** 3, silent=True)

On Linux, fstatat() is called to retrieve file size only when this filter is active. On Windows, the size is available directly in WIN32_FIND_DATAW at no extra syscall cost.
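
Combining min_size with return_files gives a quick inventory of only the large files. A sketch with a placeholder path and threshold:

import anscom

result = anscom.scan(
    "/mnt/storage",
    min_size     = 100 * 1024 * 1024,   # keep only files >= 100 MB
    return_files = True,
    silent       = True,
)
total_gb = sum(f["size"] for f in result["files"]) / 1024 ** 3
print(f"{result['total_files']:,} files >= 100 MB, {total_gb:.1f} GB total")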


extensions

Type: list[str] | None · Default: None

Extension whitelist. When set, only files whose extension matches one of the listed strings are counted. All other files are silently skipped — they do not appear in counts, categories, files, export_csv, or any other output.

Pass extensions without the leading dot, lowercase.

# Count only Python files
result = anscom.scan("/repo", extensions=["py"], silent=True)

# Count only web code
result = anscom.scan("/project", extensions=["js", "ts", "jsx", "tsx", "css", "html"])

# Count only media
result = anscom.scan("/media", extensions=["mp4", "mkv", "mov", "avi", "mp3", "flac"])

# Count only documents
result = anscom.scan("/docs", extensions=["pdf", "docx", "xlsx", "pptx", "md", "txt"])

Unknown extensions (not in the built-in table) are also excluded when a whitelist is active.


ignore_junk

Type: bool · Default: False

When True, the following directories are skipped entirely — no opendir, no syscall, no recursion. The check is a case-insensitive match on the directory basename, at any depth, under any parent.

Skipped directories:

Category Directories
Version control .git .svn .hg
IDE metadata .idea .vscode
Dependency trees node_modules bower_components site-packages .venv venv env
Build output build dist target __pycache__
Cache / temp temp tmp .cache .pytest_cache .mypy_cache

# Measure dependency bloat
raw   = anscom.scan("/project", ignore_junk=False, silent=True)
clean = anscom.scan("/project", ignore_junk=True,  silent=True)
bloat = raw["total_files"] - clean["total_files"]
print(f"Dependency files: {bloat:,}")

# Fast production audit — skip all junk
result = anscom.scan("/codebase", ignore_junk=True, workers=32, silent=True)

The default is False — Anscom counts everything unless you opt in to exclusions.


silent

Type: bool · Default: False

When False (default), Anscom prints:

  • A live "Scanned files: N ..." counter that updates every 250ms
  • The full summary report and extension breakdown on completion

When True, all of that is suppressed. The returned dict is always identical regardless of this flag.

silent=True does not suppress tree output from show_tree=True — those are separate.

# For scripting — no output, just the data
result = anscom.scan("/data", silent=True)

# For interactive use — full live output
anscom.scan("/data")

show_tree

Type: bool · Default: False

When True, prints a DFS-ordered directory tree to sys.stdout as each entry is discovered. Forces workers=1 to guarantee correct ordering.

  |-- [src]
  |   |   |-- main.py
  |   |   |-- utils.py
  |   |   |-- [tests]
  |   |   |   |   |-- test_main.py
  |-- [docs]
  |   |   |-- readme.md
  |-- config.json
  • Square brackets [name] indicate a directory
  • No brackets indicates a regular file
  • Each depth level adds " | " (6 characters) of indentation

Output is produced one line at a time via PySys_WriteStdout. Any sys.stdout redirect in Python will capture every line. There is no internal buffer — a 50 million file filesystem produces 50+ million lines without accumulating memory.

# Print tree to terminal
anscom.scan(".", show_tree=True, max_depth=4)

# Capture tree in Python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan(".", show_tree=True, max_depth=3, silent=True)
sys.stdout = sys.__stdout__
tree_text = buf.getvalue()

# Save tree to file (see also export_tree)
anscom.scan("/data", show_tree=True, silent=True, export_tree="tree.txt")

callback

Type: callable | None · Default: None

A Python callable invoked approximately every 1 second with the current scanned file count as a single int argument. Fired by the progress thread every 4th tick (250ms × 4 = 1000ms).

The GIL is acquired before each call and released immediately after. Scan worker threads are never blocked by callback invocation.

def on_progress(n):
    print(f"\rScanned: {n:,}", end="", flush=True)

result = anscom.scan("/data", callback=on_progress, silent=True)
print()

# Push to Prometheus
from prometheus_client import Gauge
g = Gauge("files_scanned", "Current file scan count")
anscom.scan("/data", callback=lambda n: g.set(n), silent=True)

export_json

Type: str | None · Default: None

Path to write the full result dict as a formatted JSON file. Uses Python's built-in json module — no external dependencies. Written with 4-space indentation after the scan completes.

The JSON file contains all keys that are in the returned dict, including optional keys (files, largest_files, duplicates) when those features are enabled in the same call.

anscom.scan("/data", export_json="report.json", silent=True)

# With optional features — JSON gets those keys too
anscom.scan(
    "/data",
    export_json     = "report.json",
    return_files    = True,
    largest_n       = 10,
    find_duplicates = True,
    silent          = True
)

Example output:

{
    "total_files": 21008,
    "scan_errors": 0,
    "duration_seconds": 1.5186,
    "categories": {
        "Code/Source": 5955,
        "Documents": 203,
        "Images": 151,
        "Videos": 0,
        "Audio": 730,
        "Archives": 0,
        "Executables": 0,
        "System/Config": 5707,
        "Other/Unknown": 8992
    },
    "extensions": {
        "py": 5955,
        "pyc": 5707,
        "mp3": 730,
        "txt": 160,
        "png": 151
    }
}
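
Because the exported JSON is just the result dict, two reports taken at different times can be diffed with the standard library. An illustrative sketch, with placeholder file names:

import json

# Load two snapshots exported with export_json on different days.
with open("report_old.json") as f:
    old = json.load(f)
with open("report_new.json") as f:
    new = json.load(f)

print("Total files:", old["total_files"], "->", new["total_files"])
for cat, count in new["categories"].items():
    delta = count - old["categories"].get(cat, 0)
    if delta:
        print(f"{cat:20s} {delta:+,}")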

export_tree

Type: str | None · Default: None

Path to write the tree output to a text file. Only active when show_tree=True.

The file is written incrementally — each line is written and flushed as it is produced. For a filesystem with 50 million entries this produces a multi-gigabyte file without accumulating any output in memory. stdout and the file both receive every line simultaneously.

anscom.scan(
    "/mnt/storage",
    show_tree   = True,
    export_tree = "filesystem_tree.txt",
    silent      = True,
    max_depth   = 64
)

export_csv

Type: str | None · Default: None · New in v1.5.0

Path to write a per-file inventory as a UTF-8 CSV. Columns: path, size, ext, category, mtime.

  • path: full absolute path, RFC 4180-quoted (double-quoted, inner quotes doubled)
  • size: file size in bytes as an integer
  • ext: lowercase extension without the dot (empty string for unrecognized extensions)
  • category: one of the 9 category names
  • mtime: Unix timestamp (seconds since epoch) of last modification

anscom.scan("/data", export_csv="inventory.csv", silent=True)

Loading the CSV downstream:

# With pandas
import pandas as pd
df = pd.read_csv("inventory.csv")
print(df.groupby("category")["size"].sum().sort_values(ascending=False))

# Convert to Excel
df.to_excel("report.xlsx", index=False)

# Standard library only
import csv
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["path"], row["size"])

# With openpyxl directly
import csv, openpyxl
wb = openpyxl.Workbook()
ws = wb.active
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save("report.xlsx")
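
The mtime column is a Unix timestamp in seconds; when loading the CSV with pandas it can be converted to a datetime for readable reports (a small sketch, assuming the inventory.csv written above):

import pandas as pd

df = pd.read_csv("inventory.csv")
df["mtime"] = pd.to_datetime(df["mtime"], unit="s")          # seconds since epoch -> datetime
print(df.sort_values("mtime").head(10)[["path", "mtime"]])   # the ten oldest files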

return_files

Type: bool · Default: False · New in v1.5.0

When True, the result dict gains a "files" key containing a Python list of dicts — one entry per scanned file.

Each dict has five fields:

Field Type Description
path str Full absolute path to the file
size int File size in bytes
ext str Lowercase extension (no dot), empty string if unrecognized
category str One of the 9 category names
mtime int Unix timestamp of last modification

result = anscom.scan("/project", return_files=True, silent=True)

# Iterate
for f in result["files"]:
    print(f["path"], f["size"], f["category"])

# Filter in Python
large_code = [
    f for f in result["files"]
    if f["category"] == "Code/Source" and f["size"] > 50_000
]

# Sort by size descending
by_size = sorted(result["files"], key=lambda f: f["size"], reverse=True)
print("Largest file:", by_size[0]["path"])

# Group by extension
from collections import defaultdict
by_ext = defaultdict(list)
for f in result["files"]:
    by_ext[f["ext"]].append(f)

len(result["files"]) == result["total_files"] is always true.
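
Since every record carries mtime, the file list also works for age-based housekeeping. An illustrative sketch flagging files untouched for more than a year (path and cutoff are placeholders):

import time
import anscom

cutoff = time.time() - 365 * 24 * 3600   # one year ago, in Unix seconds

result = anscom.scan("/mnt/archive", return_files=True, silent=True)
stale = [f for f in result["files"] if f["mtime"] < cutoff]
stale_gb = sum(f["size"] for f in stale) / 1024 ** 3
print(f"{len(stale):,} stale files, {stale_gb:.1f} GB")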


largest_n

Type: int · Default: 0 (disabled) · New in v1.5.0

When > 0, finds the top N files by size across the entire scanned filesystem. Uses a per-thread min-heap of capacity N — O(log N) per file, no extra pass, no sorting of the full file list. After all threads join, per-thread heaps are merged and sorted descending.

The result dict gains a "largest_files" key where each entry is a dict with path (str) and size (int).

result = anscom.scan("/mnt/storage", largest_n=20, silent=True)

for f in result["largest_files"]:
    gb = f["size"] / (1024 ** 3)
    print(f"{gb:8.2f} GB  {f['path']}")

The printed report also gains a section:

=== TOP 20 LARGEST FILES ===========================
  1073741824 bytes : /data/backup/archive.tar.gz
   536870912 bytes : /data/media/4k_reel.mkv
...
===================================================

# Find the single largest file
result = anscom.scan("/mnt/nas", largest_n=1, silent=True)
top = result["largest_files"][0]
print(f"Largest: {top['path']} ({top['size']:,} bytes)")

# Top 100 across a petabyte volume
result = anscom.scan("/mnt/petabyte", largest_n=100, workers=64, silent=True)

find_duplicates

Type: bool · Default: False · New in v1.5.0

When True, detects duplicate files using a two-phase algorithm:

  1. Size bucketing — all files sorted by size. Files with a unique size are skipped entirely — zero I/O.
  2. CRC32 fingerprinting — for each same-size group (≥2 files, non-zero size), the first 4096 bytes of each file are read and CRC32 is computed. Files with matching CRC32 are reported as duplicates.

The result dict gains a "duplicates" key: a list of groups, each group being a list of path strings. Every group has at least 2 members.

result = anscom.scan("/media-library", find_duplicates=True, silent=True)

print(f"Duplicate groups: {len(result['duplicates'])}")

for group in result["duplicates"]:
    print(f"\nDuplicate set ({len(group)} files):")
    for path in group:
        print(f"  {path}")

Calculating reclaimable space (combine with return_files=True):

result = anscom.scan(
    "/mnt/archive",
    find_duplicates = True,
    return_files    = True,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}

wasted = sum(
    sum(size_map.get(p, 0) for p in group[1:])   # keep 1, discard rest
    for group in result["duplicates"]
)

print(f"Reclaimable: {wasted / (1024**3):.2f} GB across {len(result['duplicates'])} groups")

The printed report adds:

=== DUPLICATES SUMMARY ============================
Groups found : 142
===================================================
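
The CRC32-of-first-4KB fingerprint is fast but not proof of byte-for-byte identity, so before deleting anything you may want to confirm each group with a full-content hash. A minimal sketch using only the standard library (this re-reads every candidate file in Python, so it is far slower than the scan itself):

import hashlib
import anscom

result = anscom.scan("/media-library", find_duplicates=True, silent=True)

def sha256_of(path, bufsize=1 << 20):
    # Full-content SHA-256, read in 1 MB chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

confirmed = []
for group in result["duplicates"]:
    by_hash = {}
    for path in group:
        try:
            by_hash.setdefault(sha256_of(path), []).append(path)
        except OSError:
            pass   # file vanished or became unreadable since the scan
    confirmed.extend(g for g in by_hash.values() if len(g) > 1)

print(f"{len(confirmed)} groups confirmed identical by full SHA-256")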

regex_filter

Type: str | None · Default: None · New in v1.5.0

A regular expression pattern. When set, only files whose full absolute path matches the pattern are counted, categorized, and included in any file-tracking output (return_files, export_csv, find_duplicates, largest_n).

Platform behavior:

  • Linux / macOS: Compiled with POSIX regcomp(REG_EXTENDED | REG_NOSUB), matched with regexec (no GIL acquisition), runs fully in C inside the worker threads.
  • Windows: Falls back to Python's re module (GIL acquired per file). For large scans on Windows, prefer extensions whitelist filtering which has zero GIL cost.

The pattern is also compiled with Python's re.compile before the scan starts. An invalid pattern raises ValueError immediately.

# Only .py files anywhere under a tests/ directory
result = anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)

# Only files in directories named 'src'
result = anscom.scan("/project", regex_filter=r"/src/", silent=True)

# Only Python test files
result = anscom.scan("/repo", regex_filter=r"test_.*\.py$", silent=True)
print(f"Test files: {result['total_files']}")

# Invalid patterns raise ValueError immediately — no scan is started
try:
    anscom.scan("/data", regex_filter=r"[invalid(")
except ValueError as e:
    print(e)  # Failed to compile regex_filter.

Export Features

All export parameters are independent and combinable. A single scan pass can write to all simultaneously — one traversal, multiple outputs, no re-scanning.

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)
# One scan. Four output files. Full in-memory results.

Parameter Format Dependencies Notes
export_json JSON None (built-in) Full result dict including optional keys
export_csv CSV None (built-in) Per-file: path, size, ext, category, mtime
export_tree Plain text Requires show_tree=True Written line-by-line, safe at any scale

Tree Mode

# Basic tree to terminal
anscom.scan(".", show_tree=True)

# Tree saved to file
anscom.scan("/project", show_tree=True, export_tree="tree.txt", silent=True)

# Deep tree, no terminal output
import sys, io
sys.stdout = io.StringIO()
anscom.scan("/mnt/volume", show_tree=True, export_tree="tree.txt", max_depth=64)
sys.stdout = sys.__stdout__

Output format:

  |-- [src]            ← [brackets] = directory
  |   |   |-- main.py  ← no brackets = regular file
  |   |   |-- [lib]
  |   |   |   |   |-- utils.py
  |-- config.json
  |-- [tests]
  |   |   |-- test_core.py
  • One " | " block per depth level (6 chars each)
  • At depth 64: 384 characters of indentation — all structurally valid
  • DFS order is strict: every file inside a directory appears before that directory's sibling
  • workers is forced to 1 — required for correct ordering
  • No internal buffer — safe at 50+ million entries

Exclusion Filter

ignore_junk=True skips these directory names at any depth, at any nesting level:

.git         .svn         .hg          .idea        .vscode
node_modules bower_components site-packages .venv   venv
env          build         dist          target      __pycache__
temp         tmp           .cache        .pytest_cache .mypy_cache

The check is case-insensitive basename comparison — not a path substring match. A node_modules at /project/frontend/node_modules/ is caught regardless of nesting depth.
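
For reference, a rough Python equivalent of that check. The real test runs in C before the directory is ever opened, so this is purely illustrative:

import os

JUNK = {
    ".git", ".svn", ".hg", ".idea", ".vscode",
    "node_modules", "bower_components", "site-packages", ".venv", "venv",
    "env", "build", "dist", "target", "__pycache__",
    "temp", "tmp", ".cache", ".pytest_cache", ".mypy_cache",
}

def is_junk_dir(path):
    # Case-insensitive match on the basename only, never a substring test.
    return os.path.basename(path.rstrip("/\\")).lower() in JUNK

print(is_junk_dir("/project/frontend/node_modules"))    # True
print(is_junk_dir("/project/my_node_modules_backup"))   # False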


Report Format

Printed to sys.stdout when silent=False (the default).

Anscom Enterprise v1.5.0 (Threads: 16)
Target: /data

Scanned files: 21008 ...

=== SUMMARY REPORT ================================
+-----------------+--------------+----------+
| Category        | Count        | Percent  |
+-----------------+--------------+----------+
| Code/Source     |         5955 |   28.34% |
| System/Config   |         5707 |   27.16% |
| Other/Unknown   |         8992 |   42.81% |
| Documents       |          203 |    0.97% |
| Images          |          151 |    0.72% |
+-----------------+--------------+----------+
| TOTAL FILES     |        21008 |  100.00% |
+-----------------+--------------+----------+

=== DETAILED EXTENSION BREAKDOWN ==================
+-----------------+--------------+
| Extension       | Count        |
+-----------------+--------------+
| .py             |         5955 |
| .pyc            |         5707 |
| .mp3            |          730 |
| .txt            |          160 |
| .png            |          151 |
+-----------------+--------------+

Time     : 1.5186 seconds
Errors   : 0 (permission denied / inaccessible)
===================================================

=== TOP 20 LARGEST FILES ===========================   ← only with largest_n > 0
  1073741824 bytes : /data/backup/full.tar.gz
...

=== DUPLICATES SUMMARY ============================   ← only with find_duplicates=True
Groups found : 142
===================================================

Capture programmatically:

import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan("/data")
sys.stdout = sys.__stdout__
report_text = buf.getvalue()

File Categories and Extensions

170+ extensions across 9 categories. The table is sorted lexicographically and validated at module init — if the sort invariant is violated, import anscom raises RuntimeError.

Category Sample Extensions
Code/Source c cpp cs go h html java js json jsx kt lua php py r rb rs sh sql swift ts vue xml yaml yml
Documents csv doc docx epub md mobi odp ods odt pdf ppt pptx rst rtf txt xls xlsx
Images ai avif bmp gif heic ico jpeg jpg png psd raw svg tiff webp
Videos avi flv mkv mov mp4 mpeg ogv webm wmv
Audio aac flac m4a mid mp3 ogg wav wma
Archives 7z bz2 deb dmg gz iso jar rar tar tgz zip
Executables app bin class dll elf exe msi pyd so
System/Config bak cfg conf db env gitignore ini log pyc reg sys tmp ttf woff
Other/Unknown Any extension not in the above table

Architecture

OS backends

Three separate scanning implementations compiled and selected at build time:

Platform Backend Mechanism
Linux getdents64 Direct syscall(SYS_getdents64, dirfd, buf, 131072) — raw kernel ABI, 128KB read buffer, d_type for zero-stat type detection
Windows FindFirstFileW Wide-char wchar_t paths, UTF-16→UTF-8 conversion, size+mtime from WIN32_FIND_DATAW at no extra syscall cost
macOS / BSD POSIX readdir opendir/readdir with lstat for type resolution

Thread model

main thread
  ├── spawn N worker threads (all waiting on cond var)
  ├── spawn 1 progress thread
  ├── push root path to queue
  ├── wait until queue.count == 0 && active_workers == 0
  └── join all threads → merge stats

worker thread (×N)
  └── loop: queue_pop → process_dir_recursive → queue_task_done

process_dir_recursive
  ├── depth < 3: push subdirs to queue (parallel pickup by idle threads)
  └── depth ≥ 3: recurse inline (avoids queue overhead for deep narrow trees)

Per-thread stats — zero locks during counting

Each thread has its own ScanStats struct with ext_counts[170+] and cat_counts[9]. No lock is acquired during file categorization. The only shared atomic write per file is a single __sync_fetch_and_add for the progress counter. Stats are merged in one serial pass after all threads join.

Slab path allocator

Each thread allocates (max_depth + 2) * PATH_MAX bytes once before scanning. Path strings during traversal are written into slab[depth * PATH_MAX] via snprintf. Zero heap allocation during traversal.

Extension hash table

512-slot open-addressing hash table with FNV-1a hash and linear probing. Built once at module init from the sorted extension table. O(1) average lookup, no heap allocation, never modified after init.

FileArray pre-allocation

When return_files, export_csv, or find_duplicates is enabled, each thread pre-allocates a FileInfo array of 65,536 entries before scanning begins. Growth beyond that doubles via realloc. For typical filesystems: zero reallocations during the scan.

Min-heap for largest_n

Each thread maintains a min-heap of capacity N. Per-file cost: O(log N) comparison, no lock. Thread heaps merged globally after join using the same push logic.
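
In Python terms the logic is roughly what heapq provides: keep a size-N min-heap whose root is the smallest survivor, then merge the per-thread heaps and sort. An illustrative sketch, not the actual C code:

import heapq

def push_top_n(heap, n, size, path):
    # O(log N) per file: the heap root is always the smallest of the current top N.
    if len(heap) < n:
        heapq.heappush(heap, (size, path))
    elif size > heap[0][0]:
        heapq.heapreplace(heap, (size, path))

def merge_heaps(per_thread_heaps, n):
    # After all threads join: merge with the same push logic, then sort descending.
    merged = []
    for heap in per_thread_heaps:
        for size, path in heap:
            push_top_n(merged, n, size, path)
    return sorted(merged, reverse=True)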

Two-phase duplicate detection

  1. qsort all files by size — O(M log M), no I/O
  2. For each same-size group ≥2 members: read first 4KB of each, compute CRC32, sort by CRC32, group consecutive matches

Zero I/O for unique-size files. One bounded read per candidate. CRC32 computed using a fully inlined lookup table — no external library.
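
An illustrative Python equivalent of the same two-phase logic, using zlib.crc32 on the first 4096 bytes (the real implementation does this in C with its own inlined CRC32 table):

import zlib
from collections import defaultdict

def find_duplicate_groups(files):
    """files: iterable of (path, size) pairs; returns lists of probable duplicates."""
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)

    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue                          # unique or empty size bucket: zero I/O
        by_crc = defaultdict(list)
        for path in paths:
            try:
                with open(path, "rb") as f:
                    by_crc[zlib.crc32(f.read(4096))].append(path)
            except OSError:
                continue                      # unreadable candidate: skip it
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups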


Security and Compliance

Property Guarantee
No file contents read Only directory entries and metadata. Exception: find_duplicates=True reads up to 4KB per candidate — bounded, opt-in, read-only
Symlinks never followed Linux: fstatat(AT_SYMLINK_NOFOLLOW). POSIX: lstat. Windows: FILE_ATTRIBUTE_REPARSE_POINT skipped unconditionally
Depth hard-capped at 64 Enforced in C at the top of every process_dir_recursive call — cannot be bypassed by filesystem topology
All path assembly bounded snprintf(slab, PATH_MAX, ...) — always null-terminated, always within PATH_MAX bytes
Errors counted, not silenced Every failed opendir/open/FindFirstFileW increments scan_errors and continues — the final count is exact
Work queue bounded 131,072 fixed slots. Overflow falls back to inline recursion — no unbounded allocation
Hash table immutable after init Built once at module load. No runtime modification
Zero external dependencies No mandatory third-party packages — no supply chain surface

Enterprise Recipes

Storage cost allocation

import anscom

result = anscom.scan("/mnt/nas", workers=16, ignore_junk=True, silent=True)
total = result["total_files"]
cats  = result["categories"]

media = cats["Videos"] + cats["Images"] + cats["Audio"]
code  = cats["Code/Source"]
docs  = cats["Documents"]

print(f"Media   : {media:>10,}  ({media/total*100:5.1f}%)")
print(f"Code    : {code:>10,}  ({code/total*100:5.1f}%)")
print(f"Docs    : {docs:>10,}  ({docs/total*100:5.1f}%)")
print(f"Total   : {total:>10,}  in {result['duration_seconds']:.2f}s")

Pre-migration audit

import anscom

result = anscom.scan(
    "/legacy-server/data",
    max_depth    = 30,
    silent       = True,
    return_files = True,
    export_json  = "audit.json",
    export_csv   = "inventory.csv"
)

print(f"Recorded {result['total_files']:,} files")
print(f"Errors  : {result['scan_errors']}")

CI/CD policy gate

import anscom, sys

result = anscom.scan("./repo", silent=True, ignore_junk=True)

violations = []
if result["categories"]["Executables"] > 0:
    violations.append(f"{result['categories']['Executables']} executable files")
if result["categories"]["Videos"] > 0:
    violations.append(f"{result['categories']['Videos']} video files")

if violations:
    for v in violations:
        print(f"POLICY VIOLATION: {v}")
    sys.exit(1)

print("File composition check passed.")

Storage reclamation

import anscom

result = anscom.scan(
    "/mnt/media-archive",
    find_duplicates = True,
    return_files    = True,
    workers         = 16,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}
wasted   = sum(
    sum(size_map.get(p, 0) for p in group[1:])
    for group in result["duplicates"]
)

print(f"Duplicate groups : {len(result['duplicates'])}")
print(f"Reclaimable      : {wasted / (1024**3):.2f} GB")

groups_by_waste = sorted(
    result["duplicates"],
    key=lambda g: sum(size_map.get(p, 0) for p in g[1:]),
    reverse=True
)
for group in groups_by_waste[:5]:
    waste = sum(size_map.get(p, 0) for p in group[1:])
    print(f"\n  {waste / (1024**2):.1f} MB wasted:")
    for path in group:
        print(f"    {path}")

Top-100 largest files

import anscom

result = anscom.scan("/mnt/storage", largest_n=100, workers=32, silent=True)

total_gb = sum(f["size"] for f in result["largest_files"]) / (1024**3)
print(f"Top 100 total: {total_gb:.1f} GB\n")

for i, f in enumerate(result["largest_files"][:10], 1):
    print(f"{i:3}. {f['size']/1024**3:8.2f} GB  {f['path']}")

Regex scan — test files only

import anscom
from collections import Counter
import os

result = anscom.scan(
    "/codebase",
    regex_filter = r"/tests?/.*\.py$",
    return_files = True,
    silent       = True
)

print(f"Test files: {result['total_files']}")

dirs = Counter(os.path.dirname(f["path"]) for f in result["files"])
for d, count in dirs.most_common(10):
    print(f"  {count:4d}  {d}")

Live Prometheus push

import anscom
from prometheus_client import Gauge, start_http_server

start_http_server(9090)
g_progress = Gauge("anscom_files_scanned",   "Files scanned so far")
g_total    = Gauge("anscom_total_files",      "Total files found")
g_duration = Gauge("anscom_duration_seconds", "Scan duration")

result = anscom.scan(
    "/data-lake",
    callback = lambda n: g_progress.set(n),
    silent   = True,
    workers  = 32
)

g_total.set(result["total_files"])
g_duration.set(result["duration_seconds"])

Full audit — everything at once

import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")
print("Written      : audit.json  inventory.csv  tree.txt")

Changelog

v1.5.0 (current)

  • Added return_files — per-file list in result dict with path, size, ext, category, mtime
  • Added export_csv — per-file inventory as UTF-8 CSV, zero dependencies, RFC 4180-compliant quoting
  • Added largest_n — top-N files by size using per-thread min-heap, O(log N) per file
  • Added find_duplicates — size-bucket + CRC32 duplicate detection, zero I/O for unique-size files
  • Added regex_filter — path pattern filter; POSIX regexec on Linux/macOS (no GIL), Python re fallback on Windows
  • Added FILEARRAY_INIT_CAP (65536) pre-allocation per thread — zero reallocations for typical scans
  • Fixed fstatat usage on Linux — now called only when needed, with two separate guards for type resolution vs. size/mtime collection
  • Fixed sorted_top lifetime — paths are strdup'd independently from global_heap, so there is no lifetime overlap and no double-free
  • Removed export_excel — was crashing on Windows due to an openpyxl Workbook.read_only exception; use export_csv + pandas DataFrame.to_excel() instead
  • Improved documentation — full docstring on anscom.scan, accessible via help(anscom.scan)

v1.3.0

  • Added export_json, export_excel, export_tree
  • Fixed DFS tree output ordering
  • Added file tracking in tree mode

v1.2.0 and earlier

  • Multi-threaded worker pool with condition-variable termination detection
  • getdents64 direct syscall on Linux
  • Per-thread statistics, zero shared state during scan
  • ignore_junk, min_size, extensions, callback, silent

License

MIT License. Free for personal and commercial use.

Download files

Download the file for your platform.

Source Distribution

anscom-1.5.0.tar.gz (56.8 kB)

Uploaded Source

Built Distribution


anscom-1.5.0-cp313-cp313-win_amd64.whl (32.8 kB)

Uploaded CPython 3.13, Windows x86-64

File details

Details for the file anscom-1.5.0.tar.gz.

File metadata

  • Download URL: anscom-1.5.0.tar.gz
  • Upload date:
  • Size: 56.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for anscom-1.5.0.tar.gz
Algorithm Hash digest
SHA256 74af1b2a8939f8209daa9c60a8c3d539b1dd0044a4be0ac1f1ae525c44c96a7c
MD5 82eb3ebbff07cbe40dc235560fd71806
BLAKE2b-256 e0ecc1e4b46b97ee975c88cb88f001c35f4895e6f42390581e86f5cbd36a64d7


File details

Details for the file anscom-1.5.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: anscom-1.5.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 32.8 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for anscom-1.5.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 3fb31007d96e46420bb02b43636b3fbf944cf99bf841f12c618cc794bb246184
MD5 3785a185f73d6da2c50be351f9458b42
BLAKE2b-256 3c8857d8bca5db03765ae5129c2842277f32279193a312276b2e0916771c1412

