High-performance native C recursive file scanner: multi-threaded, terabyte-scale, with CSV/JSON/Tree export, duplicate detection, largest-N report, and regex filtering.
Anscom
High-performance native C recursive file scanner for Python. v1.5.0
MIT Licensed
Multi-threaded · Terabyte-scale · Zero dependencies · Cross-platform
pip install anscom
What it is
Anscom is a Python C extension that scans directories at raw OS speed. It uses the fastest native directory primitive on each platform (the getdents64 syscall on Linux, FindFirstFileW on Windows, readdir/lstat on macOS), a multi-threaded work queue, and per-thread statistics accumulation. It never loads file contents into memory. It never follows symlinks. And it does not slow down as the filesystem grows.
The result is always a plain Python dict — five keys minimum, more when you ask for them.
import anscom
result = anscom.scan("/mnt/storage")
# → {'total_files': 2841903, 'scan_errors': 0, 'duration_seconds': 1.87,
# 'categories': {...}, 'extensions': {...}}
2.8 million files. 1.87 seconds. 16 threads. No configuration.
What's New in v1.5.0
v1.5.0 is a major feature release — the largest single update since the initial release. Every existing parameter, behavior, and output format from v1.3.0 is fully preserved.
| Feature | Parameter | Description |
|---|---|---|
| File list return | `return_files=True` | Returns every scanned file as a list of dicts with `path`, `size`, `ext`, `category`, `mtime` |
| CSV export | `export_csv="out.csv"` | Writes per-file data to a UTF-8 CSV — zero dependencies |
| Largest-N report | `largest_n=20` | Top N files by size via per-thread min-heap — O(log N) per file, no extra pass |
| Duplicate detection | `find_duplicates=True` | Groups files by size then CRC32 of first 4KB — returns grouped path lists |
| Regex filter | `regex_filter="pattern"` | Only counts files whose full path matches the pattern. Uses POSIX `regexec` on Linux/macOS (no GIL); Python `re` fallback on Windows |
Performance note: All five features are strictly opt-in. A plain anscom.scan(".") with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.
Migration from v1.3.0
No breaking changes. All v1.3.0 code runs unchanged on v1.5.0. The new parameters all default to off.
# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)
# v1.5.0 — opt into new features as needed
result = anscom.scan(
"/data",
silent = True,
ignore_junk = True,
return_files = True, # new
largest_n = 20, # new
find_duplicates = True, # new
export_csv = "inventory.csv", # new
)
Table of Contents
- Installation
- Quick Start
- Full API Reference
- Return Value
- All Parameters in Depth
- Export Features
- Tree Mode
- Exclusion Filter
- Report Format
- File Categories and Extensions
- Architecture
- Security and Compliance
- Enterprise Recipes
- Changelog
- License
Installation
pip install anscom
Requires Python 3.6+. Works on Linux, macOS, and Windows.
Windows source builds require the "Desktop development with C++" workload from Visual Studio Build Tools.
No runtime dependencies. Every feature in v1.5.0 works with nothing else installed.
Verify
import anscom
r = anscom.scan(".", silent=True)
print(r["total_files"], "files —", round(r["duration_seconds"], 3), "s")
Quick Start
import anscom
# Default scan — prints live counter + full report
anscom.scan(".")
# Silent scan — just get the dict
result = anscom.scan(".", silent=True)
# Scan a specific path with more depth
result = anscom.scan("/home/user/projects", max_depth=20, silent=True)
# Print the category breakdown
for cat, count in result["categories"].items():
if count > 0:
print(f"{cat:20s} {count:>10,}")
Full API Reference
anscom.scan(
path, # str — required
max_depth = 6, # int
show_tree = False, # bool
workers = 0, # int
min_size = 0, # int
extensions = None, # list[str] | None
callback = None, # callable | None
silent = False, # bool
ignore_junk = False, # bool
export_json = None, # str | None
export_tree = None, # str | None
return_files = False, # bool ← new in v1.5.0
export_csv = None, # str | None ← new in v1.5.0
largest_n = 0, # int ← new in v1.5.0
find_duplicates = False, # bool ← new in v1.5.0
regex_filter = None, # str | None ← new in v1.5.0
) -> dict
Return Value
The return value is always a dict. Five keys are always present. Three are added on demand.
| Key | Type | Always? | Description |
|---|---|---|---|
| `total_files` | int | ✓ | Files that passed all filters and were categorized |
| `scan_errors` | int | ✓ | Paths that failed to open (permissions, broken links) |
| `duration_seconds` | float | ✓ | Wall-clock time from first thread spawn to last join |
| `categories` | dict[str, int] | ✓ | All 9 categories, always present even if zero |
| `extensions` | dict[str, int] | ✓ | Only non-zero extension counts |
| `files` | list[dict] | `return_files=True` | Per-file records: `path`, `size`, `ext`, `category`, `mtime` |
| `largest_files` | list[dict] | `largest_n > 0` | Top-N files by size: `path`, `size` |
| `duplicates` | list[list[str]] | `find_duplicates=True` | Groups of paths sharing identical content (size + CRC32) |
The nine category keys inside result["categories"]:
"Code/Source" "Documents" "Images" "Videos"
"Audio" "Archives" "Executables" "System/Config"
"Other/Unknown"
All Parameters in Depth
path
Type: str — Required
The root directory to scan. Accepts relative paths (., ../data), absolute paths (/mnt/storage, C:\Users), or an empty string (treated as .).
anscom.scan(".")
anscom.scan("/mnt/nas")
anscom.scan("C:\\Users\\Aditya\\Documents")
anscom.scan("") # same as "."
max_depth
Type: int — Default: 6 — Range: [0, 64]
Maximum directory recursion depth. Depth 0 means only the immediate children of path are examined — no subdirectories are entered. Depth 64 is the hard ceiling enforced in C.
# Only the top level — no recursion
anscom.scan("/data", max_depth=0, silent=True)
# Standard project scan
anscom.scan("/project", max_depth=6, silent=True)
# Deep NAS or archive scan
anscom.scan("/mnt/archive", max_depth=30, silent=True)
# Maximum depth — unlimited for practical purposes
anscom.scan("/", max_depth=64, silent=True)
Values below 0 are clamped to 0. Values above 64 are clamped to 64.
workers
Type: int — Default: 0
Number of worker threads. 0 auto-detects the hardware CPU count via sysconf(_SC_NPROCESSORS_ONLN) on Linux/macOS and GetSystemInfo() on Windows. If auto-detection fails, falls back to 4.
When show_tree=True, workers is forced to 1 regardless of what is passed — multiple threads writing to stdout would produce interleaved output.
# Auto (recommended for most cases)
anscom.scan("/data", workers=0)
# Pin to a specific count
anscom.scan("/data", workers=8)
# Maximum parallelism on a 64-core machine
anscom.scan("/data", workers=64)
At shallow depths the work queue feeds all threads efficiently. At depth >= 3 each thread recurses inline, so thread count has diminishing returns past ~16 for typical filesystems unless the tree is extremely wide.
min_size
Type: int — Default: 0 (no filter)
Skip all files smaller than this many bytes. Files below the threshold are not counted, not categorized, and not included in return_files or export_csv output.
# Only files larger than 1 MB
anscom.scan("/data", min_size=1024 * 1024, silent=True)
# Only files larger than 100 MB
anscom.scan("/mnt/video", min_size=100 * 1024 * 1024, silent=True)
# Only files larger than 1 GB
anscom.scan("/mnt/backup", min_size=1024 ** 3, silent=True)
On Linux, fstatat() is called to retrieve file size only when this filter is active. On Windows, the size is available directly in WIN32_FIND_DATAW at no extra syscall cost.
extensions
Type: list[str] | None — Default: None
Extension whitelist. When set, only files whose extension matches one of the listed strings are counted. All other files are silently skipped — they do not appear in counts, categories, files, export_csv, or any other output.
Pass extensions without the leading dot, lowercase.
# Count only Python files
result = anscom.scan("/repo", extensions=["py"], silent=True)
# Count only web code
result = anscom.scan("/project", extensions=["js", "ts", "jsx", "tsx", "css", "html"])
# Count only media
result = anscom.scan("/media", extensions=["mp4", "mkv", "mov", "avi", "mp3", "flac"])
# Count only documents
result = anscom.scan("/docs", extensions=["pdf", "docx", "xlsx", "pptx", "md", "txt"])
Unknown extensions (not in the built-in table) are also excluded when a whitelist is active.
ignore_junk
Type: bool — Default: False
When True, the following directories are skipped entirely — no opendir, no syscall, no recursion. The check is a case-insensitive match on the directory basename, at any depth, under any parent.
Skipped directories:
| Category | Directories |
|---|---|
| Version control | .git .svn .hg |
| IDE metadata | .idea .vscode |
| Dependency trees | node_modules bower_components site-packages .venv venv env |
| Build output | build dist target __pycache__ |
| Cache / temp | temp tmp .cache .pytest_cache .mypy_cache |
# Measure dependency bloat
raw = anscom.scan("/project", ignore_junk=False, silent=True)
clean = anscom.scan("/project", ignore_junk=True, silent=True)
bloat = raw["total_files"] - clean["total_files"]
print(f"Dependency files: {bloat:,}")
# Fast production audit — skip all junk
result = anscom.scan("/codebase", ignore_junk=True, workers=32, silent=True)
The default is False — Anscom counts everything unless you opt in to exclusions.
silent
Type: bool — Default: False
When False (default), Anscom prints:
- A live "Scanned files: N ..." counter that updates every 250ms
- The full summary report and extension breakdown on completion
When True, all of that is suppressed. The returned dict is always identical regardless of this flag.
silent=True does not suppress tree output from show_tree=True — those are separate.
# For scripting — no output, just the data
result = anscom.scan("/data", silent=True)
# For interactive use — full live output
anscom.scan("/data")
show_tree
Type: bool — Default: False
When True, prints a DFS-ordered directory tree to sys.stdout as each entry is discovered. Forces workers=1 to guarantee correct ordering.
|-- [src]
| | |-- main.py
| | |-- utils.py
| | |-- [tests]
| | | | |-- test_main.py
|-- [docs]
| | |-- readme.md
|-- config.json
- Square brackets `[name]` indicate a directory
- No brackets indicates a regular file
- Each depth level adds `" | "` (6 characters) of indentation
Output is produced one line at a time via PySys_WriteStdout. Any sys.stdout redirect in Python will capture every line. There is no internal buffer — a 50 million file filesystem produces 50+ million lines without accumulating memory.
# Print tree to terminal
anscom.scan(".", show_tree=True, max_depth=4)
# Capture tree in Python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan(".", show_tree=True, max_depth=3, silent=True)
sys.stdout = sys.__stdout__
tree_text = buf.getvalue()
# Save tree to file (see also export_tree)
anscom.scan("/data", show_tree=True, silent=True, export_tree="tree.txt")
callback
Type: callable | None — Default: None
A Python callable invoked approximately every 1 second with the current scanned file count as a single int argument. Fired by the progress thread every 4th tick (250ms × 4 = 1000ms).
The GIL is acquired before each call and released immediately after. Scan worker threads are never blocked by callback invocation.
def on_progress(n):
print(f"\rScanned: {n:,}", end="", flush=True)
result = anscom.scan("/data", callback=on_progress, silent=True)
print()
# Push to Prometheus
from prometheus_client import Gauge
g = Gauge("files_scanned", "Current file scan count")
anscom.scan("/data", callback=lambda n: g.set(n), silent=True)
export_json
Type: str | None — Default: None
Path to write the full result dict as a formatted JSON file. Uses Python's built-in json module — no external dependencies. Written with 4-space indentation after the scan completes.
The JSON file contains all keys that are in the returned dict, including optional keys (files, largest_files, duplicates) when those features are enabled in the same call.
anscom.scan("/data", export_json="report.json", silent=True)
# With optional features — JSON gets those keys too
anscom.scan(
"/data",
export_json = "report.json",
return_files = True,
largest_n = 10,
find_duplicates = True,
silent = True
)
Example output:
{
"total_files": 21008,
"scan_errors": 0,
"duration_seconds": 1.5186,
"categories": {
"Code/Source": 5955,
"Documents": 203,
"Images": 151,
"Videos": 0,
"Audio": 730,
"Archives": 0,
"Executables": 0,
"System/Config": 5707,
"Other/Unknown": 8992
},
"extensions": {
"py": 5955,
"pyc": 5707,
"mp3": 730,
"txt": 160,
"png": 151
}
}
export_tree
Type: str | None — Default: None
Path to write the tree output to a text file. Only active when show_tree=True.
The file is written incrementally — each line is written and flushed as it is produced. For a filesystem with 50 million entries this produces a multi-gigabyte file without accumulating any output in memory. stdout and the file both receive every line simultaneously.
anscom.scan(
"/mnt/storage",
show_tree = True,
export_tree = "filesystem_tree.txt",
silent = True,
max_depth = 64
)
export_csv
Type: str | None — Default: None — New in v1.5.0
Path to write a per-file inventory as a UTF-8 CSV. Columns: path, size, ext, category, mtime.
- `path`: full absolute path, RFC 4180-quoted (double-quoted, inner quotes doubled)
- `size`: file size in bytes as an integer
- `ext`: lowercase extension without the dot (empty string for unrecognized extensions)
- `category`: one of the 9 category names
- `mtime`: Unix timestamp (seconds since epoch) of last modification
anscom.scan("/data", export_csv="inventory.csv", silent=True)
Loading the CSV downstream:
# With pandas
import pandas as pd
df = pd.read_csv("inventory.csv")
print(df.groupby("category")["size"].sum().sort_values(ascending=False))
# Convert to Excel
df.to_excel("report.xlsx", index=False)
# Standard library only
import csv
with open("inventory.csv", newline="", encoding="utf-8") as f:
for row in csv.DictReader(f):
print(row["path"], row["size"])
# With openpyxl directly
import csv, openpyxl
wb = openpyxl.Workbook()
ws = wb.active
with open("inventory.csv", newline="", encoding="utf-8") as f:
for row in csv.reader(f):
ws.append(row)
wb.save("report.xlsx")
return_files
Type: bool — Default: False — New in v1.5.0
When True, the result dict gains a "files" key containing a Python list of dicts — one entry per scanned file.
Each dict has five fields:
| Field | Type | Description |
|---|---|---|
| `path` | str | Full absolute path to the file |
| `size` | int | File size in bytes |
| `ext` | str | Lowercase extension (no dot), empty string if unrecognized |
| `category` | str | One of the 9 category names |
| `mtime` | int | Unix timestamp of last modification |
result = anscom.scan("/project", return_files=True, silent=True)
# Iterate
for f in result["files"]:
print(f["path"], f["size"], f["category"])
# Filter in Python
large_code = [
f for f in result["files"]
if f["category"] == "Code/Source" and f["size"] > 50_000
]
# Sort by size descending
by_size = sorted(result["files"], key=lambda f: f["size"], reverse=True)
print("Largest file:", by_size[0]["path"])
# Group by extension
from collections import defaultdict
by_ext = defaultdict(list)
for f in result["files"]:
by_ext[f["ext"]].append(f)
len(result["files"]) == result["total_files"] is always true.
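Because each record already carries both `size` and `category`, downstream aggregation needs no further filesystem calls. A minimal sketch of per-category byte totals, using hypothetical records in the documented shape:

```python
from collections import defaultdict

def bytes_by_category(files):
    """Sum file sizes per category from a scan's `files` list."""
    totals = defaultdict(int)
    for f in files:
        totals[f["category"]] += f["size"]
    return dict(totals)

# Hypothetical records in the documented per-file shape
sample = [
    {"path": "/a.py", "size": 100, "ext": "py", "category": "Code/Source", "mtime": 0},
    {"path": "/b.py", "size": 50, "ext": "py", "category": "Code/Source", "mtime": 0},
    {"path": "/c.png", "size": 200, "ext": "png", "category": "Images", "mtime": 0},
]

print(bytes_by_category(sample))
# → {'Code/Source': 150, 'Images': 200}
```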
largest_n
Type: int — Default: 0 (disabled) — New in v1.5.0
When > 0, finds the top N files by size across the entire scanned filesystem. Uses a per-thread min-heap of capacity N — O(log N) per file, no extra pass, no sorting of the full file list. After all threads join, per-thread heaps are merged and sorted descending.
The result dict gains a "largest_files" key where each entry is a dict with path (str) and size (int).
result = anscom.scan("/mnt/storage", largest_n=20, silent=True)
for f in result["largest_files"]:
gb = f["size"] / (1024 ** 3)
print(f"{gb:8.2f} GB {f['path']}")
The printed report also gains a section:
=== TOP 20 LARGEST FILES ===========================
1073741824 bytes : /data/backup/archive.tar.gz
536870912 bytes : /data/media/4k_reel.mkv
...
===================================================
# Find the single largest file
result = anscom.scan("/mnt/nas", largest_n=1, silent=True)
top = result["largest_files"][0]
print(f"Largest: {top['path']} ({top['size']:,} bytes)")
# Top 100 across a petabyte volume
result = anscom.scan("/mnt/petabyte", largest_n=100, workers=64, silent=True)
find_duplicates
Type: bool — Default: False — New in v1.5.0
When True, detects duplicate files using a two-phase algorithm:
- Size bucketing — all files sorted by size. Files with a unique size are skipped entirely — zero I/O.
- CRC32 fingerprinting — for each same-size group (≥2 files, non-zero size), the first 4096 bytes of each file are read and CRC32 is computed. Files with matching CRC32 are reported as duplicates.
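The two phases above can be sketched in pure Python — an illustration of the idea, not Anscom's C implementation:

```python
import os
import zlib
from collections import defaultdict

def find_duplicate_groups(paths):
    """Two-phase duplicate detection: size bucketing, then CRC32 of the first 4KB."""
    # Phase 1: bucket by size. Files with a unique size need no I/O at all.
    by_size = defaultdict(list)
    for p in paths:
        size = os.path.getsize(p)
        if size > 0:
            by_size[size].append(p)

    groups = []
    for bucket in by_size.values():
        if len(bucket) < 2:
            continue  # unique size: cannot be a duplicate, zero reads
        # Phase 2: fingerprint the first 4096 bytes with CRC32.
        by_crc = defaultdict(list)
        for p in bucket:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read(4096))].append(p)
        groups.extend(g for g in by_crc.values() if len(g) >= 2)
    return groups
```

Note the trade-off the real implementation makes too: CRC32 of a 4KB prefix is a fast heuristic, not a cryptographic guarantee of identical content.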
The result dict gains a "duplicates" key: a list of groups, each group being a list of path strings. Every group has at least 2 members.
result = anscom.scan("/media-library", find_duplicates=True, silent=True)
print(f"Duplicate groups: {len(result['duplicates'])}")
for group in result["duplicates"]:
print(f"\nDuplicate set ({len(group)} files):")
for path in group:
print(f" {path}")
Calculating reclaimable space (combine with return_files=True):
result = anscom.scan(
"/mnt/archive",
find_duplicates = True,
return_files = True,
silent = True
)
size_map = {f["path"]: f["size"] for f in result["files"]}
wasted = sum(
sum(size_map.get(p, 0) for p in group[1:]) # keep 1, discard rest
for group in result["duplicates"]
)
print(f"Reclaimable: {wasted / (1024**3):.2f} GB across {len(result['duplicates'])} groups")
The printed report adds:
=== DUPLICATES SUMMARY ============================
Groups found : 142
===================================================
regex_filter
Type: str | None — Default: None — New in v1.5.0
A regular expression pattern. When set, only files whose full absolute path matches the pattern are counted, categorized, and included in any file-tracking output (return_files, export_csv, find_duplicates, largest_n).
Platform behavior:
- Linux / macOS: compiled with POSIX `regcomp(REG_EXTENDED | REG_NOSUB)` and matched with `regexec`, with no GIL acquisition; runs fully in C inside the worker threads.
- Windows: falls back to Python's `re` module (GIL acquired per file). For large scans on Windows, prefer `extensions` whitelist filtering, which has zero GIL cost.
The pattern is also compiled with Python's re.compile before the scan starts. An invalid pattern raises ValueError immediately.
# Only .py files anywhere under a tests/ directory
result = anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)
# Only files in directories named 'src'
result = anscom.scan("/project", regex_filter=r"/src/", silent=True)
# Only Python test files
result = anscom.scan("/repo", regex_filter=r"test_.*\.py$", silent=True)
print(f"Test files: {result['total_files']}")
# Invalid patterns raise ValueError immediately — no scan is started
try:
anscom.scan("/data", regex_filter=r"[invalid(")
except ValueError as e:
print(e) # Failed to compile regex_filter.
Export Features
All export parameters are independent and combinable. A single scan pass can write to all simultaneously — one traversal, multiple outputs, no re-scanning.
result = anscom.scan(
"/mnt/enterprise",
max_depth = 20,
workers = 32,
ignore_junk = True,
silent = True,
largest_n = 50,
find_duplicates = True,
return_files = True,
export_json = "audit.json",
export_csv = "inventory.csv",
show_tree = True,
export_tree = "tree.txt",
)
# One scan. Four output files. Full in-memory results.
| Parameter | Format | Dependencies | Notes |
|---|---|---|---|
| `export_json` | JSON | None (built-in) | Full result dict including optional keys |
| `export_csv` | CSV | None (built-in) | Per-file: path, size, ext, category, mtime |
| `export_tree` | Plain text | Requires `show_tree=True` | Written line-by-line, safe at any scale |
Tree Mode
# Basic tree to terminal
anscom.scan(".", show_tree=True)
# Tree saved to file
anscom.scan("/project", show_tree=True, export_tree="tree.txt", silent=True)
# Deep tree, no terminal output
import sys, io
sys.stdout = io.StringIO()
anscom.scan("/mnt/volume", show_tree=True, export_tree="tree.txt", max_depth=64)
sys.stdout = sys.__stdout__
Output format:
|-- [src] ← [brackets] = directory
| | |-- main.py ← no brackets = regular file
| | |-- [lib]
| | | | |-- utils.py
|-- config.json
|-- [tests]
| | |-- test_core.py
- One `" | "` block per depth level (6 chars each)
- At depth 64: 384 characters of indentation — all structurally valid
- DFS order is strict: every file inside a directory appears before that directory's sibling
- `workers` is forced to 1 — required for correct ordering
- No internal buffer — safe at 50+ million entries
Exclusion Filter
ignore_junk=True skips these directory names at any depth, at any nesting level:
.git .svn .hg .idea .vscode
node_modules bower_components site-packages .venv venv
env build dist target __pycache__
temp tmp .cache .pytest_cache .mypy_cache
The check is case-insensitive basename comparison — not a path substring match. A node_modules at /project/frontend/node_modules/ is caught regardless of nesting depth.
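The rule is equivalent to this pure-Python predicate (a sketch of the check, not the C code; the directory set below is the documented v1.5.0 list):

```python
import os

# The documented exclusion list, matched case-insensitively on the basename
JUNK_DIRS = {
    ".git", ".svn", ".hg", ".idea", ".vscode",
    "node_modules", "bower_components", "site-packages", ".venv", "venv",
    "env", "build", "dist", "target", "__pycache__",
    "temp", "tmp", ".cache", ".pytest_cache", ".mypy_cache",
}

def is_junk_dir(path):
    # Basename comparison only -- never a substring match on the full path.
    return os.path.basename(path.rstrip("/\\")).lower() in JUNK_DIRS
```

So `/project/frontend/node_modules` is skipped, but a sibling named `node_modules_backup` is not, because its basename differs.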
Report Format
Printed to sys.stdout when silent=False (the default).
Anscom Enterprise v1.5.0 (Threads: 16)
Target: /data
Scanned files: 21008 ...
=== SUMMARY REPORT ================================
+-----------------+--------------+----------+
| Category | Count | Percent |
+-----------------+--------------+----------+
| Code/Source | 5955 | 28.34% |
| System/Config | 5707 | 27.16% |
| Other/Unknown | 8992 | 42.81% |
| Documents | 203 | 0.97% |
| Images | 151 | 0.72% |
+-----------------+--------------+----------+
| TOTAL FILES | 21008 | 100.00% |
+-----------------+--------------+----------+
=== DETAILED EXTENSION BREAKDOWN ==================
+-----------------+--------------+
| Extension | Count |
+-----------------+--------------+
| .py | 5955 |
| .pyc | 5707 |
| .mp3 | 730 |
| .txt | 160 |
| .png | 151 |
+-----------------+--------------+
Time : 1.5186 seconds
Errors : 0 (permission denied / inaccessible)
===================================================
=== TOP 20 LARGEST FILES =========================== ← only with largest_n > 0
1073741824 bytes : /data/backup/full.tar.gz
...
=== DUPLICATES SUMMARY ============================ ← only with find_duplicates=True
Groups found : 142
===================================================
Capture programmatically:
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan("/data")
sys.stdout = sys.__stdout__
report_text = buf.getvalue()
File Categories and Extensions
170+ extensions across 9 categories. The table is sorted lexicographically and validated at module init — if the sort invariant is violated, import anscom raises RuntimeError.
| Category | Sample Extensions |
|---|---|
| Code/Source | c cpp cs go h html java js json jsx kt lua php py r rb rs sh sql swift ts vue xml yaml yml |
| Documents | csv doc docx epub md mobi odp ods odt pdf ppt pptx rst rtf txt xls xlsx |
| Images | ai avif bmp gif heic ico jpeg jpg png psd raw svg tiff webp |
| Videos | avi flv mkv mov mp4 mpeg ogv webm wmv |
| Audio | aac flac m4a mid mp3 ogg wav wma |
| Archives | 7z bz2 deb dmg gz iso jar rar tar tgz zip |
| Executables | app bin class dll elf exe msi pyd so |
| System/Config | bak cfg conf db env gitignore ini log pyc reg sys tmp ttf woff |
| Other/Unknown | Any extension not in the above table |
Architecture
OS backends
Three separate scanning implementations compiled and selected at build time:
| Platform | Backend | Mechanism |
|---|---|---|
| Linux | `getdents64` | Direct `syscall(SYS_getdents64, dirfd, buf, 131072)` — raw kernel ABI, 128KB read buffer, `d_type` for zero-stat type detection |
| Windows | `FindFirstFileW` | Wide-char `wchar_t` paths, UTF-16→UTF-8 conversion, size+mtime from `WIN32_FIND_DATAW` at no extra syscall cost |
| macOS / BSD | POSIX `readdir` | `opendir`/`readdir` with `lstat` for type resolution |
Thread model
main thread
├── spawn N worker threads (all waiting on cond var)
├── spawn 1 progress thread
├── push root path to queue
├── wait until queue.count == 0 && active_workers == 0
└── join all threads → merge stats
worker thread (×N)
└── loop: queue_pop → process_dir_recursive → queue_task_done
process_dir_recursive
├── depth < 3: push subdirs to queue (parallel pickup by idle threads)
└── depth ≥ 3: recurse inline (avoids queue overhead for deep narrow trees)
Per-thread stats — zero locks during counting
Each thread has its own ScanStats struct with ext_counts[170+] and cat_counts[9]. No lock is acquired during file categorization. The only shared atomic write per file is a single __sync_fetch_and_add for the progress counter. Stats are merged in one serial pass after all threads join.
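The merge step amounts to summing independent counter maps after all threads have joined. A schematic Python equivalent (the real merge sums fixed-size C arrays, not dicts):

```python
def merge_stats(per_thread):
    """Serially sum per-thread category/extension counters after join."""
    merged = {"categories": {}, "extensions": {}}
    for stats in per_thread:
        for key in ("categories", "extensions"):
            for name, count in stats[key].items():
                merged[key][name] = merged[key].get(name, 0) + count
    return merged
```

Because each thread only ever writes its own counters, no lock is needed during the scan; correctness comes from the strictly serial merge at the end.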
Slab path allocator
Each thread allocates (max_depth + 2) * PATH_MAX bytes once before scanning. Path strings during traversal are written into slab[depth * PATH_MAX] via snprintf. Zero heap allocation during traversal.
Extension hash table
512-slot open-addressing hash table with FNV-1a hash and linear probing. Built once at module init from the sorted extension table. O(1) average lookup, no heap allocation, never modified after init.
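For reference, 32-bit FNV-1a and the 512-slot index computation look like this in Python (a sketch of the standard algorithm; the C table additionally linear-probes on collision):

```python
FNV_OFFSET = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619     # standard 32-bit FNV prime

def fnv1a_32(text):
    """Standard 32-bit FNV-1a over the UTF-8 bytes of `text`."""
    h = FNV_OFFSET
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

def slot_for(ext, table_size=512):
    # Power-of-two table size: mask instead of modulo
    return fnv1a_32(ext) & (table_size - 1)
```

With 170+ keys in 512 slots the load factor stays around one third, which keeps linear probing short.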
FileArray pre-allocation
When return_files, export_csv, or find_duplicates is enabled, each thread pre-allocates a FileInfo array of 65,536 entries before scanning begins. Growth beyond that doubles via realloc. For typical filesystems: zero reallocations during the scan.
Min-heap for largest_n
Each thread maintains a min-heap of capacity N. Per-file cost: O(log N) comparison, no lock. Thread heaps merged globally after join using the same push logic.
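The same bounded-heap idea, sketched in Python with `heapq` (Anscom's heap lives in C, one per thread; this illustrates the technique):

```python
import heapq

def top_n_sizes(files, n):
    """Keep the N largest (path, size) pairs using a capacity-N min-heap."""
    heap = []
    for path, size in files:
        if len(heap) < n:
            heapq.heappush(heap, (size, path))
        elif size > heap[0][0]:
            # heap[0] is the smallest of the current top-N: evict and replace
            heapq.heapreplace(heap, (size, path))
    # Final sort descending, as the merged report is ordered
    return [{"path": p, "size": s} for s, p in sorted(heap, reverse=True)]
```

The key property: each file costs at most one O(log N) heap operation and most files cost only a single comparison against `heap[0]`, so the full file list is never sorted.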
Two-phase duplicate detection
1. `qsort` all files by size — O(M log M), no I/O
2. For each same-size group with ≥2 members: read the first 4KB of each file, compute CRC32, sort by CRC32, group consecutive matches
Zero I/O for unique-size files. One bounded read per candidate. CRC32 computed using a fully inlined lookup table — no external library.
Security and Compliance
| Property | Guarantee |
|---|---|
| No file contents read | Only directory entries and metadata. Exception: `find_duplicates=True` reads up to 4KB per candidate — bounded, opt-in, read-only |
| Symlinks never followed | Linux: `fstatat(AT_SYMLINK_NOFOLLOW)`. POSIX: `lstat`. Windows: `FILE_ATTRIBUTE_REPARSE_POINT` skipped unconditionally |
| Depth hard-capped at 64 | Enforced in C at the top of every `process_dir_recursive` call — cannot be bypassed by filesystem topology |
| All path assembly bounded | `snprintf(slab, PATH_MAX, ...)` — always null-terminated, always within `PATH_MAX` bytes |
| Errors counted, not silenced | Every failed `opendir`/`open`/`FindFirstFileW` increments `scan_errors` and continues — the final count is exact |
| Work queue bounded | 131,072 fixed slots. Overflow falls back to inline recursion — no unbounded allocation |
| Hash table immutable after init | Built once at module load. No runtime modification |
| Zero external dependencies | No mandatory third-party packages — no supply chain surface |
Enterprise Recipes
Storage cost allocation
import anscom
result = anscom.scan("/mnt/nas", workers=16, ignore_junk=True, silent=True)
total = result["total_files"]
cats = result["categories"]
media = cats["Videos"] + cats["Images"] + cats["Audio"]
code = cats["Code/Source"]
docs = cats["Documents"]
print(f"Media : {media:>10,} ({media/total*100:5.1f}%)")
print(f"Code : {code:>10,} ({code/total*100:5.1f}%)")
print(f"Docs : {docs:>10,} ({docs/total*100:5.1f}%)")
print(f"Total : {total:>10,} in {result['duration_seconds']:.2f}s")
Pre-migration audit
import anscom
result = anscom.scan(
"/legacy-server/data",
max_depth = 30,
silent = True,
return_files = True,
export_json = "audit.json",
export_csv = "inventory.csv"
)
print(f"Recorded {result['total_files']:,} files")
print(f"Errors : {result['scan_errors']}")
CI/CD policy gate
import anscom, sys
result = anscom.scan("./repo", silent=True, ignore_junk=True)
violations = []
if result["categories"]["Executables"] > 0:
violations.append(f"{result['categories']['Executables']} executable files")
if result["categories"]["Videos"] > 0:
violations.append(f"{result['categories']['Videos']} video files")
if violations:
for v in violations:
print(f"POLICY VIOLATION: {v}")
sys.exit(1)
print("File composition check passed.")
Storage reclamation
import anscom
result = anscom.scan(
"/mnt/media-archive",
find_duplicates = True,
return_files = True,
workers = 16,
silent = True
)
size_map = {f["path"]: f["size"] for f in result["files"]}
wasted = sum(
sum(size_map.get(p, 0) for p in group[1:])
for group in result["duplicates"]
)
print(f"Duplicate groups : {len(result['duplicates'])}")
print(f"Reclaimable : {wasted / (1024**3):.2f} GB")
groups_by_waste = sorted(
result["duplicates"],
key=lambda g: sum(size_map.get(p, 0) for p in g[1:]),
reverse=True
)
for group in groups_by_waste[:5]:
waste = sum(size_map.get(p, 0) for p in group[1:])
print(f"\n {waste / (1024**2):.1f} MB wasted:")
for path in group:
print(f" {path}")
Top-100 largest files
import anscom
result = anscom.scan("/mnt/storage", largest_n=100, workers=32, silent=True)
total_gb = sum(f["size"] for f in result["largest_files"]) / (1024**3)
print(f"Top 100 total: {total_gb:.1f} GB\n")
for i, f in enumerate(result["largest_files"][:10], 1):
print(f"{i:3}. {f['size']/1024**3:8.2f} GB {f['path']}")
Regex scan — test files only
import anscom
from collections import Counter
import os
result = anscom.scan(
"/codebase",
regex_filter = r"/tests?/.*\.py$",
return_files = True,
silent = True
)
print(f"Test files: {result['total_files']}")
dirs = Counter(os.path.dirname(f["path"]) for f in result["files"])
for d, count in dirs.most_common(10):
print(f" {count:4d} {d}")
Live Prometheus push
import anscom
from prometheus_client import Gauge, start_http_server
start_http_server(9090)
g_progress = Gauge("anscom_files_scanned", "Files scanned so far")
g_total = Gauge("anscom_total_files", "Total files found")
g_duration = Gauge("anscom_duration_seconds", "Scan duration")
result = anscom.scan(
"/data-lake",
callback = lambda n: g_progress.set(n),
silent = True,
workers = 32
)
g_total.set(result["total_files"])
g_duration.set(result["duration_seconds"])
Full audit — everything at once
import anscom
result = anscom.scan(
"/mnt/enterprise",
max_depth = 20,
workers = 32,
ignore_junk = True,
silent = True,
largest_n = 50,
find_duplicates = True,
return_files = True,
export_json = "audit.json",
export_csv = "inventory.csv",
show_tree = True,
export_tree = "tree.txt",
)
print(f"Files : {result['total_files']:,}")
print(f"Duration : {result['duration_seconds']:.3f}s")
print(f"Dup groups : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")
print("Written : audit.json inventory.csv tree.txt")
Changelog
v1.5.0 (current)
- Added `return_files` — per-file list in the result dict with `path`, `size`, `ext`, `category`, `mtime`
- Added `export_csv` — per-file inventory as UTF-8 CSV, zero dependencies, RFC 4180-compliant quoting
- Added `largest_n` — top-N files by size using a per-thread min-heap, O(log N) per file
- Added `find_duplicates` — size-bucket + CRC32 duplicate detection, zero I/O for unique-size files
- Added `regex_filter` — path pattern filter; POSIX `regexec` on Linux/macOS (no GIL), Python `re` fallback on Windows
- Added `FILEARRAY_INIT_CAP` (65536) pre-allocation per thread — zero reallocations for typical scans
- Fixed: `fstatat` on Linux is called only when needed — two separate guards for type resolution vs. size/mtime collection
- Fixed: `sorted_top` paths are `strdup`'d independently from `global_heap` — no lifetime overlap, no double-free
- Removed `export_excel` — it crashed on Windows due to an `openpyxl` `Workbook.read_only` exception; use `export_csv` + `pandas.to_excel()` instead
- Improved: full docstring on `anscom.scan`, accessible via `help(anscom.scan)`
v1.3.0
- Added `export_json`, `export_excel`, `export_tree`
- Fixed DFS tree output ordering
- Added file tracking in tree mode
v1.2.0 and earlier
- Multi-threaded worker pool with condition-variable termination detection
- `getdents64` direct syscall on Linux
- Per-thread statistics, zero shared state during scan
- `ignore_junk`, `min_size`, `extensions`, `callback`, `silent`
License
MIT License. Free for personal and commercial use.