
Modular Python tool for profiling files, analyzing directory structures, and inspecting image data

Project description


Fast, multi-backend Python tool for directory analysis and file profiling.

Analyze directory structures, profile files, and inspect image data, with automatic performance optimization through Rust (rayon, tokio, walkdir), the fd tool, or a pure Python backend.


Documentation: Installation • Backends • Advanced Usage • Benchmarks

Source Code: https://github.com/filoma/filoma

Key Features

  • 🚀 3 Performance Backends - Automatic selection: Rust (~2.3x faster *), fd (competitive), Python (baseline)
  • 📊 Directory Analysis - File counts, extensions, empty folders, depth distribution, size statistics
  • 🔍 Smart File Search - Advanced patterns with regex/glob support via FdFinder
  • 📈 DataFrame Support - Build Polars DataFrames for advanced analysis and filtering
  • 🖼️ Image Analysis - Profile .tif, .png, .npy, .zarr files with metadata and statistics
  • 📁 File Profiling - System metadata, permissions, timestamps, symlink analysis
  • 🎨 Rich Terminal Output - Beautiful progress bars and formatted reports
  • 🔀 ML-Friendly Splits - Deterministic train/val/test splits grouped by path or filename tokens

* According to benchmarks


Quick Start

With just a few lines of code, you can analyze directories, convert results to DataFrames, and profile files and images.

# Install
uv add filoma  # or: pip install filoma

Scan a directory and inspect the typed result:

from filoma import probe

analysis = probe('.')
analysis.print_summary()

Output:

Directory Analysis: /project (🦀 Rust (Parallel)) - 0.27s
Total Files: 17,330    Total Folders: 2,427    Analysis Time: 0.27 s

You can just as easily print a report of the full analysis:

analysis.print_report()

Convert your scan results to a Polars DataFrame for further exploration:

from filoma import probe_to_df

df = probe_to_df('.', use_rust=True)
print(df.select(['path','depth','is_file']).head(5))

Output (other columns omitted, e.g., parent, name, stem, suffix, size_bytes, modified_time, created_time, is_dir):

┌────────────────────────┬──────┬─────────┐
│ path                   │ depth│ is_file │
├────────────────────────┼──────┼─────────┤
│ pyproject.toml         │ 1    │ True    │
│ scripts                │ 1    │ False   │
│ .pytest_cache          │ 1    │ False   │
│ .vscode                │ 1    │ False   │
│ Makefile               │ 1    │ True    │
└────────────────────────┴──────┴─────────┘

Profile individual files and images with one-liners, and get a dataclass with rich metadata:

from filoma import probe_file, probe_image

filo = probe_file('README.md')
print(filo.path, filo.size)  

img = probe_image('images/logo.png')
print(img.file_type, getattr(img, 'shape', None))

Output:

README.md 12.3 KB
png (1024, 256)

filo includes attributes like path, size, mode, owner, group, created, modified, is_dir, is_file, sha256, and more, while img includes file_type, shape, dtype, min, max, mean, nans, infs, and more.
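
To give a sense of where attributes like these come from, here is a minimal standard-library sketch that gathers a few of them for a single path. It is not filoma's implementation: `FileInfo` and `profile_file` are hypothetical names used only for illustration, and the fields shown are a small subset of those listed above.

```python
import hashlib
import os
import stat
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FileInfo:  # hypothetical stand-in, not filoma's Filo dataclass
    path: str
    size: int
    mode: str
    modified: datetime
    is_file: bool
    is_dir: bool
    sha256: Optional[str]


def profile_file(path: str, compute_hash: bool = False) -> FileInfo:
    """Collect basic metadata for one path with os.stat."""
    st = os.stat(path)
    digest = None
    if compute_hash and stat.S_ISREG(st.st_mode):
        hasher = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks so large files need not fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        digest = hasher.hexdigest()
    return FileInfo(
        path=path,
        size=st.st_size,  # size in bytes
        mode=stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
        modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        is_file=stat.S_ISREG(st.st_mode),
        is_dir=stat.S_ISDIR(st.st_mode),
        sha256=digest,
    )
```

Hashing is optional in this sketch because it requires reading the whole file, which is far slower than a single stat call.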

This small API surface (probe, probe_to_df, probe_file, probe_image) covers most needs: typed outputs, optional DataFrame workflows, and built-in pretty printers, ready for scripts, demos, and REPLs.

Going Deeper (lower-level APIs)

Super simple directory analysis

Analyze a directory in one line and inspect the returned dataclass, or print a summary or full report:

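# DirectoryProfilerConfig is assumed to be exported alongside DirectoryProfiler
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig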

# Analyze a directory (returns DirectoryAnalysis object)
analysis = DirectoryProfiler(DirectoryProfilerConfig()).probe("/", max_depth=3)
analysis.print_summary()
analysis.print_report()

The DirectoryProfiler class offers extensive customization and control over backends, concurrency, and filtering. See advanced usage for details.
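
For intuition about what a directory analysis collects, the following is a rough pure-Python sketch built on `os.walk`. It is an illustration under stated assumptions, not filoma's Python backend: `DirSummary` and `summarize` are hypothetical names, and the statistics are limited to counts, extensions, empty folders, and folder depth.

```python
import os
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirSummary:  # hypothetical, not filoma's DirectoryAnalysis
    total_files: int = 0
    total_folders: int = 0
    empty_folders: int = 0
    extensions: Counter = field(default_factory=Counter)
    depth_distribution: Counter = field(default_factory=Counter)


def summarize(root: str, max_depth: Optional[int] = None) -> DirSummary:
    """Walk `root` and tally files, folders, extensions, and folder depths."""
    summary = DirSummary()
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if dirpath != root:  # the root itself is not counted as a folder
            summary.total_folders += 1
            summary.depth_distribution[depth] += 1
        if not dirnames and not filenames:
            summary.empty_folders += 1
        for name in filenames:
            summary.total_files += 1
            ext = os.path.splitext(name)[1].lower() or "<none>"
            summary.extensions[ext] += 1
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune in place: os.walk will not descend further
    return summary
```

Clearing `dirnames` in place is the documented way to stop `os.walk` from descending, which is how a depth limit can be applied without a second pass.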

Network filesystems: recommended approach

For NFS/SMB/cloud-fuse or other network-mounted filesystems, prefer a two-step strategy:

  1. Try fd with multithreading first: fast discovery with controlled parallelism often gives the best performance with fewer issues.
    • Example: DirectoryProfiler(DirectoryProfilerConfig(use_fd=True, threads=8)) or set search_backend='fd'.
  2. If you still need higher concurrency for high-latency mounts, enable the Rust async scanner as a secondary option (use_async=True) and tune network_concurrency, network_timeout_ms, and network_retries.

Short tips:

  • Start with use_fd and a modest threads value (4 to 16), then validate server load.
  • Use async only when fd + multithreading isn't sufficient for your latency profile.
  • Reduce concurrency if the server throttles or shows instability; increase timeout for very slow metadata calls.
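
The idea behind these tips (bounded parallelism plus retries for slow metadata calls) can be sketched with the standard library alone. This is not how filoma's fd or Rust backends work internally; `stat_with_retries` and `scan_sizes` are hypothetical helpers, and their `threads`, `retries`, and `backoff_s` parameters only loosely echo the options named above.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Optional


def stat_with_retries(path: str, retries: int = 2, backoff_s: float = 0.1) -> Optional[int]:
    """Return the file size in bytes, or None if every attempt raised OSError."""
    for attempt in range(retries + 1):
        try:
            return os.stat(path).st_size
        except OSError:
            # A real scanner would retry only transient errors (e.g. timeouts);
            # this sketch retries on any OSError for brevity.
            if attempt < retries:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return None


def scan_sizes(
    paths: Iterable[str],
    threads: int = 8,
    retries: int = 2,
    backoff_s: float = 0.1,
) -> Dict[str, Optional[int]]:
    """Stat many paths with at most `threads` concurrent metadata calls."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sizes = pool.map(lambda p: stat_with_retries(p, retries, backoff_s), paths)
        return dict(zip(paths, sizes))
```

Capping `max_workers` is what keeps the load on the file server predictable: lowering it trades throughput for stability, which matches the advice to reduce concurrency when the server throttles.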

Smart File Search

The FdFinder class provides advanced file searching with regex and glob support, leveraging the high-performance fd tool when available.

from filoma.directories import FdFinder

searcher = FdFinder()

# Find Python files
python_files = searcher.find_files(pattern=r"\.py$", max_depth=2)

# Find by multiple extensions
code_files = searcher.find_by_extension(['py', 'rs', 'js'], path=".")

# Glob patterns
config_files = searcher.find_files(pattern="*.{json,yaml}", use_glob=True)
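
To clarify what the two pattern styles select, here is a small standard-library sketch that applies a regex and a brace-style glob to file names. It does not call fd or FdFinder; `match_regex`, `match_glob`, and `expand_braces` are hypothetical helpers, and the brace expansion is hand-rolled because Python's `fnmatch` does not support `{a,b}` alternatives.

```python
import fnmatch
import re
from pathlib import PurePosixPath
from typing import List


def match_regex(paths: List[str], pattern: str) -> List[str]:
    """Keep paths whose file name matches the regex anywhere (re.search)."""
    rx = re.compile(pattern)
    return [p for p in paths if rx.search(PurePosixPath(p).name)]


def expand_braces(pattern: str) -> List[str]:
    """Expand {a,b} groups into separate patterns, one group at a time."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[: m.start()], pattern[m.end():]
    expanded = []
    for alt in m.group(1).split(","):
        expanded.extend(expand_braces(head + alt + tail))
    return expanded


def match_glob(paths: List[str], pattern: str) -> List[str]:
    """Keep paths whose file name matches any brace-expanded glob."""
    globs = expand_braces(pattern)
    return [
        p for p in paths
        if any(fnmatch.fnmatchcase(PurePosixPath(p).name, g) for g in globs)
    ]


paths = ["src/app.py", "src/lib.rs", "conf/settings.json", "conf/ci.yaml", "notes.pyc"]
print(match_regex(paths, r"\.py$"))          # ['src/app.py']
print(match_glob(paths, "*.{json,yaml}"))    # ['conf/settings.json', 'conf/ci.yaml']
```

The `$` anchor in the regex is what excludes `notes.pyc`; a glob such as `*.py` is implicitly anchored at both ends.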

DataFrame Analysis

filoma can build Polars DataFrames for advanced analysis and filtering, allowing you to leverage the full power of Polars for downstream tasks.

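# Build a DataFrame for advanced analysis
# (DirectoryProfilerConfig is assumed to be exported alongside DirectoryProfiler)
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig

profiler = DirectoryProfiler(DirectoryProfilerConfig(build_dataframe=True))
result = profiler.probe(".")
df = profiler.get_dataframe(result)

# Add path components and file stats columns
df = df.add_path_components().add_file_stats_cols()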
python_files = df.filter_by_extension('.py')
df.save_csv("analysis.csv")

File & Image Profiling (one-liners)

File metadata and image analysis are easy with the top-level helpers:

import filoma
import numpy as np

# File profiling (returns Filo dataclass)
filo = filoma.probe_file("/path/to/file.txt", compute_hash=False)
print(filo.path, filo.size)
print(filo.to_dict())

# Image profiling from file (dispatches to PNG/NPY/TIF/ZARR profilers)
img_report = filoma.probe_image("/path/to/image.png")
print(img_report.file_type, img_report.shape)

# Or analyze a numpy array directly
arr = np.zeros((64, 64), dtype=np.uint8)
img_report2 = filoma.probe_image(arr)
print(img_report2.to_dict())

ML-Friendly Splitting

Deterministic train/val/test splits grouped by filename or path-derived features (prevents related files leaking across sets).

from filoma import probe_to_df, ml

# Create DataFrame from directory
df = probe_to_df('.')  # DataFrame with a 'path' column
# Discover filename tokens that can be used for grouping,
# e.g., 'sample1_imageA.png' -> token1='sample1', token2='imageA'
df = ml.discover_filename_features(df, sep='_', prefix=None)  # adds token1, token2, ...

# `auto_split` can now use these tokens to group files
train, val, test = ml.auto_split(df, train_val_test=(70,15,15))
print(len(train), len(val), len(test))

# Or group by parent folder instead (parts index -2)
train_p, val_p, test_p = ml.auto_split(df, how='parts', parts=(-2,), seed=42)

# You can also choose the return type (filoma, polars, or pandas).
# 'filoma' is the default and also provides helper methods such as `.add_file_stats_cols()`,
# which uses filoma's file profiling under the hood.
train_f, val_f, test_f = ml.auto_split(df, return_type='filoma')

Notes: splits are hash-based and deterministic. If the resulting split sizes drift from the requested ratios, a warning is logged; use verbose=False to silence it.
For example usage, see the ml_examples notebook.
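
The general technique of a hash-based, grouped split can be sketched as follows. This is an illustration of the idea rather than filoma's `auto_split`: `assign_split` and `split_by_token` are hypothetical names, and the choice of SHA-256 and of the first filename token as the group key are assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Sequence


def assign_split(group_key: str, ratios: Sequence[int] = (70, 15, 15), seed: int = 0) -> str:
    """Map a group key to 'train', 'val', or 'test' using a stable hash."""
    digest = hashlib.sha256(f"{seed}:{group_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % sum(ratios)
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "val"
    return "test"


def split_by_token(
    filenames: List[str],
    sep: str = "_",
    token_index: int = 0,
    ratios: Sequence[int] = (70, 15, 15),
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Group files by one filename token so related files share a split."""
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        tokens = stem.split(sep)
        # Fall back to the whole stem if the requested token is missing.
        key = tokens[token_index] if token_index < len(tokens) else stem
        splits[assign_split(key, ratios, seed)].append(name)
    return splits
```

Because the assignment depends only on the group key and the seed, rerunning the split on a grown dataset leaves existing files where they were. The trade-off is that the ratios hold only approximately: with few or unevenly sized groups the split sizes can drift from the requested proportions.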

Performance

Automatic backend selection for optimal speed:

Backend      Speed             Use Case
🦀 Rust      ~70K files/sec    Large directories, DataFrame building
🔍 fd        ~46K files/sec    Pattern matching, network filesystems
🐍 Python    ~30K files/sec    Universal compatibility, reliable fallback

Cold cache benchmarks on NVMe SSD. See benchmarks for detailed methodology.
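
A files/sec figure is simply the number of files discovered divided by wall-clock time. The sketch below measures that for a plain `os.walk` traversal; it is not the project's benchmark harness, and `files_per_second` is a hypothetical helper. Results depend heavily on hardware and on whether the OS has already cached the directory metadata.

```python
import os
import time
from typing import Tuple


def files_per_second(root: str) -> Tuple[int, float]:
    """Count files under `root` and return (count, files per second)."""
    start = time.perf_counter()
    count = sum(len(filenames) for _, _, filenames in os.walk(root))
    elapsed = time.perf_counter() - start
    rate = (count / elapsed) if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    n, rate = files_per_second(".")
    print(f"{n} files at {rate:,.0f} files/sec")
```

Running the function twice on the same directory typically shows a faster second run, because the first pass warms the filesystem cache. That is why the cache state matters when comparing numbers.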

System directories: filoma automatically handles permission errors for directories like /proc, /sys.

Installation & Setup

See installation guide for:

  • Quick setup with uv/pip
  • Optional performance optimization (Rust/fd)
  • Verification and troubleshooting

Documentation

Project Structure

src/filoma/
├── core/          # Backend integrations (fd, Rust)
├── directories/   # Directory analysis with 3 backends
├── files/         # File profiling and metadata
└── images/        # Image analysis (.tif, .png, .npy, .zarr)

License

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.


Contributing

Contributions welcome! Please check the issues for planned features and bug reports.


filoma - Fast, multi-backend file and directory analysis for Python.


Download files


Source Distribution

filoma-1.7.3.tar.gz (561.5 kB)

Uploaded: Source

Built Distributions

filoma-1.7.3-cp311-cp311-win_amd64.whl (383.9 kB)

Uploaded: CPython 3.11, Windows x86-64

filoma-1.7.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (565.4 kB)

Uploaded: CPython 3.11, manylinux: glibc 2.17+ x86-64

filoma-1.7.3-cp311-cp311-macosx_11_0_arm64.whl (511.5 kB)

Uploaded: CPython 3.11, macOS 11.0+ ARM64

File details

Details for the file filoma-1.7.3.tar.gz.

File metadata

  • Download URL: filoma-1.7.3.tar.gz
  • Upload date:
  • Size: 561.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for filoma-1.7.3.tar.gz
Algorithm Hash digest
SHA256 cba47892b18414c76b6da2b07c6bd7c13236cc18294cd6ef652e6bedc54dffe1
MD5 b74adf064e716b1096860af5d4232386
BLAKE2b-256 ca3817e99ea806f76f5f08e3d0289cece454347c72cc9222d169aa2382092284

See more details on using hashes here.

Provenance

The following attestation bundles were made for filoma-1.7.3.tar.gz:

Publisher: publish.yml on kalfasyan/filoma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file filoma-1.7.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: filoma-1.7.3-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 383.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for filoma-1.7.3-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8914c0bf5219c08c411a5aabf81c4f2ae8c5076a3bf73cc94493327e119cc6b7
MD5 0cd4573bd2e895ae720c6ac34e62e007
BLAKE2b-256 fc0d2518b1f162cb01036862e896ef53cf44dd59776c300eeecc987b5a4ef403


Provenance

The following attestation bundles were made for filoma-1.7.3-cp311-cp311-win_amd64.whl:

Publisher: publish.yml on kalfasyan/filoma


File details

Details for the file filoma-1.7.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.


File hashes

Hashes for filoma-1.7.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 048a04cb5c3f8a7c27c8c334314cae2fa2f32b9b08f41ad6cbbf86ec552cff8e
MD5 0236d8a1df8509f36e2eee43b510e8f7
BLAKE2b-256 2395bc610d0f96931b54732e8421c281e065f28752792e4430a58e232fa450ca


Provenance

The following attestation bundles were made for filoma-1.7.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on kalfasyan/filoma


File details

Details for the file filoma-1.7.3-cp311-cp311-macosx_11_0_arm64.whl.


File hashes

Hashes for filoma-1.7.3-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9cd32602a1e841fc67080db45f815113bdfe26d0ec2409dd09bdbdee5d231eae
MD5 fa71191b40fd4b2e16ca77526559730b
BLAKE2b-256 d4a81fb3a9ecb16621fd2419c14a6947afb7de8f3edb90085b103f6bfeba603d


Provenance

The following attestation bundles were made for filoma-1.7.3-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: publish.yml on kalfasyan/filoma

