Modular Python tool for profiling files, analyzing directory structures, and inspecting image data
Fast, multi-backend Python tool for directory analysis and file profiling.
Analyze directory structures, profile files, and inspect image data with automatic performance optimization through Rust (rayon, tokio, walkdir), fd tool, or pure Python backends.
Documentation: Installation • Backends • Advanced Usage • Benchmarks
Source Code: https://github.com/filoma/filoma
Key Features
- 3 Performance Backends - Automatic selection: Rust (~2.3x faster *), fd (competitive), Python (baseline)
- Directory Analysis - File counts, extensions, empty folders, depth distribution, size statistics
- Smart File Search - Advanced patterns with regex/glob support via FdFinder
- DataFrame Support - Build Polars DataFrames for advanced analysis and filtering
- Image Analysis - Profile .tif, .png, .npy, .zarr files with metadata and statistics
- File Profiling - System metadata, permissions, timestamps, symlink analysis
- Rich Terminal Output - Progress bars and formatted reports
- ML-Friendly Splits - Deterministic train/val/test splits grouped by path or filename tokens
* According to benchmarks
Quick Start
With just a few lines of code, you can analyze directories, convert results to DataFrames, and profile files and images.
# Install
uv add filoma # or: pip install filoma
Scan a directory and inspect the typed result:
from filoma import probe
analysis = probe('.')
analysis.print_summary()
Output:
Directory Analysis: /project (Rust (Parallel)) - 0.27s
Total Files: 17,330 Total Folders: 2,427 Analysis Time: 0.27 s
You can just as easily print a report of the full analysis:
analysis.print_report()
Convert your scan results to a Polars DataFrame for further exploration:
from filoma import probe_to_df
df = probe_to_df('.', use_rust=True)
print(df.select(['path','depth','is_file']).head(5))
Output (other columns omitted, e.g., parent, name, stem, suffix, size_bytes, modified_time, created_time, is_dir):
┌────────────────────────┬──────┬─────────┐
│ path                   │ depth│ is_file │
├────────────────────────┼──────┼─────────┤
│ pyproject.toml         │ 1    │ True    │
│ scripts                │ 1    │ False   │
│ .pytest_cache          │ 1    │ False   │
│ .vscode                │ 1    │ False   │
│ Makefile               │ 1    │ True    │
└────────────────────────┴──────┴─────────┘
Profile individual files and images with one-liners, and get a dataclass with rich metadata:
from filoma import probe_file, probe_image
filo = probe_file('README.md')
print(filo.path, filo.size)
img = probe_image('images/logo.png')
print(img.file_type, getattr(img, 'shape', None))
Output:
README.md 12.3 KB
png (1024, 256)
filo includes attributes like path, size, mode, owner, group, created, modified, is_dir, is_file, sha256, and more, while img includes file_type, shape, dtype, min, max, mean, nans, infs, and more.
This minimal surface area (probe, probe_to_df, probe_file, probe_image) covers most needs: typed outputs, optional DataFrame workflows, and built-in pretty printers, ready for scripts, demos, and REPLs.
Going Deeper (lower-level APIs)
Super simple directory analysis
Analyze a directory in one line and inspect the returned dataclass, or print a summary or full report:
from filoma.directories import DirectoryProfiler
# Analyze a directory (returns DirectoryAnalysis object)
analysis = DirectoryProfiler().probe("/", max_depth=3)
analysis.print_summary()
analysis.print_report()
The DirectoryProfiler class offers extensive customization and control over backends, concurrency, and filtering. See advanced usage for details.
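To make the reported fields concrete, here is a minimal pure-Python sketch of the kind of statistics a directory analysis collects (file counts, extensions, empty folders, depth distribution). It relies only on os.walk and is not filoma's implementation; summarize_tree and its dictionary keys are hypothetical names chosen for illustration.

```python
import os
import tempfile
from collections import Counter
from pathlib import Path


def summarize_tree(root):
    """Collect basic directory statistics with os.walk (illustrative only)."""
    root_path = Path(root)
    stats = {
        "total_files": 0,
        "total_folders": 0,
        "extensions": Counter(),
        "depth_distribution": Counter(),
        "empty_folders": [],
    }
    for dirpath, dirnames, filenames in os.walk(root_path):
        # Depth of this directory relative to the root (the root itself is 0).
        depth = len(Path(dirpath).relative_to(root_path).parts)
        if depth > 0:  # do not count the root itself as a folder
            stats["total_folders"] += 1
            stats["depth_distribution"][depth] += 1
        if not dirnames and not filenames:
            stats["empty_folders"].append(dirpath)
        for name in filenames:
            stats["total_files"] += 1
            stats["extensions"][Path(name).suffix or "<none>"] += 1
            # A file sits one level below the directory that contains it.
            stats["depth_distribution"][depth + 1] += 1
    return stats


# Small demo on a throwaway tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "src").mkdir()
    (base / "src" / "a.py").write_text("print('a')\n")
    (base / "src" / "b.py").write_text("print('b')\n")
    (base / "README.md").write_text("# demo\n")
    (base / "empty").mkdir()
    demo_stats = summarize_tree(base)

print(demo_stats["total_files"], demo_stats["total_folders"])  # 3 2
print(dict(demo_stats["extensions"]))
```

A single-threaded walk like this corresponds roughly to the pure Python baseline; the Rust and fd backends exist to parallelize or speed up the same traversal.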
Network filesystems: recommended approach
For NFS/SMB/cloud-fuse or other network-mounted filesystems, prefer a two-step strategy:
- Try fd with multithreading first: fast discovery with controlled parallelism often gives the best performance with fewer issues.
  - Example: DirectoryProfiler(use_fd=True, threads=8) or set search_backend='fd'.
- If you still need higher concurrency for high-latency mounts, enable the Rust async scanner as a secondary option (use_async=True) and tune network_concurrency, network_timeout_ms, and network_retries.
Short tips:
- Start with use_fd + a modest threads (4–16) and validate server load.
- Use async only when fd + multithreading isn't sufficient for your latency profile.
- Reduce concurrency if the server throttles or shows instability; increase timeout for very slow metadata calls.
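The idea behind these tips (a bounded number of metadata calls in flight, plus retries for slow or flaky mounts) can be sketched with the standard library. This is a generic illustration, not how filoma's backends are implemented; stat_with_retries, stat_many, and their parameters are hypothetical names.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stat_with_retries(path, retries=2, backoff_s=0.05):
    """Return os.stat_result for `path`, retrying transient OSErrors."""
    for attempt in range(retries + 1):
        try:
            return os.stat(path)
        except FileNotFoundError:
            return None  # the entry does not exist; retrying will not help
        except OSError:
            if attempt == retries:
                return None
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return None


def stat_many(paths, threads=8):
    """Stat all paths with at most `threads` metadata calls in flight."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(stat_with_retries, paths)))


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(5):
        p = Path(tmp) / f"file_{i}.txt"
        p.write_text("x" * (i + 1))
        paths.append(str(p))
    paths.append(str(Path(tmp) / "missing.txt"))
    results = stat_many(paths, threads=4)
    sizes = [r.st_size if r is not None else None for r in results.values()]

print(sizes)  # [1, 2, 3, 4, 5, None]
```

Lowering threads reduces pressure on the file server, while raising retries or backoff_s trades latency for resilience, which mirrors the tuning advice above.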
Smart File Search
The FdFinder class provides advanced file searching with regex and glob support, leveraging the high-performance fd tool when available.
from filoma.directories import FdFinder
searcher = FdFinder()
# Find Python files
python_files = searcher.find_files(pattern=r"\.py$", max_depth=2)
# Find by multiple extensions
code_files = searcher.find_by_extension(['py', 'rs', 'js'], path=".")
# Glob patterns
config_files = searcher.find_files(pattern="*.{json,yaml}", use_glob=True)
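The difference between the regex and glob modes can be shown with the standard library alone. The sketch below does not call fd or FdFinder; it only illustrates the matching semantics. Note one assumption: Python's fnmatch has no brace expansion, so the `*.{json,yaml}` pattern is expanded by hand into two patterns here.

```python
import fnmatch
import re

names = ["main.py", "setup.cfg", "config.json", "settings.yaml", "notes.py.bak"]

# Regex: searched anywhere in the name; `$` anchors the match to the end.
py_regex = re.compile(r"\.py$")
regex_hits = [n for n in names if py_regex.search(n)]

# Glob: the pattern must match the whole name.
glob_patterns = ["*.json", "*.yaml"]
glob_hits = [
    n for n in names
    if any(fnmatch.fnmatchcase(n, pattern) for pattern in glob_patterns)
]

print(regex_hits)  # ['main.py']
print(glob_hits)   # ['config.json', 'settings.yaml']
```

The unescaped-dot pitfall is worth remembering: the regex `.py$` without the backslash would also match a name such as `happy`, which is why the example above escapes the dot.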
DataFrame Analysis
filoma can build Polars DataFrames for advanced analysis and filtering, allowing you to leverage the full power of Polars for downstream tasks.
from filoma.directories import DirectoryProfiler
# Build DataFrame for advanced analysis
profiler = DirectoryProfiler(build_dataframe=True)
result = profiler.probe(".")
df = profiler.get_dataframe(result)
# Add path components and probe
df = df.add_path_components().add_file_stats_cols()
python_files = df.filter_by_extension('.py')
df.save_csv("analysis.csv")
File & Image Profiling (one-liners)
File metadata and image analysis are easy with the top-level helpers:
import filoma
import numpy as np
# File profiling (returns Filo dataclass)
filo = filoma.probe_file("/path/to/file.txt", compute_hash=False)
print(filo.path, filo.size)
print(filo.to_dict())
# Image profiling from file (dispatches to PNG/NPY/TIF/ZARR profilers)
img_report = filoma.probe_image("/path/to/image.png")
print(img_report.file_type, img_report.shape)
# Or analyze a numpy array directly
arr = np.zeros((64, 64), dtype=np.uint8)
img_report2 = filoma.probe_image(arr)
print(img_report2.to_dict())
ML-Friendly Splitting
Deterministic train/val/test splits grouped by filename or path-derived features (prevents related files leaking across sets).
from filoma import probe_to_df, ml
# Create DataFrame from directory
df = probe_to_df('.') # DataFrame with 'path'
# A method can discover filename tokens that can be used for grouping
# e.g., 'sample1_imageA.png' -> token1='sample1', token2='imageA'
df = ml.discover_filename_features(df, sep='_', prefix=None) # adds token1, token2, ...
# `auto_split` can now use these tokens to group files
train, val, test = ml.auto_split(df, train_val_test=(70,15,15))
print(len(train), len(val), len(test))
# Or group by parent folder instead (parts index -2)
train_p, val_p, test_p = ml.auto_split(df, how='parts', parts=(-2,), seed=42)
# You can also choose what return type you want (filoma, polars or pandas)
# with 'filoma' being the default, you can also make use of cool methods like `.add_file_stats_cols()`
# that uses the filoma file profiling under the hood
train_f, val_f, test_f = ml.auto_split(df, return_type='filoma')
Notes: splits are hash-based and deterministic. If the resulting splits drift from the requested ratios, a warning is logged; pass verbose=False to silence it.
To see some example usage, check out the ml_examples notebook.
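A generic sketch of hash-based grouped splitting may help explain both the determinism and the ratio drift mentioned in the notes. It is not filoma's algorithm: first_token and assign_split are hypothetical helpers, and the hashing scheme (SHA-256 of seed plus group key) is an assumption made for illustration.

```python
import hashlib
from collections import defaultdict


def first_token(filename, sep="_"):
    """'sample1_imageA.png' -> 'sample1' (the grouping key used here)."""
    return filename.split(sep)[0]


def assign_split(group_key, ratios=(70, 15, 15), seed=42):
    """Map a group key to 'train', 'val', or 'test' via a stable hash."""
    digest = hashlib.sha256(f"{seed}:{group_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % sum(ratios)
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "val"
    return "test"


# 20 samples with 3 images each; every image of a sample shares one group key.
files = [f"sample{i}_image{c}.png" for i in range(1, 21) for c in "ABC"]
splits = defaultdict(list)
for name in files:
    splits[assign_split(first_token(name))].append(name)

print({split: len(members) for split, members in splits.items()})
```

Because whole groups are assigned at once and the assignment depends only on the key and seed, reruns give identical splits, but the observed proportions only approximate the requested ratios, especially with few groups.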
Performance
Automatic backend selection for optimal speed:
| Backend | Speed | Use Case |
|---|---|---|
| Rust | ~70K files/sec | Large directories, DataFrame building |
| fd | ~46K files/sec | Pattern matching, network filesystems |
| Python | ~30K files/sec | Universal compatibility, reliable fallback |
Cold cache benchmarks on NVMe SSD. See benchmarks for detailed methodology.
System directories: filoma automatically handles permission errors for directories like /proc, /sys.
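As a quick sanity check, the approximate throughput figures in the table can be turned into relative speedups and rough scan times. The numbers below are copied from the table; the one-million-file tree is a hypothetical workload, and real timings will vary with hardware and cache state.

```python
# Approximate throughput figures from the table (files per second).
throughput = {"rust": 70_000, "fd": 46_000, "python": 30_000}

# Speedup of each backend relative to the pure Python baseline.
baseline = throughput["python"]
speedup = {name: rate / baseline for name, rate in throughput.items()}

# Rough scan time for a hypothetical tree of one million files.
n_files = 1_000_000
est_seconds = {name: n_files / rate for name, rate in throughput.items()}

for name in throughput:
    print(f"{name:>6}: {speedup[name]:.2f}x baseline, "
          f"~{est_seconds[name]:.1f} s per 1M files")
```

The Rust ratio works out to roughly 2.3x, which matches the figure quoted in Key Features.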
Installation & Setup
See installation guide for:
- Quick setup with uv/pip
- Optional performance optimization (Rust/fd)
- Verification and troubleshooting
Documentation
- Installation Guide - Setup and optimization
- Backend Architecture - How the multi-backend system works
- Advanced Usage - DataFrame analysis, pattern matching, backend control
- Performance Benchmarks - Detailed performance analysis and methodology
Project Structure
src/filoma/
├── core/          # Backend integrations (fd, Rust)
├── directories/   # Directory analysis with 3 backends
├── files/         # File profiling and metadata
└── images/        # Image analysis (.tif, .png, .npy, .zarr)
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Contributing
Contributions welcome! Please check the issues for planned features and bug reports.
filoma - Fast, multi-backend file and directory analysis for Python.