Modular Python tool for profiling files, analyzing directory structures, and inspecting image data
Fast, multi-backend file/directory profiling and data preparation.
pip install filoma
📖 New to Filoma? Check out the Cookbook for practical, copy-paste recipes for common tasks!
filoma helps you analyze directory trees, inspect file metadata, and prepare your data for exploration. It automatically uses the fastest available backend (Rust, fd, or pure Python) ⚡🍃
Key Features
- 🚀 High-Performance Backends: Automatic selection of Rust, fd, or Python for the best performance.
- 📈 DataFrame Integration: Convert scan results to Polars (or pandas) DataFrames for powerful analysis.
- 📊 Rich Directory Analysis: Get detailed statistics on file counts, extensions, sizes, and more.
- 🔍 Smart File Search: Use regex and glob patterns to find files with FdFinder.
- 🖼️ File/Image Profiling: Extract metadata and statistics from various file formats.
- 🛡️ Dataset Integrity & Quality: Unified integrity checking for snapshots, manifests, and automated quality scans (corruption, duplicates, leakage, class balance). 📖 Data Integrity Guide →
- 🧠 Agentic Analysis: Natural language interface for file discovery, deduplication, and metadata inspection. 📖 Filaraki Guide →
- 🖥️ Interactive CLI: Beautiful terminal interface for filesystem exploration and DataFrame analysis. 📖 CLI Documentation →
- 🌐 MCP Server: Expose all 21 filesystem tools to any MCP-compatible AI assistant (nanobot recommended). 📖 MCP Configuration →
🎯 Local AI in 10 seconds:
curl -sL https://raw.githubusercontent.com/kalfasyan/filoma/main/scripts/install.sh | sh
→ Use with nanobot + Ollama for fully local filesystem analysis. Learn more →
⚡ Quick Start
filoma provides a unified API for filesystem analysis.
End-to-End Example: Folder → DataFrame → Insights
This is the core Filoma workflow in one place: scan a folder, build a rich dataframe, filter it, and extract quick insights.
import filoma as flm
dataset = "notebooks/Weeds-3"
# 1) Fast scan + high-level summary
analysis = flm.probe(dataset)
analysis.print_summary()
# 2) Build an enriched dataframe (paths, extension, sizes, ownership, timestamps, etc.)
df = flm.probe_to_df(dataset, enrich=True)
# 3) Narrow to image files and inspect distribution
images = df.filter_by_extension(["jpg", "png"])
print(images.extension_counts())
print(images.directory_counts().head(3))
# 4) Get the largest files quickly
largest = images.sort("size_bytes", descending=True).head(5)
print(largest.select(["path", "size_bytes"]))
This flow is typically the fastest way to move from raw folder structure to actionable dataset insight.
1. File & Image Profiling
Extract rich metadata and statistics from any file or image.
import filoma as flm
# Profile any file
info = flm.probe_file("README.md")
print(info)
📄 See Metadata Output
Filo(
path=PosixPath('README.md'),
size=12237,
mode_str='-rw-rw-r--',
owner='user',
modified=datetime.datetime(2025, 12, 30, 22, 45, 53),
is_file=True,
...
)
For images, probe_image automatically extracts shapes, types, and pixel statistics.
2. Directory Analysis
Scan entire directory trees in milliseconds. filoma automatically picks the fastest available backend (Rust → fd → Python).
# Analyze a directory
analysis = flm.probe('.')
# Print high-level summary
analysis.print_summary()
📂 See Directory Summary Table
Directory Analysis: /project (🦀 Rust (Parallel)) - 0.60s
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Files │ 57,225 │
│ Total Folders │ 3,427 │
│ Total Size │ 2,084.90 MB │
│ Average Files per Folder │ 16.70 │
│ Maximum Depth │ 14 │
│ Empty Folders │ 103 │
│ Analysis Time │ 0.60s │
│ Processing Speed │ 102,114 items/sec │
└──────────────────────────┴──────────────────────┘
# Or get a detailed report with extensions and folder stats
analysis.print_report()
📊 See Detailed Directory Report
File Extensions
┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ Extension ┃ Count ┃ Percentage ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ .py │ 240 │ 12.8% │
│ .jpg │ 1,204 │ 64.2% │
│ .json │ 431 │ 23.0% │
│ .svg │ 28,674 │ 50.1% │
└────────────┴────────┴────────────┘
Common Folder Names
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Folder Name ┃ Occurrences ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ src │ 1 │
│ tests │ 1 │
│ docs │ 1 │
│ notebooks │ 1 │
└───────────────┴─────────────┘
Empty Folders (3 found)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Path ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ /project/data/raw/empty_set_A │
│ /project/logs/old/unused │
│ /project/temp/scratch │
└────────────────────────────────────────────┘
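As a rough mental model of what the summary table reports, the sketch below computes similar metrics (file and folder counts, total size, maximum depth, empty folders) with only the standard library. It is not filoma's implementation; the function name `summarize_tree`, the returned keys, and the choice to exclude the root from the folder count are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def summarize_tree(root):
    """Walk `root` once and collect a few directory-level metrics."""
    root = Path(root)
    total_files = 0
    total_folders = 0
    total_bytes = 0
    max_depth = 0
    empty_folders = []

    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        # Depth is measured relative to the scanned root (root itself is 0).
        depth = len(current.relative_to(root).parts)
        max_depth = max(max_depth, depth)
        if current != root:
            total_folders += 1  # assumption: the root is not counted
        if not dirnames and not filenames:
            empty_folders.append(current.relative_to(root).as_posix())
        for name in filenames:
            total_files += 1
            try:
                total_bytes += (current / name).stat().st_size
            except OSError:
                pass  # file vanished or is unreadable; skip its size

    return {
        "total_files": total_files,
        "total_folders": total_folders,
        "total_bytes": total_bytes,
        "max_depth": max_depth,
        "empty_folders": sorted(empty_folders),
    }


# Small throwaway tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "src" / "pkg").mkdir(parents=True)
    (base / "logs").mkdir()
    (base / "src" / "a.py").write_bytes(b"print('hi')\n")
    (base / "src" / "pkg" / "b.py").write_bytes(b"x = 1\n")
    demo_summary = summarize_tree(base)

print(demo_summary)
```

A single-threaded walk like this corresponds to the pure-Python end of the backend chain; the Rust and fd backends exist to make the same kind of traversal faster on large trees.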
3. DataFrame Analysis
Convert scan results to Polars DataFrames for advanced analysis.
# Scan and get an enriched filoma.DataFrame (Polars)
df = flm.probe_to_df('src', enrich=True)
# Perform operations
df.filter_by_extension([".py", ".rs"])
df.directory_counts()
📊 See Enriched DataFrame Output
filoma.DataFrame with 2 rows
shape: (2, 18)
┌───────────────────┬───────┬────────┬───────────────┬───┬─────────┬───────┬────────┬────────┐
│ path ┆ depth ┆ parent ┆ name ┆ … ┆ inode ┆ nlink ┆ sha256 ┆ xattrs │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ str ┆ str │
╞═══════════════════╪═══════╪════════╪═══════════════╪═══╪═════════╪═══════╪════════╪════════╡
│ src/async_scan.rs ┆ 1 ┆ src ┆ async_scan.rs ┆ … ┆ 7601121 ┆ 1 ┆ null ┆ {} │
│ src/filoma ┆ 1 ┆ src ┆ filoma ┆ … ┆ 7603126 ┆ 8 ┆ null ┆ {} │
└───────────────────┴───────┴────────┴───────────────┴───┴─────────┴───────┴────────┴────────┘
✨ Enriched columns added: parent, name, stem, suffix, size_bytes, modified_time,
created_time, is_file, is_dir, owner, group, mode_str, inode, nlink, sha256, xattrs, depth
- Seamless Pandas Integration: Just use df.pandas for instant conversion.
- Lazy Loading: import filoma is cheap; heavy dependencies load only when needed.
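How filoma defers its heavy imports is not described here. One common pattern for lazy loading is a small proxy that imports the real module on first attribute access, sketched below with a standard-library module standing in for a heavy dependency. The class name `LazyModule` is hypothetical.

```python
import importlib


class LazyModule:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None  # nothing imported yet

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself,
        # so _name and _module never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# `statistics` stands in for a heavy dependency such as a DataFrame library.
stats = LazyModule("statistics")
loaded_before = stats._module is not None   # proxy created, module not imported
mean_value = stats.mean([1, 2, 3])          # first use triggers the import
loaded_after = stats._module is not None

print(loaded_before, mean_value, loaded_after)
```

The trade-off is that import errors surface at first use rather than at startup.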
4. Specialized DataFrame Operations
Filoma's DataFrame extends Polars with filesystem-specific operations for quick filtering and summarization.
# Filter by extensions
df.filter_by_extension([".py", ".rs"])
# Quick frequency analysis
df.extension_counts()
df.directory_counts()
🔍 See Operation Examples
filter_by_extension([".py", ".rs"])
shape: (3, 1)
┌─────────────────────┐
│ path │
│ --- │
│ str │
╞═════════════════════╡
│ src/async_scan.rs │
│ src/lib.rs │
│ src/filoma/dedup.py │
└─────────────────────┘
extension_counts() — groups files by extension and returns counts.
shape: (3, 2)
┌────────────┬─────┐
│ extension ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═════╡
│ .py ┆ 240 │
│ .jpg ┆ 124 │
│ .json ┆ 43 │
└────────────┴─────┘
directory_counts() — summarizes file distribution across parent directories.
shape: (3, 2)
┌────────────┬─────┐
│ parent_dir ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═════╡
│ src/filoma ┆ 12 │
│ tests ┆ 8 │
│ docs ┆ 5 │
└────────────┴─────┘
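To make the semantics of these three operations concrete, here is a plain-Python approximation over a list of paths. It is only a sketch: the helper names are hypothetical, the last two paths are invented sample data, and accepting extensions both with and without a leading dot is an assumption (the examples in this document use both forms).

```python
from collections import Counter
from pathlib import PurePosixPath

paths = [
    "src/async_scan.rs",
    "src/lib.rs",
    "src/filoma/dedup.py",
    "tests/test_dedup.py",  # invented sample path
    "docs/index.md",        # invented sample path
]


def filter_paths_by_extension(paths, extensions):
    """Keep paths whose suffix matches any of `extensions` (dot optional)."""
    wanted = {
        e.lower() if e.startswith(".") else "." + e.lower() for e in extensions
    }
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]


def count_extensions(paths):
    """Count files per extension."""
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)


def count_parent_dirs(paths):
    """Count files per immediate parent directory."""
    return Counter(str(PurePosixPath(p).parent) for p in paths)


code_files = filter_paths_by_extension(paths, [".py", "rs"])
ext_counts = count_extensions(paths)
dir_counts = count_parent_dirs(paths)

print(code_files)
print(ext_counts.most_common())
print(dir_counts.most_common())
```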
🗂️ Advanced Topics
Dataset Convenience Class
Use the Dataset class for orchestration of snapshotting, profiling, integrity checks, and AI interactions:
import filoma as flm
ds = flm.Dataset("./my_data")
# Snapshot, Quality Scan, and Deduplication
ds.snap(mode="deep")
ds.run_quality_scan()
ds.dedup()
# Get an enriched DataFrame of the dataset
df = ds.to_dataframe()
print(df.extension_counts())
# Agentic interaction with this specific dataset
ds.get_filaraki().run("Is there any class imbalance in my dataset?")
Dataset Integrity & Quality
Filoma provides a comprehensive suite for dataset validation (corruption, leaks, balance) and manifest integrity:
from filoma.core.verifier import DatasetVerifier
verifier = DatasetVerifier("./data")
verifier.run_all()
verifier.print_summary()
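The idea behind manifest integrity can be sketched independently of filoma: record a hash per file, then compare a later scan against that record. The code below is a minimal illustration using the standard library, not DatasetVerifier's implementation; `build_manifest`, `diff_manifest`, and the manifest layout (relative path mapped to SHA-256 digest) are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(old, new):
    """Report files added, removed, or changed between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in shared if old[k] != new[k]),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"alpha")
    (root / "b.txt").write_bytes(b"beta")
    before = build_manifest(root)

    # Simulate drift: edit one file, delete one, add one.
    (root / "a.txt").write_bytes(b"alpha, edited")
    (root / "b.txt").unlink()
    (root / "c.txt").write_bytes(b"gamma")
    after = build_manifest(root)

manifest_diff = diff_manifest(before, after)
print(manifest_diff)
```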
Deduplication
Find duplicate files, images (perceptual hash), or text files.
# Standard find
filoma dedup /path/to/dataset
# Cross-directory find
filoma dedup train/ valid/ --cross-dir
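For byte-identical duplicates, the usual approach is to group files by size first and hash only the groups with more than one member. The sketch below shows that technique across two directories, loosely mirroring the cross-directory case; it is not how filoma dedup is implemented, the function names are hypothetical, and perceptual or text similarity is not covered.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def content_digest(path):
    """SHA-256 of the whole file (fine for a small demo; chunk for big files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_exact_duplicates(*roots):
    """Return groups of byte-identical files found under the given roots."""
    by_size = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a unique size cannot have a byte-identical twin
        for path in candidates:
            by_hash[content_digest(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "train").mkdir()
    (base / "valid").mkdir()
    (base / "train" / "a.jpg").write_bytes(b"same")
    (base / "valid" / "b.jpg").write_bytes(b"same")
    (base / "train" / "c.jpg").write_bytes(b"diff")  # same size, other content
    groups = find_exact_duplicates(base / "train", base / "valid")
    duplicate_names = [
        [p.relative_to(base).as_posix() for p in group] for group in groups
    ]

print(duplicate_names)
```

A file that appears in both a training and a validation folder, as in this demo, is also a simple case of the leakage that the quality scan is meant to flag.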
🍃 Agentic Analysis
Filaraki (Greek for "little leaf" or "little buddy") is Filoma's agentic interface for natural language filesystem analysis. It lets you query and inspect your data with plain-language commands.
🏠 Local AI Setup (Nanobot + Ollama)
Run Filoma Filaraki completely offline with local models via the MCP server:
# One-command setup
curl -sL https://raw.githubusercontent.com/kalfasyan/filoma/main/scripts/install.sh | sh
This installs nanobot and configures it to use Ollama with Filoma's 21 filesystem tools. No API keys, no cloud services—everything stays on your machine.
# After setup, chat with your filesystem
nanobot agent -m "How many images are in ./data?"
nanobot agent -m "Find duplicate files and show me the largest ones"
📖 Full MCP Configuration Guide →
Interactive Chat CLI
Start a chat session directly from your terminal:
filoma filaraki chat
Programmatic Usage
Use Python for scripted workflows. The calls are awaited, so in a plain script they need to run inside a coroutine:
import asyncio
from filoma.filaraki import get_agent
async def main():
    agent = get_agent()
    await agent.run("Create a dataframe from notebooks/Weeds-3 with enrichment")
    await agent.run("Filter by extension: jpg, png")
    await agent.run("Summarize dataframe and show top directories")
    await agent.run("Sort dataframe by size descending and show top 5")
asyncio.run(main())
Advanced Workflow Orchestration
Filoma Filaraki includes advanced orchestrator tools for enterprise-grade dataset analysis:
# Run advanced workflow examples
make filaraki-advanced
# Or in Python code (inside a coroutine, with an agent from get_agent()):
await agent.run("Run a corrupted file audit on /path/to/dataset")
await agent.run("Generate a dataset hygiene report for /path/to/dataset")
await agent.run("Assess the migration readiness of /path/to/dataset")
These provide structured, deterministic reports with detailed findings, recommendations, and confidence scores.
MCP Server
Expose all 21 filesystem tools to any MCP-compatible client:
filoma mcp serve
📊 Performance & Benchmarks
Need to compare backend performance? Check out the comprehensive Benchmarks Guide!
Local SSD (1M files):
- 🦀 Rust: 7.3s (136K files/sec)
- ⚡ Async: 11.5s (87K files/sec)
- 🐍 Python: 35.5s (28K files/sec)
Network Storage (200K files, cold cache):
- 🦀 Rust: 2.3s (86K files/sec)
- ⚡ Async: 2.8s (70K files/sec)
- 🐍 Python: 15.1s (13K files/sec)
python benchmarks/benchmark.py --path /your/directory -n 3 --backend profiling
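The files/sec figures are item counts divided by wall-clock time. The sketch below shows that arithmetic and a minimal timing loop around os.walk as a pure-Python baseline. It is not benchmarks/benchmark.py; `time_scan` and `throughput` are hypothetical helpers, and taking the best of several repeats is just one reasonable convention.

```python
import os
import tempfile
import time
from pathlib import Path


def throughput(items, seconds):
    """Items processed per second."""
    return items / seconds


def time_scan(path, repeats=3):
    """Count files and folders under `path`; return (items, best time in s)."""
    best = float("inf")
    items = 0
    for _ in range(repeats):
        start = time.perf_counter()
        items = sum(len(dirs) + len(files) for _, dirs, files in os.walk(path))
        best = min(best, time.perf_counter() - start)
    return items, best


# Arithmetic check against a figure quoted above: 1M files in 7.3 s.
rust_rate = throughput(1_000_000, 7.3)
print(f"{rust_rate:,.0f} files/sec")

# Timing a tiny throwaway tree; real benchmarks need far larger inputs.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ("a.txt", "b.txt", "sub/c.txt"):
        (base / name).write_bytes(b"x")
    scanned_items, best_seconds = time_scan(base)

print(scanned_items, best_seconds)
```

Warm versus cold filesystem caches change results substantially, which is why the network-storage figures above are labelled as cold-cache.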
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Contributing
Contributions welcome! Please check the issues for planned features and bug reports.