
Modular Python tool for profiling files, analyzing directory structures, and inspecting image data

Project description

filoma logo


Fast, multi-backend file/directory profiling and data preparation.

pip install filoma


📖 New to Filoma? Check out the Cookbook for practical, copy-paste recipes for common tasks!


filoma helps you analyze directory trees, inspect file metadata, and prepare your data for exploration. It does this blazingly fast by picking the best available backend (Rust, fd, or pure Python) ⚡🍃

Whether you're auditing a machine-learning dataset, tracking down duplicates across terabytes, or just need a quick overview of what's in a directory — Filoma gives you the tools to go from raw folder structure to actionable insight in seconds.

Filoma Package Overview

Key Features

  • 🚀 High-Performance Backends: Automatic selection of Rust, fd, or Python for the best performance.
  • 📈 DataFrame Integration: Convert scan results to Polars (or pandas) DataFrames for powerful analysis.
  • 📊 Rich Directory Analysis: Get detailed statistics on file counts, extensions, sizes, and more.
  • 🔍 Smart File Search: Use regex and glob patterns to find files with FdFinder.
  • 🖼️ File/Image Profiling: Extract metadata and statistics from various file formats.
  • 🛡️ Dataset Integrity & Quality: Unified integrity checking for snapshots, manifests, and automated quality scans (corruption, duplicates, leakage, class balance). 📖 Data Integrity Guide →
  • 🧠 Agentic Analysis: Natural language interface for file discovery, deduplication, and metadata inspection. 📖 Filaraki Guide →
  • 🖥️ Interactive CLI: Beautiful terminal interface for filesystem exploration and DataFrame analysis. 📖 CLI Documentation →
  • 🌐 MCP Server: Expose all 21 filesystem tools to any MCP-compatible AI assistant (nanobot recommended). 📖 MCP Configuration →

🍃 Talk to your filesystem: filoma filaraki chat — ask questions about your data in plain English. Find duplicates, audit datasets, export HTML reports — all from one conversation. Try it →

🎯 Local AI in 10 seconds: curl -sL https://raw.githubusercontent.com/kalfasyan/filoma/main/scripts/install.sh | sh → Use with nanobot + Ollama for fully local filesystem analysis. Learn more →



⚡ Quick Start

filoma provides a unified API for filesystem analysis.

End-to-End Example: Folder → DataFrame → Insights

This is the core Filoma workflow in one place: scan a folder, build a rich dataframe, filter it, and extract quick insights.

import filoma as flm

dataset = "notebooks/Weeds-3"

# 1) Fast scan + high-level summary
analysis = flm.probe(dataset)
analysis.print_summary()

# 2) Build an enriched dataframe (paths, extension, sizes, ownership, timestamps, etc.)
df = flm.probe_to_df(dataset, enrich=True)

# 3) Narrow to image files and inspect distribution
images = df.filter_by_extension(["jpg", "png"])
print(images.extension_counts())
print(images.directory_counts().head(3))

# 4) Get the largest files quickly
largest = images.sort("size_bytes", descending=True).head(5)
print(largest.select(["path", "size_bytes"]))

This flow is typically the fastest way to move from raw folder structure to actionable dataset insight.

1. File & Image Profiling

Extract rich metadata and statistics from any file or image.

import filoma as flm

# Profile any file
info = flm.probe_file("README.md")
print(info)
📄 See Metadata Output
Filo(
    path=PosixPath('README.md'),
    size=12237,
    mode_str='-rw-rw-r--',
    owner='user',
    modified=datetime.datetime(2025, 12, 30, 22, 45, 53),
    is_file=True,
    ...
)

For images, probe_image automatically extracts shapes, types, and pixel statistics.
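As a rough, stdlib-only illustration of the kind of statistics an image profiler reports (shape plus pixel min/max/mean), here is a sketch over a nested-list "image". The `image_stats` helper is hypothetical, not part of filoma's API; `probe_image` works on real image files.

```python
# Illustrative sketch only: computes the kind of shape and pixel statistics
# an image profiler reports, using a plain nested list as the "image".
def image_stats(pixels):
    """Return shape and basic pixel statistics for a 2-D list of values."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    flat = [value for row in pixels for value in row]
    return {
        "shape": (height, width),
        "min": min(flat),
        "max": max(flat),
        "mean": sum(flat) / len(flat),
    }

if __name__ == "__main__":
    img = [[0, 128], [255, 128]]
    print(image_stats(img))  # shape (2, 2), min 0, max 255, mean 127.75
```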

2. Directory Analysis

Scan entire directory trees in milliseconds. filoma automatically picks the fastest available backend (Rust → fd → Python).

# Analyze a directory
analysis = flm.probe('.')

# Print high-level summary
analysis.print_summary()
📂 See Directory Summary Table
 Directory Analysis: /project (🦀 Rust (Parallel)) - 0.60s
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Files              │ 57,225               │
│ Total Folders            │ 3,427                │
│ Total Size               │ 2,084.90 MB          │
│ Average Files per Folder │ 16.70                │
│ Maximum Depth            │ 14                   │
│ Empty Folders            │ 103                  │
│ Analysis Time            │ 0.60s                │
│ Processing Speed         │ 102,114 items/sec    │
└──────────────────────────┴──────────────────────┘
# Or get a detailed report with extensions and folder stats
analysis.print_report()
📊 See Detailed Directory Report
          File Extensions
┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ Extension  ┃ Count  ┃ Percentage ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ .py        │ 240    │ 0.4%       │
│ .jpg       │ 1,204  │ 2.1%       │
│ .json      │ 431    │ 0.8%       │
│ .svg       │ 28,674 │ 50.1%      │
└────────────┴────────┴────────────┘

          Common Folder Names
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Folder Name   ┃ Occurrences ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ src           │ 1           │
│ tests         │ 1           │
│ docs          │ 1           │
│ notebooks     │ 1           │
└───────────────┴─────────────┘

          Empty Folders (3 found)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Path                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ /project/data/raw/empty_set_A              │
│ /project/logs/old/unused                   │
│ /project/temp/scratch                      │
└────────────────────────────────────────────┘

3. DataFrame Analysis

Convert scan results to Polars DataFrames with filesystem-specific operations for filtering, grouping, and summarization.

# Scan and get an enriched filoma.DataFrame (Polars)
df = flm.probe_to_df('src', enrich=True)

# Filter and analyze
df.filter_by_extension([".py", ".rs"])
df.extension_counts()
df.directory_counts()
📊 See Enriched DataFrame Output
filoma.DataFrame with 2 rows
shape: (2, 18)
┌───────────────────┬───────┬────────┬───────────────┬───┬─────────┬───────┬────────┬────────┐
│ path              ┆ depth ┆ parent ┆ name          ┆ … ┆ inode   ┆ nlink ┆ sha256 ┆ xattrs │
│ ---               ┆ ---   ┆ ---    ┆ ---           ┆   ┆ ---     ┆ ---   ┆ ---    ┆ ---    │
│ str               ┆ i64   ┆ str    ┆ str           ┆   ┆ i64     ┆ i64   ┆ str    ┆ str    │
╞═══════════════════╪═══════╪════════╪═══════════════╪═══╪═════════╪═══════╪════════╪════════╡
│ src/async_scan.rs ┆ 1     ┆ src    ┆ async_scan.rs ┆ … ┆ 7601121 ┆ 1     ┆ null   ┆ {}     │
│ src/filoma        ┆ 1     ┆ src    ┆ filoma        ┆ … ┆ 7603126 ┆ 8     ┆ null   ┆ {}     │
└───────────────────┴───────┴────────┴───────────────┴───┴─────────┴───────┴────────┴────────┘

✨ Enriched columns: parent, name, stem, suffix, size_bytes, modified_time,
   created_time, is_file, is_dir, owner, group, mode_str, inode, nlink, sha256, xattrs, depth
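Several of these path-derived columns (parent, name, stem, suffix, depth) can be sketched per path with the standard library. The column names below follow the table above, but the `enrich` helper is illustrative, not filoma's actual code; the stat-based columns (owner, inode, sha256, …) need real filesystem calls and are omitted.

```python
# Derive enrichment-style columns for one path (illustrative, stdlib-only).
from pathlib import PurePosixPath

def enrich(path, scan_root):
    """Return path-derived columns; depth is counted from the scan root."""
    p = PurePosixPath(path)
    rel = p.relative_to(scan_root)
    return {
        "path": str(p),
        "parent": str(p.parent),
        "name": p.name,
        "stem": p.stem,
        "suffix": p.suffix,
        "depth": len(rel.parts),
    }

# Matches the first row of the table above: parent 'src', depth 1, suffix '.rs'
print(enrich("src/async_scan.rs", "src"))
```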
🔍 See Operation Examples

extension_counts() — groups files by extension and returns counts.

shape: (3, 2)
┌────────────┬─────┐
│ extension  ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ .py        ┆ 240 │
│ .jpg       ┆ 124 │
│ .json      ┆ 43  │
└────────────┴─────┘

directory_counts() — summarizes file distribution across parent directories.

shape: (3, 2)
┌────────────┬─────┐
│ parent_dir ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ src/filoma ┆ 12  │
│ tests      ┆ 8   │
│ docs       ┆ 5   │
└────────────┴─────┘
  • Seamless Pandas Integration: Just use df.pandas for instant conversion.
  • Lazy Loading: import filoma is cheap; heavy dependencies load only when needed.
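One common way to implement this kind of lazy loading is a small proxy that defers the import until first attribute access. This is a sketch of the general pattern, not filoma's actual internals; `json` stands in for a heavy dependency such as polars.

```python
# Sketch of a lazy-import proxy: the module is imported on first use.
import importlib

class LazyModule:
    """Defer importing a module until an attribute is first accessed."""

    def __init__(self, module_name):
        self._name = module_name
        self._module = None

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails,
        # i.e. for attributes of the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

pl = LazyModule("json")          # nothing imported yet
print(pl.dumps({"a": 1}))        # import happens here, on first use
```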

🗂️ Advanced Topics

Dataset Convenience Class

Use the Dataset class for orchestration of snapshotting, profiling, integrity checks, and AI interactions:

import filoma as flm

ds = flm.Dataset("./my_data")

# Snapshot, Quality Scan, and Deduplication
ds.snap(mode="deep")
ds.run_quality_scan()
ds.dedup()

# Get an enriched DataFrame of the dataset
df = ds.to_dataframe()
print(df.extension_counts())

# Agentic interaction with this specific dataset
ds.get_filaraki().run("Is there any class imbalance in my dataset?")

Dataset Integrity & Quality

Filoma provides a comprehensive suite for dataset validation (corruption, leaks, balance) and manifest integrity:

from filoma.core.verifier import DatasetVerifier
verifier = DatasetVerifier("./data")
verifier.run_all()
verifier.print_summary()

Deduplication

Find duplicate files, images (perceptual hash), or text files.

# Standard find
filoma dedup /path/to/dataset

# Cross-directory find
filoma dedup train/ valid/ --cross-dir

🍃 Agentic Analysis

Filaraki ("little leaf" / "little buddy" in Greek) is Filoma's agentic interface for natural language filesystem analysis. Available as an interactive chat CLI, programmatic API, or MCP server.

Filaraki Chat Interface

Interactive Chat CLI

The fastest way to get started is with the setup wizard, which configures your AI provider and writes a .env file:

bash scripts/setup_env.sh

Then start chatting:

filoma filaraki chat

💡 The .env file is automatically loaded — no need for --env-file or export commands.

Programmatic Usage

import asyncio

from filoma.filaraki import get_agent

async def main():
    agent = get_agent()
    await agent.run("Create a dataframe from notebooks/Weeds-3 with enrichment")
    await agent.run("Filter by extension: jpg, png")
    await agent.run("Sort dataframe by size descending and show top 5")

asyncio.run(main())

AI Service Options

Filaraki supports multiple providers — pick whatever fits your setup:

Provider                          Requires                                    Privacy
Ollama (default)                  ollama serve on localhost:11434             🔒 100% local
Mistral AI                        MISTRAL_API_KEY                             Cloud
Google Gemini                     GEMINI_API_KEY                              Cloud
OpenAI / OpenRouter / compatible  FILOMA_FILARAKI_BASE_URL + OPENAI_API_KEY   Cloud

🎯 Quick setup: Run bash scripts/setup_env.sh to configure any provider interactively.

📖 Full AI configuration guide →

🏠 Local AI Setup (Nanobot + Ollama)

Run Filoma Filaraki completely offline with local models via the MCP server:

curl -sL https://raw.githubusercontent.com/kalfasyan/filoma/main/scripts/install.sh | sh

This installs nanobot + Ollama with Filoma's 21 filesystem tools. No API keys, no cloud — everything stays on your machine.

📖 Full MCP Configuration Guide →

📊 One-Command Audit with HTML Report

Run a full audit and export a self-contained interactive HTML report in one prompt:

filoma filaraki chat
> perform an audit on /path/to/dataset and export an html report called audit.html
📝 What's in the report?
  • Score gauges for Hygiene and Migration Readiness
  • KPI strip showing file counts, duplicate groups, and space waste
  • Stage timing bars (integrity / hygiene / readiness)
  • Priority-tagged Next Actions — colour-coded high / medium / low
  • Duplicate evidence cards with exact file paths
  • Collapsible full JSON payload for deeper inspection

Export formats: html, json, md

MCP Server

Expose all 21 filesystem tools to any MCP-compatible client:

filoma mcp serve

📖 Browse all guides →


📊 Performance & Benchmarks

Backend    Local SSD (1M files)     Network (200K files)
🦀 Rust    7.3s — 136K files/sec    2.3s — 86K files/sec
Async      11.5s — 87K files/sec    2.8s — 70K files/sec
🐍 Python  35.5s — 28K files/sec    15.1s — 13K files/sec

Run the benchmark on your own directory:

python benchmarks/benchmark.py --path /your/directory -n 3 --backend profiling

📖 Full Benchmarks Guide →


License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0


Contributing

Contributions welcome! Please check the issues for planned features and bug reports.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

filoma-1.12.5.tar.gz (713.6 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

filoma-1.12.5-cp311-cp311-win_amd64.whl (498.9 kB)

Uploaded: CPython 3.11, Windows x86-64

filoma-1.12.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (669.0 kB)

Uploaded: CPython 3.11, manylinux: glibc 2.17+, x86-64

filoma-1.12.5-cp311-cp311-macosx_11_0_arm64.whl (613.6 kB)

Uploaded: CPython 3.11, macOS 11.0+ ARM64

File details

Details for the file filoma-1.12.5.tar.gz.

File metadata

  • Download URL: filoma-1.12.5.tar.gz
  • Size: 713.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filoma-1.12.5.tar.gz
Algorithm Hash digest
SHA256 03c9bf6726eeca45036925427e4a5481e5f558b1cd3907d7178b482b22ec4420
MD5 ecc5043d8f15c9957778e4701735d120
BLAKE2b-256 05d9812fd9f2ed240fa793f47c435a2e1865c639110970caab43406924b15fbb

See more details on using hashes here.

Provenance

The following attestation bundles were made for filoma-1.12.5.tar.gz:

Publisher: publish.yml on kalfasyan/filoma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file filoma-1.12.5-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: filoma-1.12.5-cp311-cp311-win_amd64.whl
  • Size: 498.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filoma-1.12.5-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 d75e22dfc1ebc8802b86b5362e198449d9a86c0c8e4eab7baf436c865f40847c
MD5 cf9809b934dc32204fd3770529144916
BLAKE2b-256 9cb86c8e6daa860b484a940abcee6d6b34f33795a643659b1f5b446046e518cc

See more details on using hashes here.

Provenance

The following attestation bundles were made for filoma-1.12.5-cp311-cp311-win_amd64.whl:

Publisher: publish.yml on kalfasyan/filoma


File details

Details for the file filoma-1.12.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for filoma-1.12.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 af2b9437676410e352bc940176d7bb35730f13042b91fe0bab6bbd640a24cc5a
MD5 cf4a64c313530d8d6108bd41c492c6c6
BLAKE2b-256 9ae20ad2434d3f20000c2ec5a98be3217c0602c966a0d6b534664a02a98715b6

See more details on using hashes here.

Provenance

The following attestation bundles were made for filoma-1.12.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on kalfasyan/filoma


File details

Details for the file filoma-1.12.5-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for filoma-1.12.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bf796f09f161d103fa818c084b61d2602dc7cb3724612de736776cb4a8937c99
MD5 80aa795b4facba6ced2e1a10d281c909
BLAKE2b-256 81a5065eb03864e335b0d05a584b14bc10e96af12fa0b7cc6a098ff39ba8f6c1

See more details on using hashes here.

Provenance

The following attestation bundles were made for filoma-1.12.5-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: publish.yml on kalfasyan/filoma

