filoma

Modular Python tool for profiling files, analyzing directory structures, and inspecting image data

Fast, multi-backend file/directory profiling and data preparation.

pip install filoma

📖 New to Filoma? Check out the Cookbook for practical, copy-paste recipes for common tasks!

filoma helps you analyze directory trees, inspect file metadata, and prepare your data for exploration. It keeps scans fast by automatically using the best available backend (Rust, fd, or pure Python) ⚡🍃

Filoma Package Overview

Key Features

🚀 High-Performance Backends: Automatic selection of Rust, fd, or Python for the best performance.
📈 DataFrame Integration: Convert scan results to Polars (or pandas) DataFrames for powerful analysis.
📊 Rich Directory Analysis: Get detailed statistics on file counts, extensions, sizes, and more.
🔍 Smart File Search: Use regex and glob patterns to find files with FdFinder.
🖼️ File/Image Profiling: Extract metadata and statistics from various file formats.
🛡️ Dataset Integrity & Quality: Unified integrity checking for snapshots, manifests, and automated quality scans (corruption, duplicates, leakage, class balance). 📖 Data Integrity Guide →
🧠 Agentic Analysis: Natural language interface for file discovery, deduplication, and metadata inspection. 📖 Brain Guide →
🖥️ Interactive CLI: Beautiful terminal interface for filesystem exploration and DataFrame analysis. 📖 CLI Documentation →

⚡ Quick Start

filoma provides a unified API for filesystem analysis.

End-to-End Example: Folder → DataFrame → Insights

This is the core Filoma workflow in one place: scan a folder, build a rich dataframe, filter it, and extract quick insights.

import filoma as flm

dataset = "notebooks/Weeds-3"

# 1) Fast scan + high-level summary
analysis = flm.probe(dataset)
analysis.print_summary()

# 2) Build an enriched dataframe (paths, extension, sizes, ownership, timestamps, etc.)
df = flm.probe_to_df(dataset, enrich=True)

# 3) Narrow to image files and inspect distribution
images = df.filter_by_extension(["jpg", "png"])
print(images.extension_counts())
print(images.directory_counts().head(3))

# 4) Get the largest files quickly
largest = images.sort("size_bytes", descending=True).head(5)
print(largest.select(["path", "size_bytes"]))

This flow is typically the fastest way to move from raw folder structure to actionable dataset insight.

1. File & Image Profiling

Extract rich metadata and statistics from any file or image.

import filoma as flm

# Profile any file
info = flm.probe_file("README.md")
print(info)

📄 See Metadata Output

Filo(
    path=PosixPath('README.md'),
    size=12237,
    mode_str='-rw-rw-r--',
    owner='user',
    modified=datetime.datetime(2025, 12, 30, 22, 45, 53),
    is_file=True,
    ...
)

For images, probe_image automatically extracts shapes, types, and pixel statistics.
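
The file-level fields in the Filo output above (size, mode_str, modified, is_file) correspond to data the operating system already exposes through os.stat. The sketch below collects the same kind of fields with the standard library only. It is an illustration, not filoma's implementation, and describe_file is a hypothetical helper name.

```python
import datetime
import stat
import tempfile
from pathlib import Path


def describe_file(path):
    """Collect a few os.stat-derived fields (hypothetical helper, not filoma's API)."""
    p = Path(path)
    st = p.stat()
    return {
        "path": p,
        "size": st.st_size,                     # size in bytes
        "mode_str": stat.filemode(st.st_mode),  # e.g. '-rw-rw-r--'
        "modified": datetime.datetime.fromtimestamp(st.st_mtime),
        "is_file": p.is_file(),
    }


if __name__ == "__main__":
    # Demonstrate on a throwaway file so the example runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "README.md"
        sample.write_text("hello filoma\n")
        for key, value in describe_file(sample).items():
            print(f"{key}: {value}")
```

Owner and group lookups are left out because they rely on platform-specific modules.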

2. Directory Analysis

Scan entire directory trees in milliseconds. filoma automatically picks the fastest available backend (Rust → fd → Python).

# Analyze a directory
analysis = flm.probe('.')

# Print high-level summary
analysis.print_summary()

📂 See Directory Summary Table

 Directory Analysis: /project (🦀 Rust (Parallel)) - 0.60s
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                   ┃ Value                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ Total Files              │ 57,225               │
│ Total Folders            │ 3,427                │
│ Total Size               │ 2,084.90 MB          │
│ Average Files per Folder │ 16.70                │
│ Maximum Depth            │ 14                   │
│ Empty Folders            │ 103                  │
│ Analysis Time            │ 0.60s                │
│ Processing Speed         │ 102,114 items/sec    │
└──────────────────────────┴──────────────────────┘

# Or get a detailed report with extensions and folder stats
analysis.print_report()

📊 See Detailed Directory Report

          File Extensions
┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ Extension  ┃ Count  ┃ Percentage ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ .py        │ 240    │ 12.8%      │
│ .jpg       │ 1,204  │ 64.2%      │
│ .json      │ 431    │ 23.0%      │
│ .svg       │ 28,674 │ 50.1%      │
└────────────┴────────┴────────────┘

          Common Folder Names
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Folder Name   ┃ Occurrences ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ src           │ 1           │
│ tests         │ 1           │
│ docs          │ 1           │
│ notebooks     │ 1           │
└───────────────┴─────────────┘

          Empty Folders (3 found)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Path                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ /project/data/raw/empty_set_A              │
│ /project/logs/old/unused                   │
│ /project/temp/scratch                      │
└────────────────────────────────────────────┘
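
To make the summary metrics concrete, here is a minimal pure-Python sketch that computes similar numbers with os.walk. It is not filoma's implementation: summarize_tree is a hypothetical name, and the choices to exclude the root from the folder count and to measure depth in path components below the root are assumptions.

```python
import os
import tempfile
from pathlib import Path


def summarize_tree(root):
    """Walk `root` and compute counts similar to the summary table (sketch only)."""
    root = os.path.abspath(root)
    total_files = total_folders = total_size = max_depth = 0
    empty_folders = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath != root:
            total_folders += 1  # assumption: the root itself is not counted
            depth = len(Path(dirpath).relative_to(root).parts)
            max_depth = max(max_depth, depth)
            if not dirnames and not filenames:
                empty_folders.append(dirpath)
        total_files += len(filenames)
        for name in filenames:
            try:
                total_size += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # unreadable or vanished file: skip it
    return {
        "total_files": total_files,
        "total_folders": total_folders,
        "total_size_bytes": total_size,
        "max_depth": max_depth,
        "empty_folders": empty_folders,
    }


if __name__ == "__main__":
    # Build a tiny tree in a temporary directory and summarize it.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "data" / "raw").mkdir(parents=True)
        (Path(tmp) / "notes.txt").write_text("hello")
        for key, value in summarize_tree(tmp).items():
            print(f"{key}: {value}")
```

A single-threaded walk like this is roughly what a pure-Python fallback has to do, which is why the compiled backends matter on large trees.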

3. DataFrame Analysis

Convert scan results to Polars DataFrames for advanced analysis.

# Scan and get an enriched filoma.DataFrame (Polars)
df = flm.probe_to_df('src', enrich=True)

# Perform operations
df.filter_by_extension([".py", ".rs"])
df.directory_counts()

📊 See Enriched DataFrame Output

filoma.DataFrame with 2 rows
shape: (2, 18)
┌───────────────────┬───────┬────────┬───────────────┬───┬─────────┬───────┬────────┬────────┐
│ path              ┆ depth ┆ parent ┆ name          ┆ … ┆ inode   ┆ nlink ┆ sha256 ┆ xattrs │
│ ---               ┆ ---   ┆ ---    ┆ ---           ┆   ┆ ---     ┆ ---   ┆ ---    ┆ ---    │
│ str               ┆ i64   ┆ str    ┆ str           ┆   ┆ i64     ┆ i64   ┆ str    ┆ str    │
╞═══════════════════╪═══════╪════════╪═══════════════╪═══╪═════════╪═══════╪════════╪════════╡
│ src/async_scan.rs ┆ 1     ┆ src    ┆ async_scan.rs ┆ … ┆ 7601121 ┆ 1     ┆ null   ┆ {}     │
│ src/filoma        ┆ 1     ┆ src    ┆ filoma        ┆ … ┆ 7603126 ┆ 8     ┆ null   ┆ {}     │
└───────────────────┴───────┴────────┴───────────────┴───┴─────────┴───────┴────────┴────────┘

✨ Enriched columns added: parent, name, stem, suffix, size_bytes, modified_time,
   created_time, is_file, is_dir, owner, group, mode_str, inode, nlink, sha256, xattrs, depth

Seamless Pandas Integration: Just use df.pandas for instant conversion.
Lazy Loading: import filoma is cheap; heavy dependencies load only when needed.
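
Lazy loading of this kind is commonly built on a small proxy that defers the real import until an attribute is first used. The sketch below shows the general pattern with the standard library. How filoma implements it internally is not stated here, so treat LazyModule as a hypothetical illustration, with statistics standing in for a heavy dependency.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ only runs for attributes not found normally,
        # so _name and _module never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Creating the proxy is cheap: nothing has been imported yet.
stats = LazyModule("statistics")
print(stats._module is None)     # True

# The first attribute access triggers the real import.
print(stats.median([1, 3, 2]))
print(stats._module is None)     # False
```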

4. Specialized DataFrame Operations

Filoma's DataFrame extends Polars with filesystem-specific operations for quick filtering and summarization.

# Filter by extensions
df.filter_by_extension([".py", ".rs"])

# Quick frequency analysis
df.extension_counts()
df.directory_counts()

🔍 See Operation Examples

filter_by_extension([".py", ".rs"])

shape: (3, 1)
┌─────────────────────┐
│ path                │
│ ---                 │
│ str                 │
╞═════════════════════╡
│ src/async_scan.rs   │
│ src/lib.rs          │
│ src/filoma/dedup.py │
└─────────────────────┘

extension_counts() — groups files by extension and returns counts.

shape: (3, 2)
┌────────────┬─────┐
│ extension  ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ .py        ┆ 240 │
│ .jpg       ┆ 124 │
│ .json      ┆ 43  │
└────────────┴─────┘

directory_counts() — summarizes file distribution across parent directories.

shape: (3, 2)
┌────────────┬─────┐
│ parent_dir ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ src/filoma ┆ 12  │
│ tests      ┆ 8   │
│ docs       ┆ 5   │
└────────────┴─────┘
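
For readers who want to see what these three operations compute, the following standard-library sketch reproduces their semantics on a plain list of paths. It is not filoma's code. The functions only borrow the method names for readability, and accepting extensions with or without a leading dot is an assumption based on both spellings appearing in this README.

```python
from collections import Counter
from pathlib import PurePosixPath

paths = [
    "src/async_scan.rs",
    "src/lib.rs",
    "src/filoma/dedup.py",
]


def filter_by_extension(paths, extensions):
    """Keep paths whose suffix is in `extensions` (leading dot optional)."""
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]


def extension_counts(paths):
    """Count files per extension."""
    return Counter(PurePosixPath(p).suffix for p in paths)


def directory_counts(paths):
    """Count files per parent directory."""
    return Counter(str(PurePosixPath(p).parent) for p in paths)


print(filter_by_extension(paths, ["rs"]))
print(extension_counts(paths).most_common())
print(directory_counts(paths).most_common())
```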

🗂️ Advanced Topics

Dataset Convenience Class

Use the Dataset class to orchestrate snapshotting, profiling, integrity checks, and AI interactions:

import filoma as flm

ds = flm.Dataset("./my_data")

# Snapshot, Quality Scan, and Deduplication
ds.snap(mode="deep")
ds.run_quality_scan()
ds.dedup()

# Get an enriched DataFrame of the dataset
df = ds.to_dataframe()
print(df.extension_counts())

# Agentic interaction with this specific dataset
ds.get_brain().run("Is there any class imbalance in my dataset?")

Dataset Integrity & Quality

Filoma provides a suite for dataset validation (corruption, leakage, class balance) and manifest integrity checks:

from filoma.core.verifier import DatasetVerifier
verifier = DatasetVerifier("./data")
verifier.run_all()
verifier.print_summary()
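
Manifest integrity generally means recording a content hash per file and later comparing the tree against that record. The sketch below shows the idea with hashlib. It does not describe how DatasetVerifier works internally, and build_manifest and verify_manifest are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Compare the current tree against a previously recorded manifest."""
    current = build_manifest(root)
    shared = manifest.keys() & current.keys()
    return {
        "missing": sorted(manifest.keys() - current.keys()),
        "added": sorted(current.keys() - manifest.keys()),
        "changed": sorted(k for k in shared if manifest[k] != current[k]),
    }
```

A report with three empty lists means the tree still matches the snapshot. Any entry under "changed" points at a file whose bytes differ from the recorded version.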

Deduplication

Find duplicate files, images (perceptual hash), or text files.

# Standard find
filoma dedup /path/to/dataset

# Cross-directory find
filoma dedup train/ valid/ --cross-dir
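
Exact-duplicate detection can be pictured as grouping files by a hash of their content. The sketch below does that with hashlib, including a cross-directory variant that helps spot train/validation leakage. It covers byte-identical files only, with no perceptual or text similarity. It is not filoma's algorithm, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file's bytes (reads the whole file; fine for a sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def iter_files(root):
    """Yield every regular file under root in a stable order."""
    return (p for p in sorted(Path(root).rglob("*")) if p.is_file())


def find_duplicates(*roots):
    """Return groups (lists) of files that share identical content."""
    groups = defaultdict(list)
    for root in roots:
        for path in iter_files(root):
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def cross_dir_duplicates(dir_a, dir_b):
    """Return (path_in_a, path_in_b) pairs with byte-identical content."""
    by_digest = defaultdict(list)
    for path in iter_files(dir_a):
        by_digest[file_digest(path)].append(path)
    pairs = []
    for path in iter_files(dir_b):
        for match in by_digest.get(file_digest(path), []):
            pairs.append((match, path))
    return pairs
```

Calling cross_dir_duplicates("train", "valid") returns one pair for every file in valid/ whose content also appears in train/.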

Agentic Analysis

Connect a "brain" to your filesystem for natural language interaction:

import asyncio

from filoma.brain import get_agent


async def main():
    agent = get_agent()
    # In a notebook you can await these calls directly; a script needs a coroutine.
    await agent.run("Create a dataframe from notebooks/Weeds-3 with enrichment")
    await agent.run("Filter by extension: jpg, png")
    await agent.run("Summarize dataframe and show top directories")
    await agent.run("Sort dataframe by size descending and show top 5")


asyncio.run(main())

Or use the interactive chat CLI:

filoma brain chat
# Then ask:
# - create a dataframe from notebooks/Weeds-3
# - filter by extension jpg,png
# - summarize dataframe
# - export dataframe to weeds_images.csv

Interactive CLI

filoma brain chat

📖 Browse all guides →

📊 Performance & Benchmarks

Need to compare backend performance? Check out the comprehensive Benchmarks Guide!

Local SSD (1M files):

🦀 Rust: 7.3s (136K files/sec)
⚡ Async: 11.5s (87K files/sec)
🐍 Python: 35.5s (28K files/sec)

Network Storage (200K files, cold cache):

🦀 Rust: 2.3s (86K files/sec)
⚡ Async: 2.8s (70K files/sec)
🐍 Python: 15.1s (13K files/sec)

python benchmarks/benchmark.py --path /your/directory -n 3 --backend profiling
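
Which of these numbers applies on a given machine depends on which backend is available. The Rust → fd → Python preference order can be pictured as a chain of availability checks, as in the sketch below. This is an assumption-laden illustration rather than filoma's selection logic: the extension module name is made up, and pick_backend is a hypothetical helper.

```python
import importlib.util
import shutil


def rust_available(module_name="hypothetical_rust_ext"):
    """True if a compiled extension with this (made-up) name is importable."""
    return importlib.util.find_spec(module_name) is not None


def fd_available():
    """True if an `fd` executable is found on PATH."""
    return shutil.which("fd") is not None


def pick_backend(checks=None):
    """Return the name of the first backend whose availability check passes."""
    if checks is None:
        checks = [("rust", rust_available), ("fd", fd_available)]
    for name, is_available in checks:
        if is_available():
            return name
    return "python"  # the pure-Python fallback needs nothing extra


print(pick_backend())
```

Passing custom checks, for example pick_backend([("fd", fd_available)]), makes it easy to force a particular backend when comparing timings.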

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contributing

Contributions welcome! Please check the issues for planned features and bug reports.
