FileWise

AI-powered intelligent file organizer — find semantically similar files, trace document version evolution, and identify the "real final draft."

Unlike hash-based deduplication tools (czkawka, fdupes, rdfind), FileWise uses embedding similarity to recognize different versions of the same document, even when content has been edited, renamed, or scattered across directories.
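To illustrate why hashing alone misses edited versions, here is a minimal sketch using only the standard library. The two sample strings are made up, and `difflib.SequenceMatcher` is only a stand-in for the embedding similarity FileWise uses; it is not FileWise's actual scoring.

```python
import hashlib
from difflib import SequenceMatcher

# Two hypothetical versions of the same document (illustrative text only).
v1 = "Budget proposal: we request 10,000 USD for new servers."
v2 = "Budget proposal: we request 12,000 USD for new servers and storage."


def sha256_hex(text: str) -> str:
    """Content hash, as a hash-based deduplicator would compute it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


same_hash = sha256_hex(v1) == sha256_hex(v2)
score = similarity(v1, v2)

print("identical hashes:", same_hash)     # the edit changes the hash
print("similarity score:", round(score, 2))  # but the texts stay close
```

A single changed character is enough to change the hash, while a similarity score degrades gradually, which is what lets a tool relate an edited file to its earlier version.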

Fully local — the embedding model runs on your machine. No data ever leaves your disk.

Quick Start

git clone https://github.com/Maobuchiyugutou/FileWise.git
cd FileWise
pip install -e ".[all]"

# Scan a directory
filewise scan ~/Documents

# AI-powered analysis — find similar files and version chains
filewise analyze ~/Documents

# Compare two files
filewise diff proposal_v1.md proposal_v2.md

# Smart rename — add version prefixes based on analysis
filewise rename ~/Documents            # dry-run
filewise rename ~/Documents --apply    # apply renames

# Natural language search — find files by describing what you want
filewise search "budget proposal" ~/Documents

# Find files similar to a specific file
filewise find-similar draft.md

Commands

Command                         Description
filewise scan <dir>             List files by format, show supported/unsupported counts
filewise analyze <dir>          Full AI pipeline: find similar files and version chains
filewise diff <A> <B>           Line-level content comparison between two files
filewise rename <dir>           Rename files to show version order (--apply to execute)
filewise search <query> <dir>   Natural language search with auto mode detection
filewise find-similar <file>    Find files semantically similar to a given file
filewise evaluate <dir>         Run algorithm accuracy tests against ground-truth scenarios
filewise info                   System info and supported formats

How It Works

Scan files → Extract text → Split into chunks → Generate embeddings
    → Cluster by similarity → Infer version chains → Display results
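The "Split into chunks" step can be sketched as follows. This is a hedged illustration of paragraph-first chunking, not FileWise's actual chunker: the function name `chunk_paragraphs` and the `max_chars` limit are assumptions made for this example.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars is hard-split into fixed-size pieces.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Oversized paragraphs are cut into pieces no longer than max_chars.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate      # still fits: keep packing
            else:
                chunks.append(current)   # flush and start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa\n\nbbb\n\n" + "c" * 25
print(chunk_paragraphs(sample, max_chars=10))
```

Keeping paragraphs whole where possible means each embedding covers a coherent unit of text rather than an arbitrary window.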

Version Chain Algorithm

Three-stage, multi-signal approach:

  1. Clustering (DBSCAN + hierarchical refinement) — group files by content similarity only, ignoring file names
  2. Ordering — determine version direction using:
    • Content containment (primary): how much of A appears in B?
    • Filename dates: extract dates such as 2025-04-17 from filenames
    • Version patterns: v1 → v2, draft → final, 第1版 → 第2版
    • Modification time (secondary)
  3. Chain construction — topological sort with confidence tiers (HIGH / MEDIUM / LOW)
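As a rough sketch of the ordering and chain-construction stages, the example below scores content containment with word trigrams and orders files with a topological sort. The trigram measure, the `margin` threshold, and all function names are assumptions for illustration; confidence tiers and the other signals are left out, and FileWise's real scoring may differ.

```python
from graphlib import TopologicalSorter


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams in text (the whole text as one tuple if it is shorter than n)."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a: str, b: str) -> float:
    """Fraction of a's n-grams that also appear in b: how much of A appears in B."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def order_versions(docs: dict[str, str], margin: float = 0.1) -> list[str]:
    """Order files oldest-first: if A is contained in B more than B is in A, A precedes B."""
    graph: dict[str, set[str]] = {name: set() for name in docs}
    names = list(docs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ab = containment(docs[a], docs[b])
            ba = containment(docs[b], docs[a])
            if ab - ba > margin:
                graph[b].add(a)   # a is a predecessor of b
            elif ba - ab > margin:
                graph[a].add(b)   # b is a predecessor of a
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


draft = "the quick brown fox jumps over the lazy dog"
v2 = draft + " and then runs back home"
final = v2 + " before the sun sets"
print(order_versions({"final.md": final, "draft.md": draft, "v2.md": v2}))
```

Containment is asymmetric by design: an earlier draft tends to be almost fully contained in a later, longer version, while the reverse is not true. That asymmetry is what gives the chain a direction.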

Special cases handled: very short files (substring matching), heavily rewritten documents (filename signal boost), format variants (same content, different extensions).
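The filename signals mentioned above (dates and version patterns) could be extracted along these lines. The regular expressions, the `STAGE_RANK` table, and the `filename_signals` helper are hypothetical; the README does not specify how FileWise parses filenames.

```python
import re
from datetime import date

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
# "v" followed by digits, not preceded by a letter or digit (so "rev2" is not matched).
VERSION_RE = re.compile(r"(?<![A-Za-z0-9])v(\d+)", re.IGNORECASE)
# Chinese version marker, e.g. 第1版 ("1st edition").
CN_VERSION_RE = re.compile(r"第(\d+)版")
# Hypothetical ranking of stage words: draft comes before final.
STAGE_RANK = {"draft": 0, "final": 1}


def filename_signals(name: str) -> dict[str, object]:
    """Collect ordering hints from a filename; keys are present only when found."""
    signals: dict[str, object] = {}
    m = DATE_RE.search(name)
    if m:
        try:
            signals["date"] = date(*map(int, m.groups()))
        except ValueError:
            pass  # looks like a date but is not a valid one, e.g. 2025-13-40
    m = VERSION_RE.search(name) or CN_VERSION_RE.search(name)
    if m:
        signals["version"] = int(m.group(1))
    lowered = name.lower()
    for word, rank in STAGE_RANK.items():
        if word in lowered:
            signals["stage"] = rank
    return signals


for example in ("proposal_v2_2025-04-17.md", "报告第3版.docx", "report_final.docx"):
    print(example, filename_signals(example))
```

Signals like these only hint at order, which fits the README's description of them as a boost for heavily rewritten documents, where content containment is weak.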

Evaluation

21 scenarios test the algorithm against common edge cases: 17 version chain scenarios and 4 search scenarios, all of which currently pass:

filewise evaluate tests/eval_scenarios
# 17 version chain scenarios (100%) + 4 search scenarios (R@5=100%)
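Assuming R@5 denotes recall at 5 (the README does not define it), the metric can be computed as in this sketch. The function name and the sample file names are illustrative, not taken from the FileWise test suite.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k search results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical search results and ground truth.
results = ["budget_v2.md", "notes.txt", "budget_v1.md", "todo.md", "plan.md", "old.md"]
print(recall_at_k(results, {"budget_v1.md", "budget_v2.md"}))  # both in top 5
print(recall_at_k(results, {"budget_v1.md", "old.md"}))        # one falls outside top 5
```

Under this reading, R@5 = 100% means every expected file appeared among the first five results in each search scenario.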

Supported Formats

Category       Extensions
Documents      .pdf, .docx, .doc, .odt
Text           .txt, .md, .markdown, .rst, .log
Code           .py, .js, .ts, .go, .rs, .java, .c, .cpp, .h, .sh, .sql
Config/Data    .json, .yaml, .yml, .toml, .xml, .csv, .tsv
Web            .html, .css

Tech Stack

Layer              Choice
Embedding          sentence-transformers + BAAI/bge-small-zh-v1.5 (Chinese/English)
Vector Store       ChromaDB (persistent, incremental)
Clustering         scikit-learn (DBSCAN + hierarchical refinement)
Document Parsing   python-docx, PyPDF2 (with text cache)
CLI                typer + rich
CI                 GitHub Actions (pytest + ruff on every push)

Roadmap

  • File scanner
  • Multi-format document parser
  • Text chunking (paragraph-first)
  • Embedding generation
  • Vector storage (ChromaDB, persistent)
  • Similarity clustering (DBSCAN + hierarchical)
  • Version chain inference (multi-signal scoring)
  • Content diff
  • Format variant detection (same-name, different extension)
  • Smart rename (version-aware file renaming)
  • Natural language search (semantic + keyword hybrid)
  • File-anchored similarity search (find-similar)
  • Evaluation framework (21 scenarios, 100% accuracy)
  • CI/CD pipeline (GitHub Actions)
  • Incremental indexing (watchdog — auto-detect file changes)
  • TUI interface (Textual, Yazi-like)
  • PyPI package (pip install filewise)

Requirements

  • Python 3.10+
  • ~100MB disk for embedding model (downloaded on first use, cached locally)
