AI-powered intelligent file organizer — find duplicates, track versions, identify the real final draft
FileWise
AI-powered intelligent file organizer — find semantically similar files, trace document version evolution, and identify the "real final draft."
Unlike hash-based deduplication tools (czkawka, fdupes, rdfind), FileWise uses embedding similarity to recognize different versions of the same document, even when content has been edited, renamed, or scattered across directories.
Fully local — the embedding model runs on your machine. No data ever leaves your disk.
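To illustrate the contrast with hash-based deduplication, here is a minimal sketch. It is not FileWise's code: the bag-of-words vector is a deliberately crude stand-in for a real embedding model, and all function names are hypothetical. It shows that a one-word edit changes the SHA-256 digest completely while a similarity score stays high.

```python
import hashlib
import math
import re
from collections import Counter


def sha256_hex(text: str) -> str:
    """Exact-content fingerprint, the kind used by hash-based deduplicators."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stand-in for a model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


v1 = "The budget proposal covers travel and equipment costs for 2025."
v2 = "The budget proposal covers travel, equipment, and training costs for 2025."

hashes_match = sha256_hex(v1) == sha256_hex(v2)
similarity = cosine_similarity(bag_of_words(v1), bag_of_words(v2))
print(f"hashes match: {hashes_match}")
print(f"similarity:   {similarity:.2f}")
```

A hash-based tool would treat these two texts as unrelated files; a similarity-based tool can flag them as likely versions of one document.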
Quick Start
git clone https://github.com/Maobuchiyugutou/FileWise.git
cd FileWise
pip install -e ".[all]"
# Scan a directory
filewise scan ~/Documents
# AI-powered analysis — find similar files and version chains
filewise analyze ~/Documents
# Compare two files
filewise diff proposal_v1.md proposal_v2.md
# Smart rename — add version prefixes based on analysis
filewise rename ~/Documents # dry-run
filewise rename ~/Documents --apply # apply renames
# Natural language search — find files by describing what you want
filewise search "budget proposal" ~/Documents
# Find files similar to a specific file
filewise find-similar draft.md
Commands
| Command | Description |
|---|---|
| `filewise scan <dir>` | List files by format, show supported/unsupported counts |
| `filewise analyze <dir>` | Full AI pipeline: find similar files and version chains |
| `filewise diff <A> <B>` | Line-level content comparison between two files |
| `filewise rename <dir>` | Rename files to show version order (`--apply` to execute) |
| `filewise search <query> <dir>` | Natural language search with auto mode detection |
| `filewise find-similar <file>` | Find files semantically similar to a given file |
| `filewise evaluate <dir>` | Run algorithm accuracy tests against ground-truth scenarios |
| `filewise info` | System info and supported formats |
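As a rough idea of what a line-level comparison like `filewise diff` produces, the sketch below uses Python's standard `difflib.unified_diff`. Whether FileWise itself uses `difflib` is an assumption, and `line_diff` is a hypothetical helper name.

```python
import difflib


def line_diff(old_text: str, new_text: str,
              old_name: str = "a", new_name: str = "b") -> list[str]:
    """Return a unified, line-level diff between two texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )


v1 = "Title\nScope: travel\nTotal: 100\n"
v2 = "Title\nScope: travel and training\nTotal: 120\n"
for line in line_diff(v1, v2, "proposal_v1.md", "proposal_v2.md"):
    print(line)
```

Lines prefixed with `-` exist only in the first file and lines prefixed with `+` only in the second.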
How It Works
Scan files → Extract text → Split into chunks → Generate embeddings
→ Cluster by similarity → Infer version chains → Display results
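The "Split into chunks" step can be pictured with a small paragraph-first chunker. This is a sketch under assumptions: the 500-character default, the blank-line paragraph rule, and the function name `chunk_paragraph_first` are illustrative and not taken from FileWise.

```python
import re


def chunk_paragraph_first(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines, then pack paragraphs into chunks.

    Paragraphs are merged while the chunk stays within max_chars. A single
    paragraph longer than max_chars is hard-split as a fallback.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Oversized paragraphs are cut into max_chars-sized pieces.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_paragraph_first(sample, max_chars=40):
    print(repr(chunk))
```

Each chunk would then be passed to the embedding model; keeping paragraphs intact where possible tends to give each vector a coherent unit of meaning.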
Version Chain Algorithm
Three-stage, multi-signal approach:
- Clustering (DBSCAN + hierarchical refinement): group files by content similarity only, ignoring file names
- Ordering: determine version direction using:
  - Content containment (primary): how much of A appears in B?
  - Filename dates: extract `2025-04-17` from filenames
  - Version patterns: `v1→v2`, `draft→final`, `第1版→第2版`
  - Modification time (secondary)
- Chain construction: topological sort with confidence tiers (HIGH / MEDIUM / LOW)
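The ordering and chain-construction stages above can be sketched roughly as follows. This is not FileWise's implementation: the line-based containment measure, the `margin` threshold, the rule that the more-contained file is the older one, and all function names are assumptions for illustration. Modification time and confidence tiers are left out.

```python
import re
from graphlib import TopologicalSorter
from itertools import combinations
from typing import Optional

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
VERSION_RE = re.compile(r"(?<![A-Za-z])v(\d+)", re.IGNORECASE)


def containment(a: str, b: str) -> float:
    """Fraction of A's distinct non-empty lines that also appear in B."""
    lines_a = {ln.strip() for ln in a.splitlines() if ln.strip()}
    lines_b = {ln.strip() for ln in b.splitlines() if ln.strip()}
    return len(lines_a & lines_b) / len(lines_a) if lines_a else 0.0


def filename_key(name: str) -> Optional[tuple]:
    """Sortable filename hint: an ISO date or a vN number, else None."""
    match = DATE_RE.search(name)
    if match:
        return ("date", match.group())
    match = VERSION_RE.search(name)
    if match:
        return ("version", int(match.group(1)))
    return None


def order_pair(name_a: str, text_a: str, name_b: str, text_b: str,
               margin: float = 0.1) -> Optional[tuple[str, str]]:
    """Return (older, newer), or None when no signal decides the direction."""
    a_in_b = containment(text_a, text_b)
    b_in_a = containment(text_b, text_a)
    # Primary signal (assumption): the file mostly contained in the other is older.
    if a_in_b - b_in_a > margin:
        return name_a, name_b
    if b_in_a - a_in_b > margin:
        return name_b, name_a
    # Fallback signal: compare filename hints of the same kind.
    key_a, key_b = filename_key(name_a), filename_key(name_b)
    if key_a and key_b and key_a[0] == key_b[0] and key_a != key_b:
        return (name_a, name_b) if key_a < key_b else (name_b, name_a)
    return None


def build_chain(files: dict[str, str]) -> list[str]:
    """Order files oldest-first; raises graphlib.CycleError on conflicting signals."""
    sorter = TopologicalSorter({name: set() for name in files})
    for a, b in combinations(files, 2):
        pair = order_pair(a, files[a], b, files[b])
        if pair:
            older, newer = pair
            sorter.add(newer, older)  # the newer file depends on the older one
    return list(sorter.static_order())


files = {
    "report_final.md": "Intro\nMethods\nResults\nConclusion",
    "report_v1.md": "Intro",
    "report_v2.md": "Intro\nMethods",
}
print(build_chain(files))
```

Deciding direction pairwise and then sorting topologically means a chain can still be built when some pairs are undecided, which is one plausible reason to pair a multi-signal ordering step with a topological sort.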
Special cases handled: very short files (substring matching), heavily rewritten documents (filename signal boost), format variants (same content, different extensions).
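As one illustration of the format-variant case, files can be grouped by name stem and kept only when a group spans more than one extension. This sketch looks at names only; confirming that the content really matches would need the extracted text, and `group_format_variants` is a hypothetical helper, not a FileWise API.

```python
from collections import defaultdict
from pathlib import Path


def group_format_variants(paths: list[str]) -> dict[str, list[str]]:
    """Group paths that share a (case-insensitive) stem but differ in extension."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in paths:
        groups[Path(raw).stem.lower()].append(raw)
    return {
        stem: sorted(members)
        for stem, members in groups.items()
        # Keep only groups that contain at least two distinct extensions.
        if len({Path(m).suffix.lower() for m in members}) > 1
    }


candidates = ["docs/report.docx", "docs/report.pdf", "notes.txt", "old/Report.md"]
print(group_format_variants(candidates))
```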
Evaluation
The evaluation suite covers 21 ground-truth scenarios, including edge cases, and the algorithm passes all of them (100% accuracy):
filewise evaluate tests/eval_scenarios
# 17 version chain scenarios (100%) + 4 search scenarios (R@5=100%)
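For readers unfamiliar with the R@5 figure, recall@k is the share of relevant files that show up in the top k search results. A minimal sketch of the metric, with a hypothetical function name and made-up file names:

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k ranked results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


ranked_results = ["budget_v2.md", "notes.txt", "budget_v1.md", "todo.md", "plan.md"]
expected = {"budget_v1.md", "budget_v2.md"}
print(recall_at_k(ranked_results, expected, k=5))
```

R@5 = 100% therefore means that, in every search scenario, all expected files appeared among the first five results.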
Supported Formats
| Category | Extensions |
|---|---|
| Documents | .pdf, .docx, .doc, .odt |
| Text | .txt, .md, .markdown, .rst, .log |
| Code | .py, .js, .ts, .go, .rs, .java, .c, .cpp, .h, .sh, .sql |
| Config/Data | .json, .yaml, .yml, .toml, .xml, .csv, .tsv |
| Web | .html, .css |
Tech Stack
| Layer | Choice |
|---|---|
| Embedding | sentence-transformers + BAAI/bge-small-zh-v1.5 (Chinese/English) |
| Vector Store | ChromaDB (persistent, incremental) |
| Clustering | scikit-learn (DBSCAN + hierarchical refinement) |
| Document Parsing | python-docx, PyPDF2 (with text cache) |
| CLI | typer + rich |
| CI | GitHub Actions (pytest + ruff on every push) |
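To make the clustering row more concrete, here is a dependency-free sketch of DBSCAN over cosine distance. FileWise's stack names scikit-learn for this step; the pure-Python version below is a swapped-in illustration so it runs without extra packages, it omits the hierarchical refinement, and the `eps` and `min_samples` values are assumptions.

```python
import math
from typing import Optional


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors: list[list[float]], eps: float = 0.05,
           min_samples: int = 2) -> list[int]:
    """Label each vector with a cluster id; -1 marks noise."""
    n = len(vectors)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels: list[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, so expand through it
    return [label if label is not None else -1 for label in labels]


embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [-1.0, -1.0]]
print(dbscan(embeddings, eps=0.05, min_samples=2))
```

Density-based clustering suits this problem because the number of document families in a folder is not known in advance, and unrelated files can simply be left as noise.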
Roadmap
- File scanner
- Multi-format document parser
- Text chunking (paragraph-first)
- Embedding generation
- Vector storage (ChromaDB, persistent)
- Similarity clustering (DBSCAN + hierarchical)
- Version chain inference (multi-signal scoring)
- Content diff
- Format variant detection (same-name, different extension)
- Smart rename (version-aware file renaming)
- Natural language search (semantic + keyword hybrid)
- File-anchored similarity search (`find-similar`)
- Evaluation framework (18 scenarios, 100% accuracy)
- CI/CD pipeline (GitHub Actions)
- Incremental indexing (watchdog — auto-detect file changes)
- TUI interface (Textual, Yazi-like)
- PyPI package (`pip install filewise`)
Requirements
- Python 3.10+
- ~100 MB of disk space for the embedding model (downloaded on first use, cached locally)