Vortex
A lightning-fast, zero-cloud CLI utility that cleans your Windows directories by parsing file contents and visual data, automatically deleting exact duplicates and generating intelligently named folder structures.
Table of Contents
- Overview
- Features
- Architecture
- Project Structure
- Tech Stack
- Supported File Formats
- Prerequisites
- Installation
- Usage
- How It Works
- Configuration
- Roadmap
- Contributing
- License
Overview
Vortex uses locally run embedding models to group your files by meaning. Whether you have text documents, code, PDFs, or images, it analyzes their content semantically and entirely on your machine. Your data never leaves it.
The Problem: Over time, directories accumulate duplicate files and an unstructured mess of documents, images, and code. Manually organizing them is tedious and error-prone.
The Solution: Vortex automates the entire process in a single command: it removes exact clones via SHA-256, semantically clusters the remaining files using vector embeddings and DBSCAN, and finally reorganizes them into auto-named folders derived from their content.
Features
Intelligent Deduplication
- Recursive Directory Traversal: Walks the entire target directory tree, resolving all nested files.
- SHA-256 Chunked Hashing: Identifies exact binary duplicates efficiently, even for very large files, using 4 MB chunked reads.
- Storage Statistics: Gathers file size metadata before acting.
Semantic Clustering
- Multi-Format Content Extraction: Pulls text from plain text files, code, PDFs (via PyMuPDF), and images (via Tesseract OCR).
- Local Embedding Generation: Generates text embeddings with `BAAI/bge-small-en-v1.5` and image embeddings with `Qdrant/clip-ViT-B-32-vision`, all via `fastembed`.
- Vector Database: Manages semantic indexes in a local Qdrant instance (on-disk, no server needed).
- DBSCAN Clustering: Groups files by semantic proximity using cosine-distance DBSCAN with tuned `eps` and `min_samples` per modality.
- TF-IDF Directory Naming: Automatically generates human-readable folder names by extracting the most relevant terms from each cluster's combined content.
Human-in-the-Loop (HITL)
Implementation in progress.
An interactive review phase that will let you verify the proposed folder structure and directory names before Vortex commits changes to your filesystem.
Architecture
flowchart TD
    A["Target Directory"] --> B["Phase 1: Deduplication"]
    subgraph DEDUP ["src/dedup"]
        B --> B1["dirTraversal - Recursive file discovery"]
        B1 --> B2["fileHash - SHA-256 chunked hashing"]
        B2 --> B3["fileStats - File size metadata"]
        B3 --> B4{"Duplicates found?"}
        B4 -- Yes --> B5["Remove exact clones"]
        B4 -- No --> C
        B5 --> C
    end
    C["Unique Files"] --> D["Phase 2: Clustering"]
    subgraph CLUSTER ["src/clustering"]
        D --> D1["fileContent - Multi-format text extraction"]
        D1 --> D2["embeddings - Vector generation"]
        D2 --> D2a["Text: BAAI/bge-small-en-v1.5 - 384d"]
        D2 --> D2b["Image: Qdrant/clip-ViT-B-32-vision - 512d"]
        D2a --> D3["Qdrant Vector DB - Local on-disk storage"]
        D2b --> D3
        D3 --> D4["dbscanModel - Cosine DBSCAN clustering"]
        D4 --> D5["dirNaming - TF-IDF folder name generation"]
    end
    D5 --> E["Phase 3: HITL Review"]
    E --> F["Organized Directory"]
Project Structure
Vortex/
├── main.py                  # CLI entry point (Typer + Rich)
├── pyproject.toml           # Project metadata & dependencies
├── uv.lock                  # Locked dependency versions
├── .python-version          # Python 3.12
│
├── src/
│   ├── dedup/               # Phase 1: Deduplication
│   │   ├── dirTraversal.py  # Recursive directory walker
│   │   ├── fileHash.py      # SHA-256 chunked file hashing
│   │   └── fileStats.py     # File size statistics
│   │
│   ├── clustering/          # Phase 2: Semantic Clustering
│   │   ├── fileContent.py   # Multi-format content extraction
│   │   ├── embeddings.py    # Text & image embedding generation + Qdrant storage
│   │   ├── dbscanModel.py   # DBSCAN clustering with data retrieval from Qdrant
│   │   └── dirNaming.py     # TF-IDF based directory name generation
│   │
│   └── hitl/                # Phase 3: Human-in-the-Loop (WIP)
│
└── docs/                    # Documentation (WIP)
Tech Stack
| Category | Technology | Purpose |
|---|---|---|
| Language | Python ≥ 3.12 | Core runtime |
| CLI Framework | Typer + Rich | Command-line interface with styled output |
| Text Embeddings | FastEmbed (`BAAI/bge-small-en-v1.5`) | 384-dim text vectors |
| Image Embeddings | FastEmbed (`Qdrant/clip-ViT-B-32-vision`) | 512-dim image vectors |
| Vector Database | Qdrant (local on-disk mode) | Semantic index storage |
| Clustering | scikit-learn (DBSCAN) | Density-based grouping |
| PDF Parsing | PyMuPDF | Text extraction from PDFs |
| OCR | Tesseract + pytesseract | Image-to-text extraction |
| Image Processing | Pillow | Image loading for OCR |
| TF-IDF | scikit-learn (TfidfVectorizer) | Directory name generation |
| Package Manager | uv | Fast dependency management |
Supported File Formats
| Category | Extensions | Extraction Method |
|---|---|---|
| Plain Text / Code | `.txt`, `.md`, `.csv`, `.json`, `.py`, `.js`, `.html` | Direct UTF-8 read |
| Documents | `.pdf` | PyMuPDF text extraction |
| Images | `.png`, `.jpg`, `.jpeg` | CLIP embeddings (visual) + Tesseract OCR (textual) |
Note: Images are embedded using CLIP for visual similarity clustering. OCR is used separately when text content is needed (e.g., for directory naming).
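The table above amounts to a dispatch on file extension. A minimal sketch of that routing follows. The helper name `extraction_method`, the route labels it returns, and the case-insensitive matching are assumptions made for illustration; the project's actual logic lives in fileContent.py and may differ.

```python
from pathlib import Path
from typing import Optional

# Extension sets copied from the Supported File Formats table.
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".py", ".js", ".html"}
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def extraction_method(path: str) -> Optional[str]:
    """Return a label for the extraction route, or None if unsupported.

    Hypothetical helper for illustration; not part of Vortex's API.
    """
    # Assumption: extensions are matched case-insensitively.
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "utf8"
    if suffix in PDF_EXTENSIONS:
        return "pymupdf"
    if suffix in IMAGE_EXTENSIONS:
        return "clip+ocr"
    return None
```

Files that return None here would simply fall outside the formats listed in the table.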
Prerequisites
- Python 3.12+
- uv (recommended)
- Tesseract OCR: Required for image text extraction.
  - Download the Windows installer from UB Mannheim.
  - Ensure `tesseract.exe` is accessible via your system `PATH`.
Installation
# Clone the repository
git clone https://github.com/44ompatil/Vortex.git
cd Vortex
# Install dependencies (uv recommended)
uv sync
# Or, using pip
pip install -e .
Usage
Vortex exposes a sort command that orchestrates the full pipeline:
# Sort a directory
uv run python main.py sort <target-directory>
# Example
uv run python main.py sort "C:\Users\you\Downloads"
Available Commands
| Command | Description |
|---|---|
| `sort <directory>` | Run the full dedup → cluster → organize pipeline on the target directory |
| `help` | Display available commands and usage information |
What Happens When You Run sort
- Scan: Recursively discovers all files in the target directory.
- Dedup: Identifies and removes exact binary duplicates (SHA-256).
- Extract: Pulls text/visual content from each unique file.
- Embed: Generates vector embeddings (text: 384d, image: 512d).
- Cluster: Groups semantically similar files using DBSCAN.
- Name: Generates descriptive folder names via TF-IDF.
- Organize: Moves files into their newly created, named folders.
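The final Organize step boils down to creating one folder per cluster and moving each file into it. A minimal sketch follows, assuming a hypothetical `organize` helper and a plan dictionary that maps folder names to file paths. How Vortex itself handles name clashes is not described here, so the sketch simply skips a file rather than overwrite an existing one.

```python
import shutil
from pathlib import Path


def organize(root: str, plan: dict[str, list[str]]) -> list[Path]:
    """Move each file into root/<folder name>/ and return the new paths.

    Hypothetical helper; `plan` maps a folder name to the files assigned to it.
    """
    moved = []
    for folder_name, files in plan.items():
        target_dir = Path(root) / folder_name
        target_dir.mkdir(parents=True, exist_ok=True)
        for file_path in files:
            destination = target_dir / Path(file_path).name
            if destination.exists():
                # Assumption: skip rather than overwrite on a name clash.
                continue
            shutil.move(str(file_path), str(destination))
            moved.append(destination)
    return moved
```

Returning the list of new paths makes it easy to print a summary of what was moved.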
How It Works
Phase 1: Deduplication (src/dedup/)
Files are recursively discovered via os.walk. Each file is hashed using SHA-256 with 4 MB chunked reads to handle large files efficiently. Files sharing the same hash are identified as exact duplicates: only one copy is kept, and the rest are deleted.
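The phase described above can be sketched with the standard library alone. The function names `sha256_of_file` and `find_duplicates` are illustrative and are not taken from the project's source. The sketch only reports groups of duplicates and leaves deletion to the caller.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the chunk size described above


def sha256_of_file(path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[str]]:
    """Map each hash seen more than once to the sorted paths that share it."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            by_hash[sha256_of_file(full_path)].append(full_path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Keeping the first path in each group and removing the others would reproduce the "one copy is kept" behaviour. Which copy Vortex actually keeps is not stated here.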
Phase 2: Semantic Clustering (src/clustering/)
Remaining unique files have their content extracted based on file type. Text content is embedded into 384-dimensional vectors using BAAI/bge-small-en-v1.5, while images are embedded into 512-dimensional vectors using Qdrant/clip-ViT-B-32-vision. All vectors are stored in a local Qdrant database.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) then groups files by semantic similarity using cosine distance. The algorithm parameters are tuned separately for each modality:
| Modality | `eps` | `min_samples` |
|---|---|---|
| Text | 0.20 | 2 |
| Image | 0.45 | 2 |
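To make the role of `eps` and `min_samples` concrete, here is a small dependency-free sketch of DBSCAN over cosine distance. It is a plain-Python stand-in written for illustration, not the project's code: Vortex uses scikit-learn, where the corresponding call would presumably be `DBSCAN(eps=0.20, min_samples=2, metric="cosine")`. The sketch assumes non-zero vectors.

```python
import math

NOISE = -1


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; NOISE (-1) marks unclustered points."""
    n = len(vectors)
    # Each neighbourhood includes the point itself, as in scikit-learn.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
        cluster_id += 1
    return labels
```

With the text settings from the table (`eps=0.20`, `min_samples=2`), two files form a cluster as soon as their embeddings are within a cosine distance of 0.20 of each other. Files with no such neighbour are labelled as noise.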
Each cluster is assigned a human-readable name generated by running TF-IDF on the combined text content of the cluster's files, extracting the top-2 most distinctive terms.
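The naming step can be illustrated without scikit-learn. The sketch below treats each cluster's combined text as one document, scores terms with TF-IDF, and joins the top terms with an underscore. The tokenizer, the underscore separator, the alphabetical tie-breaking, and the "misc" fallback are all assumptions for illustration; the project itself uses scikit-learn's TfidfVectorizer in dirNaming.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens of three or more letters (an assumed, simple rule)."""
    return re.findall(r"[a-z]{3,}", text.lower())


def name_clusters(cluster_texts, top_n_words=2):
    """Return one folder name per cluster from its highest-scoring TF-IDF terms."""
    counts = [Counter(tokenize(text)) for text in cluster_texts]
    n_docs = len(counts)
    # Document frequency: in how many clusters does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    names = []
    for c in counts:
        total = sum(c.values())
        scores = {
            # Smoothed IDF, the form scikit-learn's TfidfVectorizer uses by default.
            term: (freq / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        names.append("_".join(ranked[:top_n_words]) or "misc")
    return names
```

Terms that appear in every cluster receive a lower IDF weight, so a word shared by all groups tends to lose out to words specific to one cluster.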
Phase 3: Human-in-the-Loop (src/hitl/)
Coming soon. This phase will present the proposed folder structure for interactive review before files are moved.
Configuration
Currently, model parameters are configured in-code:
| Parameter | Location | Default | Description |
|---|---|---|---|
| `chunkSize` | `fileHash.py` | `4194304` (4 MB) | Byte chunk size for SHA-256 hashing |
| `txtEps` | `dbscanModel.py` | `0.20` | DBSCAN epsilon for text clusters |
| `txtMinPts` | `dbscanModel.py` | `2` | DBSCAN minimum points for text clusters |
| `imgEps` | `dbscanModel.py` | `0.45` | DBSCAN epsilon for image clusters |
| `imgMinPts` | `dbscanModel.py` | `2` | DBSCAN minimum points for image clusters |
| `top_n_words` | `dirNaming.py` | `2` | Number of TF-IDF terms used in folder names |
| `PageSize` | `dbscanModel.py` | `30` | Qdrant scroll page size |
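Because these values currently live in separate modules, it can help to see them side by side as a single settings object. This is a sketch only: `VortexDefaults` is a hypothetical name that does not exist in the project, while the field names and default values are copied from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VortexDefaults:
    """Defaults copied from the Configuration table; the class is hypothetical."""

    chunkSize: int = 4 * 1024 * 1024  # bytes per SHA-256 read (4 MB)
    txtEps: float = 0.20              # DBSCAN epsilon for text clusters
    txtMinPts: int = 2                # DBSCAN minimum points for text clusters
    imgEps: float = 0.45              # DBSCAN epsilon for image clusters
    imgMinPts: int = 2                # DBSCAN minimum points for image clusters
    top_n_words: int = 2              # TF-IDF terms used in folder names
    PageSize: int = 30                # Qdrant scroll page size


DEFAULTS = VortexDefaults()

# Experimenting with a looser text threshold without mutating the defaults.
looser_text = replace(DEFAULTS, txtEps=0.25)
```

A frozen dataclass keeps the defaults immutable, so any experiment has to create an explicit copy.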
Roadmap
- Recursive directory traversal
- SHA-256 chunked deduplication
- Multi-format content extraction (text, PDF, images)
- Local text & image embedding generation
- Qdrant vector storage
- DBSCAN semantic clustering
- TF-IDF auto-naming for directories
- Typer CLI with Rich output
- Human-in-the-Loop interactive review
- Config file support (YAML/TOML)
- Cross-modal clustering (text + image in unified space)
- Undo / dry-run mode
- Progress bars and summary statistics
- Additional file format support (`.docx`, `.xlsx`, `.pptx`)
Contributing
Contributions are welcome! Here's how to get started:
- Fork the repository.
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes: `git commit -m "Add your feature"`
- Push to the branch: `git push origin feature/your-feature`
- Open a Pull Request.
Please ensure your code follows the existing style and includes appropriate documentation.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Built with ❤️ for anyone drowning in unorganized files.