🌀 Vortex

A lightning-fast, zero-cloud CLI utility that cleans your Windows directories by parsing file contents and visual data, automatically destroying clones and generating intelligently named folder structures.

Overview

Vortex uses local embedding models to group your files by meaning. Whether you have text documents, code, PDFs, or images, it analyzes their content semantically and entirely on your machine. Your data never leaves your machine.

The Problem: Over time, directories accumulate duplicate files and an unstructured mess of documents, images, and code. Manually organizing them is tedious and error-prone.

The Solution: Vortex automates the entire process in a single command. It deduplicates exact clones via SHA-256, semantically clusters the remaining files using vector embeddings and DBSCAN, and finally reorganizes them into auto-named folders derived from their content.


Features

🔍 Intelligent Deduplication

  • Recursive Directory Traversal: Walks the entire target directory tree, resolving all nested files.
  • SHA-256 Chunked Hashing: Identifies exact binary duplicates efficiently, even for very large files, using 4 MB chunked reads.
  • Storage Statistics: Gathers file size metadata before acting.

🧠 Semantic Clustering

  • Multi-Format Content Extraction: Pulls text from plain text files, code, PDFs (via PyMuPDF), and images (via Tesseract OCR).
  • Local Embedding Generation: Generates text embeddings with BAAI/bge-small-en-v1.5 and image embeddings with Qdrant/clip-ViT-B-32-vision, all via fastembed.
  • Vector Database: Manages semantic indexes in a local Qdrant instance (on-disk, no server needed).
  • DBSCAN Clustering: Groups files by semantic proximity using cosine-distance DBSCAN with tuned eps and min_samples per modality.
  • TF-IDF Directory Naming: Automatically generates human-readable folder names by extracting the most relevant terms from each cluster's combined content.

🛡️ Human-in-the-Loop (HITL)

Implementation in progress.

An interactive review phase that will let you verify the proposed folder structure and directory names before Vortex commits changes to your filesystem.


Architecture

flowchart TD
    A["🗂️ Target Directory"] --> B["Phase 1: Deduplication"]

    subgraph DEDUP ["src/dedup"]
        B --> B1["dirTraversal: Recursive file discovery"]
        B1 --> B2["fileHash: SHA-256 chunked hashing"]
        B2 --> B3["fileStats: File size metadata"]
        B3 --> B4{"Duplicates found?"}
        B4 -- Yes --> B5["Remove exact clones"]
        B4 -- No --> C
        B5 --> C
    end

    C["Unique Files"] --> D["Phase 2: Clustering"]

    subgraph CLUSTER ["src/clustering"]
        D --> D1["fileContent: Multi-format text extraction"]
        D1 --> D2["embeddings: Vector generation"]
        D2 --> D2a["Text: BAAI/bge-small-en-v1.5 → 384d"]
        D2 --> D2b["Image: Qdrant/clip-ViT-B-32-vision → 512d"]
        D2a --> D3["Qdrant Vector DB: Local on-disk storage"]
        D2b --> D3
        D3 --> D4["dbscanModel: Cosine DBSCAN clustering"]
        D4 --> D5["dirNaming: TF-IDF folder name generation"]
    end

    D5 --> E["Phase 3: HITL Review"]
    E --> F["📁 Organized Directory"]

Project Structure

Vortex/
├── main.py                          # CLI entry point (Typer + Rich)
├── pyproject.toml                   # Project metadata & dependencies
├── uv.lock                          # Locked dependency versions
├── .python-version                  # Python 3.12
│
├── src/
│   ├── dedup/                       # Phase 1: Deduplication
│   │   ├── dirTraversal.py          # Recursive directory walker
│   │   ├── fileHash.py              # SHA-256 chunked file hashing
│   │   └── fileStats.py             # File size statistics
│   │
│   ├── clustering/                  # Phase 2: Semantic Clustering
│   │   ├── fileContent.py           # Multi-format content extraction
│   │   ├── embeddings.py            # Text & image embedding generation + Qdrant storage
│   │   ├── dbscanModel.py           # DBSCAN clustering with data retrieval from Qdrant
│   │   └── dirNaming.py             # TF-IDF based directory name generation
│   │
│   └── hitl/                        # Phase 3: Human-in-the-Loop (WIP)
│
└── docs/                            # Documentation (WIP)

Tech Stack

| Category | Technology | Purpose |
|---|---|---|
| Language | Python ≥ 3.12 | Core runtime |
| CLI Framework | Typer + Rich | Command-line interface with styled output |
| Text Embeddings | FastEmbed (BAAI/bge-small-en-v1.5) | 384-dim text vectors |
| Image Embeddings | FastEmbed (Qdrant/clip-ViT-B-32-vision) | 512-dim image vectors |
| Vector Database | Qdrant (local on-disk mode) | Semantic index storage |
| Clustering | scikit-learn (DBSCAN) | Density-based grouping |
| PDF Parsing | PyMuPDF | Text extraction from PDFs |
| OCR | Tesseract + pytesseract | Image-to-text extraction |
| Image Processing | Pillow | Image loading for OCR |
| TF-IDF | scikit-learn (TfidfVectorizer) | Directory name generation |
| Package Manager | uv | Fast dependency management |

Supported File Formats

| Category | Extensions | Extraction Method |
|---|---|---|
| Plain Text / Code | .txt, .md, .csv, .json, .py, .js, .html | Direct UTF-8 read |
| Documents | .pdf | PyMuPDF text extraction |
| Images | .png, .jpg, .jpeg | CLIP embeddings (visual) + Tesseract OCR (textual) |

Note: Images are embedded using CLIP for visual similarity clustering. OCR is used separately when text content is needed (e.g., for directory naming).
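
As a rough illustration of the table above, extraction can be pictured as a dispatch on file extension. This is a minimal sketch, not the code in fileContent.py: the function name extract_text, the two extension sets, and the errors="replace" choice are assumptions made for the example. PyMuPDF and pytesseract are imported lazily, so the plain-text path runs without them.

```python
from pathlib import Path

# Extension groups taken from the Supported File Formats table.
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".py", ".js", ".html"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def extract_text(path):
    """Return the text of a supported file, or None for an unsupported format."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix in TEXT_EXTENSIONS:
        # Assumption: replace undecodable bytes instead of aborting the run.
        return path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        import fitz  # PyMuPDF, imported only when a PDF is seen

        with fitz.open(str(path)) as document:
            return "\n".join(page.get_text() for page in document)

    if suffix in IMAGE_EXTENSIONS:
        import pytesseract  # needs the Tesseract binary on PATH
        from PIL import Image

        with Image.open(path) as image:
            return pytesseract.image_to_string(image)

    return None
```

Returning None for unknown extensions lets a caller skip those files without special-casing them.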


Prerequisites

  1. Python 3.12 or later
  2. uv (recommended)
  3. Tesseract OCR: Required for image text extraction.
    • Download the Windows installer from UB Mannheim.
    • Ensure tesseract.exe is accessible via your system PATH.

Installation

# Clone the repository
git clone https://github.com/44ompatil/Vortex.git
cd Vortex

# Install dependencies (uv recommended)
uv sync

# Or, using pip
pip install -e .

Usage

Vortex exposes a sort command that orchestrates the full pipeline:

# Sort a directory
uv run python main.py sort <target-directory>

# Example
uv run python main.py sort "C:\Users\you\Downloads"

Available Commands

| Command | Description |
|---|---|
| sort <directory> | Run the full dedup → cluster → organize pipeline on the target directory |
| help | Display available commands and usage information |

What Happens When You Run sort

  1. Scan: Recursively discovers all files in the target directory.
  2. Dedup: Identifies and removes exact binary duplicates (SHA-256).
  3. Extract: Pulls text/visual content from each unique file.
  4. Embed: Generates vector embeddings (text: 384d, image: 512d).
  5. Cluster: Groups semantically similar files using DBSCAN.
  6. Name: Generates descriptive folder names via TF-IDF.
  7. Organize: Moves files into their newly created, named folders.
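
The final Organize step could look roughly like the sketch below. It is a hypothetical helper, not Vortex's own code: the name organize, the assignments mapping, and the counter suffix used on name collisions are all assumptions. Files mapped to None (for example, files that did not land in any cluster) are left where they are, which is also an assumption about the desired behavior.

```python
import shutil
from pathlib import Path


def organize(target_dir, assignments):
    """Move each file into target_dir/<folder name> and return the new paths.

    assignments maps a file path to a folder name, or to None to leave it alone.
    """
    target_dir = Path(target_dir)
    moved = []
    for source, folder in assignments.items():
        if folder is None:
            continue
        source = Path(source)
        destination_dir = target_dir / folder
        destination_dir.mkdir(parents=True, exist_ok=True)
        destination = destination_dir / source.name
        # Assumption: on a name collision, append a counter rather than overwrite.
        counter = 1
        while destination.exists():
            destination = destination_dir / f"{source.stem}_{counter}{source.suffix}"
            counter += 1
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved
```

Refusing to overwrite matters here because two files from different subdirectories can share a file name and still end up in the same cluster folder.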

How It Works

Phase 1: Deduplication (src/dedup/)

Files are recursively discovered via os.walk. Each file is hashed using SHA-256 with 4 MB chunked reads, so large files are never loaded into memory at once. Files sharing the same hash are identified as exact duplicates: only one copy is kept, and the rest are deleted.
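
The idea behind this phase can be sketched with the standard library alone. The function names sha256_of_file and find_duplicates are illustrative and are not the names used in src/dedup; the sketch only reports duplicate groups and deliberately deletes nothing.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the chunk size this README describes


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so it is never read into memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return {hash: [paths]} for every hash shared by two or more files under root."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            by_hash[sha256_of_file(path)].append(path)
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Each returned list holds byte-identical files, so a caller could keep one path from each list and remove the others.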

Phase 2: Semantic Clustering (src/clustering/)

Remaining unique files have their content extracted based on file type. Text content is embedded into 384-dimensional vectors using BAAI/bge-small-en-v1.5, while images are embedded into 512-dimensional vectors using Qdrant/clip-ViT-B-32-vision. All vectors are stored in a local Qdrant database.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) then groups files by semantic similarity using cosine distance. The algorithm parameters are tuned separately for each modality:

| Modality | eps | min_samples |
|---|---|---|
| Text | 0.20 | 2 |
| Image | 0.45 | 2 |
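
To make these parameters concrete, here is a small sketch of cosine-distance DBSCAN with scikit-learn, using the text defaults (eps=0.20, min_samples=2). The helper name cluster_embeddings and the toy 3-dimensional vectors are made up for the example; real text embeddings would be 384-dimensional and read from Qdrant.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_embeddings(vectors, eps, min_samples):
    """Label each row of vectors with a cluster id; -1 marks noise."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(np.asarray(vectors, dtype=float))


# Toy data: two tight pairs and one vector far from everything else.
vectors = [
    [1.00, 0.00, 0.00],
    [0.98, 0.05, 0.00],
    [0.00, 1.00, 0.00],
    [0.02, 0.99, 0.00],
    [0.00, 0.00, 1.00],
]
labels = cluster_embeddings(vectors, eps=0.20, min_samples=2)
print(labels.tolist())  # [0, 0, 1, 1, -1]
```

With min_samples set to 2, any two files within eps of each other form a cluster, and a file with no close neighbor gets the noise label -1.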

Each cluster is assigned a human-readable name generated by running TF-IDF on the combined text content of the cluster's files, extracting the top-2 most distinctive terms.

Phase 3: Human-in-the-Loop (src/hitl/)

Coming soon. This phase will present the proposed folder structure for interactive review before files are moved.


Configuration

Currently, model parameters are configured in-code:

| Parameter | Location | Default | Description |
|---|---|---|---|
| chunkSize | fileHash.py | 4194304 (4 MB) | Byte chunk size for SHA-256 hashing |
| txtEps | dbscanModel.py | 0.20 | DBSCAN epsilon for text clusters |
| txtMinPts | dbscanModel.py | 2 | DBSCAN minimum points for text clusters |
| imgEps | dbscanModel.py | 0.45 | DBSCAN epsilon for image clusters |
| imgMinPts | dbscanModel.py | 2 | DBSCAN minimum points for image clusters |
| top_n_words | dirNaming.py | 2 | Number of TF-IDF terms used in folder names |
| PageSize | dbscanModel.py | 30 | Qdrant scroll page size |

Roadmap

  • Recursive directory traversal
  • SHA-256 chunked deduplication
  • Multi-format content extraction (text, PDF, images)
  • Local text & image embedding generation
  • Qdrant vector storage
  • DBSCAN semantic clustering
  • TF-IDF auto-naming for directories
  • Typer CLI with Rich output
  • Human-in-the-Loop interactive review
  • Config file support (YAML/TOML)
  • Cross-modal clustering (text + image in unified space)
  • Undo / dry-run mode
  • Progress bars and summary statistics
  • Additional file format support (.docx, .xlsx, .pptx)

Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit your changes: git commit -m "Add your feature"
  4. Push to the branch: git push origin feature/your-feature
  5. Open a Pull Request.

Please ensure your code follows the existing style and includes appropriate documentation.


License

This project is licensed under the MIT License. See the LICENSE file for details.


Built with ❤️ for anyone drowning in unorganized files.
